The Independent Eval Harness.
Skill by Skill.

Real repos. Real tools. Deterministic rules. Ensemble judges with recusal. No vendor bias.

3 Models Tested
4 Skills
12 Total Runs

Latest Evals

Frontier Model Evals

Three frontier models, five agentic skills: all pass at 100%

Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro on five agentic coding skills with iterative test execution. Opus won on quality at 0.979 for $10.33. Gemini won on value at 0.952 for $2.98. GPT-5.4 sat in the middle and used 2-3× more tool rounds on the concurrency work, a cost that the quality scores alone hide.

Read the full report →

How Scoring Works

1. Generate. Each model gets the same prompt (skill instructions plus test input) and produces its complete output in one shot at temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains the required elements. Binary, reproducible, free to run.
3. Ensemble Judge (40%). Three frontier models independently grade each output on relevance, completeness, and accuracy. Provider recusal prevents self-grading bias. The median score wins; a sketch of how the two components combine follows below.
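
To make the arithmetic concrete, here is a minimal Python sketch of how the two components could be blended. Only the 60/40 weighting, the deterministic pass/fail rules, provider recusal, and the median over judge grades come from the description above; the rule patterns, the function names, and the choice to score the rules component as a fraction of checks passed (rather than all-or-nothing) are illustrative assumptions, not the harness's actual code.

    import re
    import statistics
    from dataclasses import dataclass

    # Placeholder rules: each is a regex the output must match. The real
    # harness's rules are skill-specific and not shown here.
    RULES = [
        r"def test_",   # defines at least one test function
        r"assert ",     # makes at least one assertion
        r"import ",     # imports what it needs
    ]

    @dataclass
    class JudgeGrade:
        provider: str   # e.g. "anthropic", "openai", "google"
        score: float    # 0.0-1.0 grade on relevance, completeness, accuracy

    def rules_score(output: str, rules=RULES) -> float:
        # Deterministic component: fraction of required patterns present.
        hits = sum(1 for pattern in rules if re.search(pattern, output))
        return hits / len(rules)

    def judge_score(grades: list[JudgeGrade], model_provider: str) -> float:
        # Recusal: drop any grader from the same provider as the model
        # under test, then take the median of the remaining grades.
        eligible = [g.score for g in grades if g.provider != model_provider]
        return statistics.median(eligible)

    def final_score(output: str, grades: list[JudgeGrade], model_provider: str) -> float:
        # Weighted blend: 60% deterministic rules, 40% ensemble judge.
        return 0.6 * rules_score(output) + 0.4 * judge_score(grades, model_provider)

For example, an output that hits all three placeholder rules (rules score 1.0) and draws judge grades of 0.90 and 0.95 after recusal lands at 0.6 × 1.0 + 0.4 × 0.925 = 0.97.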

Stay ahead of the AI curve

New benchmarks, model insights, and AI engineering deep dives.