The Independent Eval Harness.
Skill by Skill.

Real repos. Real tools. Deterministic rules. Ensemble judges with recusal. No vendor bias.

3 Models Tested
4 Skills
12 Total Runs

Latest Evals

Frontier Model Evals

Three frontier models, five agentic skills: all pass at 100%

Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro on five agentic coding skills with iterative test execution. Opus won on quality at 0.979 for $10.33. Gemini won on value at 0.952 for $2.98. GPT-5.4 sat in the middle and used 2-3× more tool rounds on the concurrency work, a cost that the quality scores alone hide.

Read the full report →

How Scoring Works

1. Generate. Each model gets the same prompt (skill instructions plus test input) and produces its complete output in one shot at temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains the required elements. Binary, reproducible, free to run.
3. Ensemble Judge (40%). Three frontier models independently grade each output on relevance, completeness, and accuracy. Provider recusal prevents self-grading bias. The median score wins; a sketch of how the two components combine follows below.
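
To make the arithmetic concrete, here is a minimal Python sketch of how the two components could be blended. Only the 60/40 weighting, the deterministic pass/fail rules, provider recusal, and the median over judge grades come from the description above; the rule patterns, the function names, and the choice to score the rules component as a fraction of checks passed (rather than all-or-nothing) are illustrative assumptions, not the harness's actual code.

    import re
    import statistics
    from dataclasses import dataclass

    # Placeholder rules: each is a regex the output must match. The real
    # harness's rules are skill-specific and not shown here.
    RULES = [
        r"def test_",   # defines at least one test function
        r"assert ",     # makes at least one assertion
        r"import ",     # imports what it needs
    ]

    @dataclass
    class JudgeGrade:
        provider: str   # e.g. "anthropic", "openai", "google"
        score: float    # 0.0-1.0 grade on relevance, completeness, accuracy

    def rules_score(output: str, rules=RULES) -> float:
        # Deterministic component: fraction of required patterns present.
        hits = sum(1 for pattern in rules if re.search(pattern, output))
        return hits / len(rules)

    def judge_score(grades: list[JudgeGrade], model_provider: str) -> float:
        # Recusal: drop any grader from the same provider as the model
        # under test, then take the median of the remaining grades.
        eligible = [g.score for g in grades if g.provider != model_provider]
        return statistics.median(eligible)

    def final_score(output: str, grades: list[JudgeGrade], model_provider: str) -> float:
        # Weighted blend: 60% deterministic rules, 40% ensemble judge.
        return 0.6 * rules_score(output) + 0.4 * judge_score(grades, model_provider)

For example, an output that hits all three placeholder rules (rules score 1.0) and draws judge grades of 0.90 and 0.95 after recusal lands at 0.6 × 1.0 + 0.4 × 0.925 = 0.97.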

Stay ahead of the AI curve

New benchmarks, model insights, and AI engineering deep dives.