The Independent Eval Harness.
Skill by Skill.
Real repos. Real tools. Deterministic rules. Ensemble judges with recusal. No vendor bias.
3 Models Tested
4 Skills
12 Total Runs
Latest Evals
Frontier Model Evals
Three frontier models, five agentic skills: all pass 100%
Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro on five agentic coding skills with iterative test execution. Opus won quality at 0.979 for $10.33. Gemini won value at 0.952 for $2.98. GPT-5.4 sat in the middle, and it used 2-3× more tool rounds on concurrency work, a cost that the quality scores alone hide.
Read the full report →
How Scoring Works
1. Generate. Each model gets the same prompt (skill instructions plus test input) and produces its complete output in one shot. Temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains required elements. Binary, reproducible, free to run.
3. Ensemble Judge (40%). Three frontier models independently grade each output on relevance, completeness, and accuracy. Provider recusal prevents self-grading bias. The median score wins.