Independent AI Benchmarks
Models, skills, and tasks tested head-to-head. Deterministic rules. Ensemble judges. No vendor bias.
3 Models Tested
4 Skills
12 Total Runs
Latest Evals
Frontier Model Evals
Claude Opus 4.6 scores highest at 0.825
Three frontier models, four skills, ensemble judges. Claude Opus led at 0.825 but cost $1.03. GPT-5.4 scored 0.814 for $0.48. Gemini was cheapest at $0.31 but the only model to miss a pass threshold. The gap between first and third: 0.057 points.
Read the full report →
How Scoring Works
1. Generate. Each model gets the same prompt — skill instructions plus test input — and produces its complete output in one shot. Temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains required elements. Binary, reproducible, free to run.
3. Ensemble Judge (40%). Three frontier models independently grade each output on relevance, completeness, and accuracy. Provider recusal prevents self-grading bias. Median score wins.