Independent AI Benchmarks

Models, skills, and tasks tested head-to-head. Deterministic rules. Ensemble judges. No vendor bias.

3 Models Tested
4 Skills
12 Total Runs

Latest Evals

Frontier Model Evals

Claude Opus 4.6 scores highest at 0.825

Three frontier models, four skills, ensemble judges. Claude Opus led at 0.825 but cost $1.03. GPT-5.4 scored 0.814 for $0.48. Gemini was cheapest at $0.31 but the only model to miss a pass threshold. The gap between first and third: 0.057 points.

Read the full report →

How Scoring Works

1. Generate. Each model gets the same prompt (skill instructions plus test input) and produces its complete output in one shot. Temperature 0.0.
2. Rules (60%). Deterministic pattern matching checks whether the output contains required elements. Binary, reproducible, free to run.
3. Ensemble Judge (40%). Three frontier models independently grade each output on relevance, completeness, and accuracy. Provider recusal prevents self-grading bias. Median score wins. (A sketch of how the two parts combine follows this list.)
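
To make the 60/40 weighting concrete, here is a minimal Python sketch of how a composite score could be assembled from steps 2 and 3. The function names, regex patterns, provider keys, and example grades are illustrative assumptions, not the benchmark's actual implementation.

```python
import re
import statistics

RULES_WEIGHT = 0.6   # deterministic-rules share of the final score
JUDGE_WEIGHT = 0.4   # ensemble-judge share of the final score


def rules_score(output: str, required_patterns: list[str]) -> float:
    """Fraction of required regex patterns found in the output.
    Each check is binary pass/fail, so the result is reproducible and free to run."""
    if not required_patterns:
        return 0.0
    hits = sum(1 for pattern in required_patterns if re.search(pattern, output))
    return hits / len(required_patterns)


def judge_score(judge_grades: dict[str, float], model_provider: str) -> float:
    """Median of the judges' grades, after dropping the judge from the same
    provider as the model under test (provider recusal)."""
    eligible = [grade for provider, grade in judge_grades.items()
                if provider != model_provider]
    return statistics.median(eligible)


def composite_score(output: str, required_patterns: list[str],
                    judge_grades: dict[str, float], model_provider: str) -> float:
    """Weighted blend: 60% deterministic rules, 40% ensemble judge."""
    return (RULES_WEIGHT * rules_score(output, required_patterns)
            + JUDGE_WEIGHT * judge_score(judge_grades, model_provider))


# Example run with made-up numbers: two required patterns, three judge grades,
# and the same-provider judge recused because the tested model shares its provider.
score = composite_score(
    output="## Summary\nThe refactor removes the N+1 query.",
    required_patterns=[r"## Summary", r"N\+1"],
    judge_grades={"anthropic": 0.90, "openai": 0.80, "google": 0.70},
    model_provider="anthropic",
)
print(f"composite score: {score:.3f}")   # 0.6*1.0 + 0.4*0.75 = 0.900
```

In this sketch the rules portion is a simple hit rate over required patterns, and the judge portion is the median of the non-recused grades, so a single outlier judge cannot swing the result.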

Stay ahead of the AI curve

New benchmarks, model insights, and AI engineering deep dives.