
Frontier Model Showdown

Claude Opus 4.6 vs GPT-5.4 vs Gemini 3.1 Pro

GPT-5.4 dropped yesterday. That means we now have fresh frontier models from all three major providers sitting side by side. So we did what any reasonable person would do: we built a testing rig and ran them head-to-head on real software engineering tasks.

This isn't a vibes-based comparison. Every model got exactly the same prompts, was evaluated against the same deterministic rules, and was scored by an ensemble of all three frontier models acting as judges. No model grades its own homework unchecked.
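The harness behind that claim is simple to sketch. In this minimal Python version, `models`, `judges`, and `rule_check` are hypothetical stand-ins for the real API calls and deterministic checks, which this post doesn't show:

```python
from statistics import median

def evaluate(models, judges, prompt, rule_check):
    """Run one prompt through every model, then score each output with
    the same deterministic rules plus the median of all judge scores."""
    results = {}
    for name, run_model in models.items():
        output = run_model(prompt)                # identical prompt for every model
        judge_scores = [score(output) for score in judges.values()]
        results[name] = {
            "rules": rule_check(output),          # same deterministic checks for all
            "judged": median(judge_scores),       # no single judge's score dominates
        }
    return results
```

The median over three judges is what keeps any one model's grading habits from steering the result.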

The Setup

Model              Provider    Release
Claude Opus 4.6    Anthropic   Current
GPT-5.4            OpenAI      March 5, 2026
Gemini 3.1 Pro     Google      Current

Frontend Design

Responsive dashboard with animations, semantic HTML. Tests layout, component architecture, and CSS implementation.

MCP Server Builder

TypeScript MCP server with Zod schemas, auth, and pagination. Tests protocol compliance and type safety.

Skill Creator

Write a SKILL.md with frontmatter, examples, and edge cases. Tests structured documentation and specification writing.

Excel Generator

3-year SaaS financial model with openpyxl and formulas. Tests numerical reasoning and spreadsheet architecture.

We chose these four specifically because they cover the kinds of software engineering work where AI agents are most likely to be deployed: building user-facing interfaces, implementing protocol-level server code, writing structured specifications, and modeling numbers under formula constraints.
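Under the hood, each of these reduces to a prompt plus pass criteria. A minimal sketch of how a skill might be represented; the field names and the 0.70 threshold are our own illustration, not the rig's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    prompt: str
    pass_threshold: float = 0.70  # assumed cutoff; the post implies one but never states it

SKILLS = (
    Skill("Frontend Design", "Build a responsive dashboard with animations and semantic HTML."),
    Skill("MCP Server Builder", "Implement a TypeScript MCP server with Zod schemas, auth, and pagination."),
    Skill("Skill Creator", "Write a SKILL.md with frontmatter, examples, and edge cases."),
    Skill("Excel Generator", "Build a 3-year SaaS financial model with openpyxl and formulas."),
)
```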

Results

Model              Avg Score   Pass Rate   Total Cost   Avg Tokens
Claude Opus 4.6    0.825       4/4         $1.03        8,955
GPT-5.4            0.814       4/4         $0.48        6,042
Gemini 3.1 Pro     0.768       3/4         $0.31        4,009

Claude Opus scored highest but cost the most. Gemini was cheapest but missed the pass threshold on one skill. GPT-5.4 landed in the middle on both quality and cost.
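The headline figures are a straightforward aggregation of per-skill scores. A sketch using Gemini's rounded scores from the breakdown below, with an assumed pass threshold of 0.70 (the results imply a cutoff between 0.67 and 0.75 but never state it):

```python
def summarize(skill_scores, threshold=0.70):
    """Average score and pass rate across skills for one model.
    The threshold is an assumption; the post never states the exact cutoff."""
    avg = sum(skill_scores.values()) / len(skill_scores)
    passed = sum(1 for s in skill_scores.values() if s >= threshold)
    return avg, passed

gemini = {"Frontend Design": 0.75, "MCP Server Builder": 0.67,
          "Skill Creator": 0.86, "Excel Generator": 0.79}
avg, passed = summarize(gemini)
print(f"avg={avg:.3f} passed={passed}/4")
```

With the rounded per-skill values this lands within a thousandth of the reported 0.768 average, and the single sub-threshold score (0.67) accounts for the 3/4 pass rate.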

Skill-by-Skill Breakdown

Skill                Claude Opus 4.6   GPT-5.4   Gemini 3.1 Pro
Frontend Design      0.75              0.75      0.75
MCP Server Builder   0.83              0.79      0.67
Skill Creator        0.88              0.92      0.86
Excel Generator      0.83              0.79      0.79

What We Learned

Frontend Design — dead heat at 0.75

All three models scored identically. Layout, semantic HTML, and responsive CSS are well within every frontier model's capability. The differentiation happens elsewhere.

MCP Server Builder — Claude leads

Claude scored 0.83 vs 0.79 (GPT) and 0.67 (Gemini). Protocol-specific tasks with Zod schemas and auth patterns favored Claude's structured output style. Gemini's lower score here contributed to its 3/4 pass rate.

Skill Creator — GPT-5.4 wins

GPT-5.4 scored highest at 0.92, ahead of Claude (0.88) and Gemini (0.86). Documentation and specification writing benefited from GPT's concise, well-organized output style.

Excel Generator — Claude leads again

Claude scored 0.83 while GPT and Gemini tied at 0.79. The financial modeling task required numerical reasoning and formula architecture where Claude's verbose, detail-oriented approach paid off.

The Economics

Model              Inference Cost   Judge Cost   Total   Cost per Skill
Claude Opus 4.6    $0.49            $0.14        $0.63   $0.16
GPT-5.4            $0.17            $0.14        $0.31   $0.08
Gemini 3.1 Pro     $0.12            $0.14        $0.26   $0.06

Judge costs are nearly identical ($0.14 each) because every model's output gets judged by the same three models, regardless of which model produced it. Inference cost is where the gap appears.
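That accounting makes the Total column a one-liner: each model's own inference plus a fixed judging pass. A sketch, taking the $0.14 judge figure from the table (real pricing logic would work from token counts):

```python
JUDGE_COST = 0.14  # per model, from the table: the same three judges score every output

def total_cost(inference_cost, judge_cost=JUDGE_COST):
    """Cost to benchmark one model: its own inference plus the fixed judging pass."""
    return round(inference_cost + judge_cost, 2)

assert total_cost(0.49) == 0.63  # Claude Opus 4.6
assert total_cost(0.17) == 0.31  # GPT-5.4
assert total_cost(0.12) == 0.26  # Gemini 3.1 Pro
```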

Claude Opus costs 3.3x more than Gemini while scoring only 7% higher. GPT-5.4 sits in the middle on both cost and quality. Total benchmark cost: $1.82 for 12 model runs.

Ensemble Judging: Do Models Agree?

High agreement (spread < 0.10): the judges closely agreed on the tasks with the most objective criteria. A formula either computes the right number or it doesn't, and a schema either validates or it doesn't.

Moderate disagreement (spread 0.15–0.25): the open-ended writing and design criteria drew the most divergence. Claude-as-judge consistently scored higher than GPT-as-judge, particularly on completeness.

Notable pattern: Claude-as-judge was the most generous scorer overall, while GPT-as-judge was the strictest. Gemini fell in the middle. This is why the median strategy matters — it prevents any one model's scoring tendencies from dominating.
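Both numbers in this section come straight from the three judge scores: spread is the max-minus-min gap, and the reported score is the median. A minimal sketch with illustrative values (a generous judge at 0.90, a strict one at 0.70):

```python
from statistics import median

def judge_stats(scores):
    """Median keeps one generous or strict judge from dominating;
    spread (max minus min) measures how much the judges disagreed."""
    return median(scores), max(scores) - min(scores)

score, spread = judge_stats([0.90, 0.70, 0.80])
assert score == 0.80             # the middle judge's score is reported
assert round(spread, 2) == 0.20  # lands in the "moderate disagreement" band
```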

What This Means

Best quality: Claude Opus 4.6 scored highest (0.825) with the most detailed, structured outputs. Its verbose style is a strength on MCP and Excel tasks where thoroughness matters. But it costs 3.3x more than Gemini.

Best value: Gemini 3.1 Pro at $0.31 total is the cheapest option. But it missed the pass threshold on MCP Server Builder (0.67), making it the only model that didn't pass all four skills.

Middle ground: GPT-5.4 passed everything at 0.814 for $0.48. Won Skill Creator outright. Balanced cost and quality.

The spread between first and third is 0.057 points. All three are production-viable for most tasks. The choice comes down to your constraints — budget, pass rate requirements, or preference for output style.

Caveats

  • Single run per model per skill. Production benchmarks should average over multiple runs. We ran at temperature 0.0 for maximum reproducibility.
  • Software engineering only. Results may differ for creative writing, math, or other domains.
  • Truncation penalty. All models lost judge points on completeness due to output truncation on ambitious tasks.
  • Pricing is a snapshot. Model costs change. Rankings reflect March 2026 pricing.
  • Shared blind spots. All three models judging each other reduces individual bias but doesn't eliminate systematic blind spots they may share.