LLM Benchmarks
Verified scores across 9 frontier models, 7 benchmarks. The evidence behind Costa OS routing decisions.
Model Comparison
All 9 models across all 7 benchmarks. Click a column header to sort.
| Model | SWE-bench | GPQA | AIME | ARC-AGI 2 | Arena Elo | WebDev Elo | Terminal | Price (in/out) |
|---|---|---|---|---|---|---|---|---|
| Opus 4.6 (Anthropic) | 80.8% | 91.3% | 99.8% | 68.8% | 1,504 | 1,549 | — | $15.00 / $75.00 |
| Gemini 3.1 Pro (Google) | 80.6% | 94.3% | 100.0% | 77.1% | 1,493 | 1,455 | — | $2.00 / $12.00 |
| Gemini 3 Pro (Google) | 76.2% | 91.9% | 100.0% | — | 1,486 | 1,438 | — | $2.00 / $12.00 |
| GPT-5.4 (OpenAI) | 77.0% | 92.0% | 100.0% | 73.3% | 1,484 | 1,457 | 75.1% | $2.50 / $15.00 |
| Gemini 3 Flash (Google) | 78.0% | 90.4% | 99.7% | — | 1,474 | 1,437 | — | $0.50 / $3.00 |
| Sonnet 4.6 (Anthropic) | 79.6% | 87.4% | 83.0% | 58.3% | 1,463 | 1,523 | — | $3.00 / $15.00 |
| GPT-5.2 (OpenAI) | 80.0% | 92.4% | 100.0% | 52.9% | 1,440 | — | — | $1.75 / $14.00 |
| o3 (OpenAI) | 71.7% | 87.7% | 91.6% | 88.0% | 1,431 | — | — | $2.00 / $8.00 |
| Codex (OpenAI) | 56.8% | 91.5% | — | — | — | — | 77.3% | $1.75 / $14.00 |
A dash (—) means the model was not evaluated on that benchmark.
How Costa OS Routes Tasks
The ai-router selects a model per task class based on benchmark evidence. Here is what drives each decision.
- Gemini 3 Flash: Highest GPQA Diamond score for a fraction of the cost. Flash hits 90.4% while Gemini 3.1 Pro tops the chart at 94.3%, both beating Claude and OpenAI. At $0.50/M input, it is the cheapest frontier model.
- Claude Opus 4.6: Dominates WebDev Arena with Elo 1,549, ahead of Sonnet 4.6 (1,523); no other provider cracks the top two on this benchmark. Sonnet 4.6 serves budget builds.
- GPT-5.3 Codex: The only model with verified Terminal-Bench 2.0 scores, at 77.3% on autonomous shell tasks vs. 75.1% for GPT-5.4. Purpose-built for agentic CLI work, with mid-tier pricing at $1.75/M input.
- Claude Opus 4.6: Highest Chatbot Arena Elo at 1,504, the most trusted signal of real-world user preference, leading Gemini 3.1 Pro by 11 points. Also offers the richest MCP tool integrations.
- Gemini 3 Flash: Best cost-quality ratio of any frontier model, with SWE-bench 78.0%, GPQA 90.4%, and AIME 99.7%, competitive with models 4× more expensive, at the fastest measured speed (200 tok/s).
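The decisions above amount to a static task-class-to-model mapping. A minimal sketch of such a table, using hypothetical task-class and model identifiers (the actual Costa OS ai-router config and naming may differ):

```python
# Hypothetical routing table derived from the benchmark evidence above.
# Task-class keys and model identifiers are illustrative, not the real
# Costa OS ai-router configuration.
ROUTING_TABLE = {
    "research_qa": "gemini-3-flash",    # GPQA 90.4% at $0.50/M input
    "webdev":      "claude-opus-4.6",   # WebDev Arena Elo 1,549
    "terminal":    "gpt-5.3-codex",     # Terminal-Bench 2.0: 77.3%
    "chat":        "claude-opus-4.6",   # Chatbot Arena Elo 1,504
    "general":     "gemini-3-flash",    # best cost-quality ratio
}

def route(task_class: str, default: str = "gemini-3-flash") -> str:
    """Return the default model for a task class; fall back on unknown classes."""
    return ROUTING_TABLE.get(task_class, default)
```

A lookup table like this keeps routing deterministic and auditable: every choice traces back to a benchmark row rather than a heuristic.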
Cost vs. Quality
GPQA Diamond score plotted against input cost, with a log-scaled x-axis. Models toward the bottom-left offer the best cost-quality balance; Gemini 3 Flash stands out.
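The "best value" claim can be made concrete as a Pareto-frontier check over (input cost, GPQA score): a model is dominated if some other model is at least as cheap and at least as accurate, and strictly better on one axis. A minimal sketch using numbers from the comparison table:

```python
# Pareto-frontier check over (input $/M tokens, GPQA %) for a subset of
# the comparison table. A model survives if no other model is at least
# as cheap AND at least as accurate, with a strict edge on one axis.
models = {
    "gemini-3.1-pro": (2.00, 94.3),
    "gemini-3-flash": (0.50, 90.4),
    "gpt-5.4":        (2.50, 92.0),
    "opus-4.6":       (15.00, 91.3),
    "sonnet-4.6":     (3.00, 87.4),
}

def pareto(models):
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for n, (c, s) in models.items() if n != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)
```

On these five rows only Gemini 3 Flash and Gemini 3.1 Pro survive: everything else is beaten on both cost and GPQA by at least one of them.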
Benchmark Deep Dives
Expand any benchmark to see scores for all evaluated models and the primary source.
Methodology
How data was collected, limitations, and how it connects to the router.
Data Collection
Scores are taken directly from official model announcements, provider benchmark pages, and independent third-party leaderboards (LMSYS, ARC Prize, SWE-bench.com). No interpolation or normalization is applied. Raw numbers only.
Source Selection
We prioritize benchmarks with public reproducibility (SWE-bench Verified, ARC-AGI 2), large-scale human preference signals (Chatbot Arena, WebDev Arena), and domain-specific tasks relevant to OS-level AI routing.
Missing Data
A dash (—) means the model was not evaluated on that benchmark or results were not publicly reported at time of collection. We do not estimate or impute missing values.
Routing Integration
These scores directly inform the Costa OS ai-router. The router uses benchmark winners per task category to select the default model, with cost and speed as tiebreakers. Routing config lives in ai-router/router.py.
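The winner-plus-tiebreaker rule described above can be sketched as a single keyed `max`: highest benchmark score first, then lower input cost, then higher measured speed. This is an illustrative sketch, not the actual contents of `ai-router/router.py`; the `Candidate` type and field names are assumptions.

```python
# Hypothetical sketch of winner selection with cost and speed tiebreakers.
# Not the actual ai-router/router.py implementation.
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    score: float    # benchmark score for the task's category
    cost_in: float  # $ per million input tokens (lower is better)
    speed: float    # measured tokens per second (higher is better)

def select(candidates: list[Candidate]) -> str:
    # Sort key: maximize score, then minimize cost, then maximize speed.
    best = max(candidates, key=lambda c: (c.score, -c.cost_in, c.speed))
    return best.name
```

Encoding the tiebreakers in one tuple key keeps the preference order explicit and easy to audit against the benchmark table.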
Limitations
- Not all models have been evaluated on all benchmarks. Coverage gaps are shown as dashes.
- Arena Elo scores fluctuate over time as more votes accumulate. Scores reflect the most recent published leaderboard.
- Benchmark performance does not guarantee real-world task quality. The router also weighs latency, cost, and context window for each request.
- The GPT-5.3 Codex SWE-bench score of 56.8% reflects the standard variant; Pro-variant scores are higher but were not publicly disclosed at time of writing.
Sources
All benchmark data comes directly from these public sources.
- SWE-bench Verified: verified subset requiring autonomous GitHub issue resolution
- Chatbot Arena (LMSYS): blind A/B human preference voting, millions of comparisons
- WebDev Arena: human votes on AI-generated web interfaces
- ARC-AGI 2: novel visual reasoning that tests generalization
- GPQA Diamond: expert-level multiple-choice Q&A
- Terminal-Bench 2.0: autonomous shell task completion in real Linux environments
- Third-party leaderboard: independent speed and cost measurements, AIME score aggregation
Benchmark scores feed directly into ai-router/router.py to select the optimal model for each task at runtime. Source code on GitHub.