Costa OS LLM Benchmarks

Verified scores across 9 frontier models, 7 benchmarks. The evidence behind Costa OS routing decisions.

Updated March 26, 2026
7 benchmarks · 9 models
Official sources only. No estimates

Model Comparison

All 9 models across all 7 benchmarks.

| Model | Provider | SWE-bench | GPQA | AIME | ARC-AGI 2 | Arena Elo | WebDev Elo | Terminal | Price (in / out, $/M tokens) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Opus 4.6 | Anthropic | **80.8%** | 91.3% | 99.8% | 68.8% | **1,504** | **1,549** | — | $15.00 / $75.00 |
| Gem 3.1 Pro | Google | 80.6% | **94.3%** | **100.0%** | 77.1% | 1,493 | 1,455 | — | $2.00 / $12.00 |
| Gem 3 Pro | Google | 76.2% | 91.9% | **100.0%** | — | 1,486 | 1,438 | — | $2.00 / $12.00 |
| GPT-5.4 | OpenAI | 77.0% | 92.0% | **100.0%** | 73.3% | 1,484 | 1,457 | 75.1% | $2.50 / $15.00 |
| Gem 3 Flash | Google | 78.0% | 90.4% | 99.7% | — | 1,474 | 1,437 | — | $0.50 / $3.00 |
| Sonnet 4.6 | Anthropic | 79.6% | 87.4% | 83.0% | 58.3% | 1,463 | 1,523 | — | $3.00 / $15.00 |
| GPT-5.2 | OpenAI | 80.0% | 92.4% | **100.0%** | 52.9% | 1,440 | — | — | $1.75 / $14.00 |
| o3 | OpenAI | 71.7% | 87.7% | 91.6% | **88.0%** | 1,431 | — | — | $2.00 / $8.00 |
| GPT-5.3 Codex | OpenAI | 56.8% | 91.5% | — | — | — | — | **77.3%** | $1.75 / $14.00 |

Bold marks the column winner. — means not evaluated.

How Costa OS Routes Tasks

The ai-router selects a model per task class based on benchmark evidence. Here is what drives each decision.

Math & Science · #1 overall
Gemini 3 Flash
90.4% GPQA Diamond

Best GPQA Diamond value: Flash hits 90.4% at a fraction of the cost, while Gemini 3.1 Pro tops the chart at 94.3%, both beating Claude and OpenAI.

$0.50/M input. Cheapest frontier model

Frontend / UI · #1 overall
Claude Opus 4.6
1,549 WebDev Arena Elo

Dominates WebDev Arena with Elo 1,549, ahead of Sonnet 4.6 (1,523). No other provider cracks the top two in this benchmark.

Claude Sonnet 4.6 at 1,523 for budget builds

DevOps / Terminal · #1 verified
GPT-5.3 Codex
77.3% Terminal-Bench 2.0

Tops Terminal-Bench 2.0 with a verified 77.3% on autonomous shell tasks, ahead of 75.1% for GPT-5.4, the only other model with a reported score. Purpose-built for agentic CLI work.

$1.75/M. Mid-tier pricing

General Assistant · #1 overall
Claude Opus 4.6
1,504 Chatbot Arena Elo

Highest Chatbot Arena Elo at 1,504, the most trusted signal for real-world user preference. Leads Gemini 3.1 Pro by 11 points.

MCP ecosystem: richest tool integrations

Budget / High Volume · Best value
Gemini 3 Flash
$0.50/M input cost

Best cost-quality ratio of any frontier model. SWE-bench 78.0%, GPQA 90.4%, AIME 99.7%, competitive with models 4× more expensive.

200 tok/s. Fastest measured speed
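
Taken together, the cards above amount to a small routing table. A minimal sketch of what that table could look like as a plain Python mapping; the category keys, model identifiers, and the `DEFAULT_MODEL` / `BUDGET_FALLBACK` names are illustrative placeholders, not the actual contents of ai-router/router.py:

```python
# Illustrative sketch only -- keys and model IDs are placeholders,
# not the real ai-router/router.py configuration.
DEFAULT_MODEL = {
    "math_science":    "gemini-3-flash",    # 90.4% GPQA Diamond at $0.50/M input
    "frontend_ui":     "claude-opus-4.6",   # WebDev Arena Elo 1,549
    "devops_terminal": "gpt-5.3-codex",     # Terminal-Bench 2.0: 77.3%
    "general":         "claude-opus-4.6",   # Chatbot Arena Elo 1,504
    "high_volume":     "gemini-3-flash",    # best cost-quality ratio, 200 tok/s
}

# Cheaper fallbacks noted on the cards, for when budget matters more than Elo.
BUDGET_FALLBACK = {
    "frontend_ui": "claude-sonnet-4.6",     # Elo 1,523 at $3.00/M input
}
```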

Cost vs. Quality

GPQA Diamond score plotted against input cost, with the cost axis log-scaled.

Models that combine a high score with a low input price offer the best cost-quality balance. Gemini 3 Flash stands out.
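
The chart can be rebuilt from the comparison table above. A short matplotlib sketch, using a subset of the table's cost and GPQA figures (the script is illustrative, not the page's actual charting code):

```python
import matplotlib.pyplot as plt

# (input $/M tokens, GPQA Diamond %) pairs taken from the comparison table
models = {
    "Opus 4.6":    (15.00, 91.3),
    "Gem 3.1 Pro": (2.00, 94.3),
    "GPT-5.4":     (2.50, 92.0),
    "Gem 3 Flash": (0.50, 90.4),
    "Sonnet 4.6":  (3.00, 87.4),
    "GPT-5.2":     (1.75, 92.4),
}

fig, ax = plt.subplots()
for name, (cost, gpqa) in models.items():
    ax.scatter(cost, gpqa)
    ax.annotate(name, (cost, gpqa), textcoords="offset points", xytext=(4, 4))

ax.set_xscale("log")                        # cost axis is log-scaled
ax.set_xlabel("Input cost ($ / M tokens)")
ax.set_ylabel("GPQA Diamond (%)")
plt.show()
```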

Benchmark Deep Dives

Expand any benchmark to see scores for all evaluated models and the primary source.

Methodology

How data was collected, limitations, and how it connects to the router.

Data Collection

Scores are taken directly from official model announcements, provider benchmark pages, and independent third-party leaderboards (LMSYS, ARC Prize, SWE-bench.com). No interpolation or normalization is applied. Raw numbers only.

Source Selection

We prioritize benchmarks with public reproducibility (SWE-bench Verified, ARC-AGI 2), large-scale human preference signals (Chatbot Arena, WebDev Arena), and domain-specific tasks relevant to OS-level AI routing.

Missing Data

A dash (—) means the model was not evaluated on that benchmark or results were not publicly reported at time of collection. We do not estimate or impute missing values.

Routing Integration

These scores directly inform the Costa OS ai-router. The router uses benchmark winners per task category to select the default model, with cost and speed as tiebreakers. Routing config lives in ai-router/router.py.
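
A hedged sketch of that selection rule, assuming each candidate is a plain dict of score, input cost, and measured speed; the function name, field names, and the example speed figures are invented for illustration and are not taken from ai-router/router.py:

```python
def pick_default(candidates: list[dict]) -> str:
    """Pick the default model for one task category.

    Highest benchmark score wins; ties fall back to lower input cost,
    then to higher measured speed. Missing scores are never imputed --
    a model without a reported score cannot win the category.
    """
    scored = [c for c in candidates if c["score"] is not None]
    best = max(scored, key=lambda c: (c["score"], -c["input_cost"], c["tok_per_s"]))
    return best["model"]


# Example: the General Assistant category, scored by Chatbot Arena Elo.
# Elo and input costs come from the table above; speeds are placeholders.
general = [
    {"model": "claude-opus-4.6", "score": 1504, "input_cost": 15.00, "tok_per_s": 80},
    {"model": "gemini-3.1-pro",  "score": 1493, "input_cost": 2.00,  "tok_per_s": 120},
    {"model": "gpt-5.4",         "score": 1484, "input_cost": 2.50,  "tok_per_s": 100},
]
print(pick_default(general))  # -> claude-opus-4.6
```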

Limitations

  • Not all models have been evaluated on all benchmarks. Coverage gaps are shown as dashes.
  • Arena Elo scores fluctuate over time as more votes accumulate. Scores reflect the most recent published leaderboard.
  • Benchmark performance does not guarantee real-world task quality. The router also weighs latency, cost, and context window for each request.
  • The GPT-5.3 Codex SWE-bench score of 56.8% is for the standard variant; the Pro variant scores higher, but its results were not publicly disclosed at the time of writing.

Sources

All benchmark data comes directly from public sources: official model announcements, provider benchmark pages, and the independent leaderboards cited above (LMSYS Chatbot Arena and WebDev Arena, ARC Prize, SWE-bench.com).

Benchmark scores feed directly into ai-router/router.py to select the optimal model for each task at runtime. Source code on GitHub.