LLM Benchmarks
Verified scores across 9 frontier models, 7 benchmarks. The evidence behind Costa OS routing decisions.
Model Comparison
All 9 models across all 7 benchmarks. Click a column header to sort.
| Model | SWE-bench | GPQA | AIME | ARC-AGI 2 | Arena Elo | WebDev Elo | Terminal | Price (in/out) |
|---|---|---|---|---|---|---|---|---|
| Opus 4.6 (Anthropic) | 80.8% | 91.3% | 99.8% | 68.8% | 1,504 | 1,549 | — | $15.00 / $75.00 |
| Gemini 3.1 Pro (Google) | 80.6% | 94.3% | 100.0% | 77.1% | 1,493 | 1,455 | — | $2.00 / $12.00 |
| Gemini 3 Pro (Google) | 76.2% | 91.9% | 100.0% | — | 1,486 | 1,438 | — | $2.00 / $12.00 |
| GPT-5.4 (OpenAI) | 77.0% | 92.0% | 100.0% | 73.3% | 1,484 | 1,457 | 75.1% | $2.50 / $15.00 |
| Gemini 3 Flash (Google) | 78.0% | 90.4% | 99.7% | — | 1,474 | 1,437 | — | $0.50 / $3.00 |
| Sonnet 4.6 (Anthropic) | 79.6% | 87.4% | 83.0% | 58.3% | 1,463 | 1,523 | — | $3.00 / $15.00 |
| GPT-5.2 (OpenAI) | 80.0% | 92.4% | 100.0% | 52.9% | 1,440 | — | — | $1.75 / $14.00 |
| o3 (OpenAI) | 71.7% | 87.7% | 91.6% | 88.0% | 1,431 | — | — | $2.00 / $8.00 |
| Codex (OpenAI) | 56.8% | 91.5% | — | — | — | — | 77.3% | $1.75 / $14.00 |
A dash (—) means the model was not evaluated on that benchmark.
How Costa OS Routes Tasks
The ai-router selects a model per task class based on benchmark evidence. Here is what drives each decision.
- Gemini 3 Flash: Highest GPQA Diamond score for a fraction of the cost. Flash hits 90.4% while Gemini 3.1 Pro tops the chart at 94.3%, both beating Claude and OpenAI. At $0.50/M input, it is the cheapest frontier model.
- Claude Opus 4.6: Dominates WebDev Arena with Elo 1,549, ahead of Sonnet 4.6 (1,523); no other provider cracks the top two on this benchmark. Sonnet 4.6 serves budget builds.
- GPT-5.3 Codex: The only model with verified Terminal-Bench 2.0 scores, at 77.3% on autonomous shell tasks vs. 75.1% for GPT-5.4. Purpose-built for agentic CLI work, with mid-tier pricing at $1.75/M input.
- Claude Opus 4.6: Highest Chatbot Arena Elo at 1,504, the most trusted signal of real-world user preference, leading Gemini 3.1 Pro by 11 points. Also offers the richest MCP tool integrations.
- Gemini 3 Flash: Best cost-quality ratio of any frontier model, with SWE-bench 78.0%, GPQA 90.4%, and AIME 99.7%, competitive with models 4× more expensive, at the fastest measured speed (200 tok/s).
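The decisions above amount to a static task-class-to-model mapping. A minimal sketch of such a table, using hypothetical task-class and model identifiers (the actual Costa OS ai-router config and naming may differ):

```python
# Hypothetical routing table derived from the benchmark evidence above.
# Task-class keys and model identifiers are illustrative, not the real
# Costa OS ai-router configuration.
ROUTING_TABLE = {
    "research_qa": "gemini-3-flash",    # GPQA 90.4% at $0.50/M input
    "webdev":      "claude-opus-4.6",   # WebDev Arena Elo 1,549
    "terminal":    "gpt-5.3-codex",     # Terminal-Bench 2.0: 77.3%
    "chat":        "claude-opus-4.6",   # Chatbot Arena Elo 1,504
    "general":     "gemini-3-flash",    # best cost-quality ratio
}

def route(task_class: str, default: str = "gemini-3-flash") -> str:
    """Return the default model for a task class; fall back on unknown classes."""
    return ROUTING_TABLE.get(task_class, default)
```

A lookup table like this keeps routing deterministic and auditable: every choice traces back to a benchmark row rather than a heuristic.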
Cost vs. Quality
GPQA Diamond score plotted against input cost, with a log-scaled x-axis. Models toward the bottom-left offer the best cost-quality balance; Gemini 3 Flash stands out.
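The "best value" claim can be made concrete as a Pareto-frontier check over (input cost, GPQA score): a model is dominated if some other model is at least as cheap and at least as accurate, and strictly better on one axis. A minimal sketch using numbers from the comparison table:

```python
# Pareto-frontier check over (input $/M tokens, GPQA %) for a subset of
# the comparison table. A model survives if no other model is at least
# as cheap AND at least as accurate, with a strict edge on one axis.
models = {
    "gemini-3.1-pro": (2.00, 94.3),
    "gemini-3-flash": (0.50, 90.4),
    "gpt-5.4":        (2.50, 92.0),
    "opus-4.6":       (15.00, 91.3),
    "sonnet-4.6":     (3.00, 87.4),
}

def pareto(models):
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for n, (c, s) in models.items() if n != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)
```

On these five rows only Gemini 3 Flash and Gemini 3.1 Pro survive: everything else is beaten on both cost and GPQA by at least one of them.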
Benchmark Deep Dives
Expand any benchmark to see scores for all evaluated models and the primary source.
Methodology
How data was collected, limitations, and how it connects to the router.
Data Collection
Scores are taken directly from official model announcements, provider benchmark pages, and independent third-party leaderboards (LMSYS, ARC Prize, SWE-bench.com). No interpolation or normalization is applied. Raw numbers only.
Source Selection
We prioritize benchmarks with public reproducibility (SWE-bench Verified, ARC-AGI 2), large-scale human preference signals (Chatbot Arena, WebDev Arena), and domain-specific tasks relevant to OS-level AI routing.
Missing Data
A dash (—) means the model was not evaluated on that benchmark or results were not publicly reported at time of collection. We do not estimate or impute missing values.
Routing Integration
These scores directly inform the Costa OS ai-router. The router uses benchmark winners per task category to select the default model, with cost and speed as tiebreakers. Routing config lives in ai-router/router.py.
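The winner-plus-tiebreaker rule described above can be sketched as a single keyed `max`: highest benchmark score first, then lower input cost, then higher measured speed. This is an illustrative sketch, not the actual contents of `ai-router/router.py`; the `Candidate` type and field names are assumptions.

```python
# Hypothetical sketch of winner selection with cost and speed tiebreakers.
# Not the actual ai-router/router.py implementation.
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    score: float    # benchmark score for the task's category
    cost_in: float  # $ per million input tokens (lower is better)
    speed: float    # measured tokens per second (higher is better)

def select(candidates: list[Candidate]) -> str:
    # Sort key: maximize score, then minimize cost, then maximize speed.
    best = max(candidates, key=lambda c: (c.score, -c.cost_in, c.speed))
    return best.name
```

Encoding the tiebreakers in one tuple key keeps the preference order explicit and easy to audit against the benchmark table.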
Limitations
- Not all models have been evaluated on all benchmarks. Coverage gaps are shown as dashes.
- Arena Elo scores fluctuate over time as more votes accumulate. Scores reflect the most recent published leaderboard.
- Benchmark performance does not guarantee real-world task quality. The router also weighs latency, cost, and context window for each request.
- The GPT-5.3 Codex SWE-bench score of 56.8% reflects the standard variant; Pro-variant scores are higher but were not publicly disclosed at time of writing.
Sources
All benchmark data comes directly from these public sources.
- SWE-bench Verified: verified subset requiring autonomous GitHub issue resolution
- Chatbot Arena (LMSYS): blind A/B human preference voting, millions of comparisons
- WebDev Arena: human votes on AI-generated web interfaces
- ARC-AGI 2: novel visual reasoning that tests generalization
- GPQA Diamond: expert-level multiple-choice Q&A
- Terminal-Bench 2.0: autonomous shell task completion in real Linux environments
- Third-party leaderboard: independent speed and cost measurements, AIME score aggregation
Benchmark scores feed directly into ai-router/router.py to select the optimal model for each task at runtime. Source code on GitHub.