Multi-Model Routing in Production | Working with Large Language Models | Synoros

A router is the shrug you make when someone hands you a mixed bag of questions, some easy, some hard, and asks whether to use the same tool on all of them.

Here is the posture this chapter argues against. You started building on the OpenAI API in 2023. You picked one strong model. You send everything to it. System prompt, user turn, retrieved context, tool schema, the whole payload, into GPT-5.4 or Claude Sonnet 4.6 or Gemini 3.1 Pro. Whatever comes back is what you ship. The code is simple. The billing is legible. Nothing to argue about in a code review.

It is also, for almost any non-trivial workload, between two and ten times more expensive than it needs to be, with no quality improvement to justify the markup. That is the claim this chapter defends.

The defense has two halves. Models in April 2026 do not line up on a single axis of “better.” A cheap model can answer some queries as well as a frontier model, and faster, and for a twentieth of the price. No single model is best on cost, latency, and quality at once for every query you might send it. That is the first half. The second half: a modest discipline, which we call routing [deciding, at the moment a query arrives, which model to send it to], can recover most of the money and most of the wait time with a small amount of engineering. There are several ways to do the deciding. None is obviously best. The one that fits your workload is the one that wins.

Back to our running example for a moment. You paste a forty-page research paper into a chat and ask for five bullet points. That is one query. The same afternoon, your partner uses the same product to ask “who were the authors?” That is another. If you send both to the same frontier model, you pay the frontier price twice. One of those two prices was earned. The other was overpaid. Most of this chapter is about why, and what to do about it.

A quick note on vocabulary. Six terms show up throughout the chapter: four router families and two supporting concepts. We introduce them one at a time. A heuristic router [one that picks models by hand-written rules] is the simplest. A classifier router [one that learns to pick from examples] replaces rules with a small learned model. A cascade tries cheap models first and escalates if the answer looks shaky. A committee runs several models in parallel and aggregates their answers. These are the four families we will compare. A model zoo is the set of candidate models a router can choose from. An oracle is the router that always picks the cheapest adequate model for every query, with perfect hindsight. Real routers are measured by how close they get to that unreachable ideal.

The promise of routing. The published literature, as of April 2026, reports cost reductions from 2x (RouteLLM (Ong et al. 2024)) up to 98%, or roughly 50x, at GPT-4 quality (FrugalGPT (Chen et al. 2024)), with the best classifier-based systems eliminating about 40% of frontier-model calls at no measured quality loss (Hybrid LLM (Ding et al. 2024)). Those are headline numbers on controlled benchmarks. They will not transfer verbatim to your workload. The point is that the ceiling on savings is high, not that any specific method will hit it on your traffic.

17.1 Why one model is not the answer

The case against sending everything to one model rests on a simple observation: the queries are not all alike.

Picture a production chat product on an ordinary afternoon. Some people ask one-line lookups. “What’s the refund policy.” Others ask for a format change. “Turn this into a bulleted list.” Some paste a sentence and want a sentiment label. Some paste a document and want it summarized. Some work through a short reasoning problem. A few are running the model as an agent across multiple turns. And once in a while, someone hands it a long research query of the kind only Opus or GPT-5.4 can handle well. That last category is the expensive one. In most workloads we have looked at, it is also the smallest slice by volume.

The forty-page paper lives partway up that staircase. “Who were the authors?” is four words; a 1B-parameter model on a phone can answer it. “Summarize the whole paper” is in the middle; several models can do it well. “Reconcile this paper’s argument with three others it has never seen” is at the top; only a frontier model should be touching that one. One product, one reader, one afternoon, the whole range.

Send every one of those requests to a frontier model and you pay the frontier rate on the easy majority to get the quality you need on the hard minority. That is the arithmetic. In the case study later in the chapter, on a customer-facing assistant workload where about 80% of turns are short and factual, the difference between “send everything to Sonnet” and “route by difficulty” is roughly a 6x cost reduction, with the escalation tuned conservatively enough that an offline evaluation cannot find a quality drop. The published literature’s numbers are not far off for the same shape of workload.

The latency side of the argument is sharper still. An interactive experience, a chat widget, a voice agent, a live-search surface, needs the first word to arrive fast. Users feel a response as slow above roughly 300 ms time-to-first-token [the lag from pressing send to the first word showing up] for voice, 500 ms for text chat, and a few seconds for document generation. Frontier cloud models in April 2026 typically deliver their first token in 400 to 1,200 ms, depending on load and prompt length. That is workable for asynchronous work and painful for anything the user is actively watching. Meanwhile, Groq’s custom inference silicon serves Llama 3.3 70B at sub-100 ms time-to-first-token in the steady state. Local 4-bit 7B models on a recent Mac answer in around 50 ms. Gemini Flash-Lite sits in between. The latency gap across tiers is not 10% or 20%. It runs 5x to 20x. For certain surfaces that gap matters more than any quality gap does. Ask the reader who paused a YouTube video to wait for a voice assistant: half a second of silence feels like a bug.

Quality is the hardest of the three to summarize honestly. A scalar leaderboard like LMArena (Chiang et al. 2024) produces a single Elo number per model. That number is useful for a coarse ranking and badly misleading for routing decisions, because the model that wins on average can lose badly on specific task classes. A router that needs to know “is Haiku good enough for extraction from this document” cannot get that from an Elo number. It needs per-task-class measurements. The Artificial Analysis snapshots (Artificial Analysis 2026) and the AA-Omniscience work (Artificial Analysis 2025) at least gesture toward that picture, and most teams end up building their own eval harness for their own workload. We return to this in Section 17.4.

Apply the Sagan Baloney Detection Kit (Sagan 1995) to a generic “model X is best on benchmark Y” claim and two of the questions land hard. Is the benchmark reproducible on your distribution? Almost never. Who benefits from the claim being true? Whoever sells model X. Sagan’s point is not that benchmark claims are lies. It is that a claim made by a seller, on data the buyer cannot rerun, deserves the same skepticism as any other sales pitch. A per-task-class eval on your own traffic is what passes the kit. Nothing else does.

The picture across all three axes, as of April 2026, is that the frontier is genuinely ahead on hard tasks (agentic multi-step reasoning, code synthesis on enterprise-sized repositories, long-context retrieval with adversarial distractors), roughly matched on medium tasks (extraction, summarization, single-turn Q&A over short context), and roughly dominated by the cheap-and-fast tier on easy tasks (classification, formatting, short lookups, tone rewrites). “Dominated” is load-bearing. On easy tasks the cheap models are not merely acceptable. They are often faster and give the same answer. Paying Opus rates for a sentiment classification is not cautious engineering. It is overpaying.

Two caveats before we move on. First, the shape of your query distribution is a fact about your product, not about the model market. A code-review assistant sees mostly hard queries, and its routing gains are smaller. A customer-support retrieval layer sees mostly easy ones, and its gains are large. Measure your own traffic before deciding. Second, the market moves. Every number in this chapter is an April 2026 snapshot, and we expect the gaps between tiers to narrow on some dimensions and widen on others as new models ship. The argument for routing survives the movement, because it is an argument about uneven query distributions, not a claim about a specific price delta.

17.2 The cost, latency, and quality trade surface

Picture the set of all available models as points in a three-dimensional space. One axis: cost per request, in dollars or fractions of one. One axis: latency, from send to first word. One axis: quality, which we have to pin down per task class, because it is not scalar.

The naive “pick the best model” view collapses this space to a line. It says: there is one quality number per model, and among the models at acceptable quality you pick the cheapest. If quality were scalar, that would be fine. It is not.

Consider a small model zoo broadly representative of the April 2026 landscape:

Claude Opus 4.7. Top of the Anthropic stack. $5/$25 per million input/output tokens (Anthropic 2026). Deep reasoning, agent-strong, slow first token.
Claude Sonnet 4.6. $3/$15. Workhorse frontier model. Fast enough for most synchronous chat, strong on general tasks.
Claude Haiku 4.5. $1/$5. Noticeably faster than Sonnet, competitive on extraction, weaker on multi-step reasoning.
GPT-5.4. $2.50/$15 (OpenAI 2026). Frontier alternative to Sonnet and Opus. Distinctive strengths in structured output and tool use.
GPT-5.4-mini. $0.75/$4.50. Cheaper frontier-adjacent tier.
Gemini 3.1 Pro. $2/$12 under 200K context, higher above (Google 2026). Strong on long context, competitive on multimodal input.
Gemini 3.1 Flash-Lite. $0.25/$1.50. Cheap enough to feel free on short work, and surprisingly strong on classification and summarization.
Groq Llama 3.3 70B. About $0.59/$0.79 (Groq 2026). Sub-100 ms time-to-first-token in steady state. Quality roughly GPT-4-class on most tasks we have tested.
Groq Llama 3.1 8B. About $0.05/$0.08. Best latency in the zoo; quality adequate for short factual and formatting work.
Local 4-bit 7B to 14B. Near-zero marginal cost per token once the hardware is paid for (Rajput et al. 2025). Fastest time-to-first-token on good hardware for short prompts. Quality varies sharply with task class.

Plot those points in (cost, latency, quality-on-extraction) space and they do not form a line. Opus is quality-high, cost-high, latency-high. Groq 70B is latency-low, cost-low, quality-mid-to-high depending on task. Local is cost-zero, latency-lowest, quality-variable. Flash-Lite is cost-very-low, latency-low, and surprising-on-easy-tasks. There is no single ordering that respects all three axes on all task classes at the same time.

Think about what that means for the forty-page paper. The “who were the authors” question is three lookups into the paper itself. A local 7B model answers it in 50 ms for essentially free. “Summarize the paper” can go to Haiku or Flash-Lite for pennies and come back in a second. “Write Python that parses every table in this paper and joins them” needs Sonnet or GPT-5.4. “Compare the paper’s method to three specific prior methods the model must recall in detail” is an Opus query. One document. Four reasonable routing decisions.

The operator’s job: characterize your query distribution by task class, measure which models clear your quality bar on each class, then among the adequate models pick the cheapest or the fastest, whichever matters for that class. That is the mental model routing is built on. It is also, not coincidentally, the structure that the classifier-based routers in the literature learn to imitate automatically.
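The procedure in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular router: the model identifiers are hypothetical labels, the prices are the list prices quoted in the zoo above, and the per-task-class quality scores are made-up placeholders standing in for numbers you would measure with your own eval harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str            # hypothetical label, not an official API model id
    input_price: float   # dollars per million input tokens
    output_price: float  # dollars per million output tokens


ZOO = [
    Model("claude-opus-4.7", 5.00, 25.00),
    Model("claude-sonnet-4.6", 3.00, 15.00),
    Model("claude-haiku-4.5", 1.00, 5.00),
    Model("gemini-3.1-flash-lite", 0.25, 1.50),
]

# ASSUMPTION: placeholder scores in [0, 1]; replace with your own measurements.
QUALITY = {
    "extraction": {
        "claude-opus-4.7": 0.97, "claude-sonnet-4.6": 0.96,
        "claude-haiku-4.5": 0.94, "gemini-3.1-flash-lite": 0.91,
    },
    "multi_step_reasoning": {
        "claude-opus-4.7": 0.92, "claude-sonnet-4.6": 0.85,
        "claude-haiku-4.5": 0.66, "gemini-3.1-flash-lite": 0.58,
    },
}


def request_cost(model: Model, input_tokens: int, output_tokens: int) -> float:
    """Headline cost of one request in dollars (no caching or batch discounts)."""
    return (input_tokens * model.input_price
            + output_tokens * model.output_price) / 1_000_000


def cheapest_adequate(task_class: str, quality_bar: float,
                      input_tokens: int, output_tokens: int) -> Optional[Model]:
    """Among models that clear the bar on this task class, return the cheapest."""
    scores = QUALITY[task_class]
    adequate = [m for m in ZOO if scores.get(m.name, 0.0) >= quality_bar]
    if not adequate:
        return None  # nothing in the zoo clears the bar for this class
    return min(adequate, key=lambda m: request_cost(m, input_tokens, output_tokens))
```

Swapping the `min` key from cost to a measured latency gives the "fastest adequate" variant for latency-sensitive task classes.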

Quality is per-task-class, not scalar. A model can be 95th percentile on structured extraction and 40th on novel math. A scalar leaderboard hides this. If you are building a router, you need an eval harness that reports quality per task class, and you almost certainly have to build it yourself, because no public benchmark covers your exact mix. This is the single most common failure mode we see: teams pick a cheap model on a general benchmark, ship it, and find out two weeks later that it gets their specific task wrong in ways the general benchmark never measured.

One more structural point before we move into strategies. Pricing for most vendors in April 2026 has three multipliers that change the math. Prompt caching [a discount for prompt prefixes the vendor has seen before in a recent window] knocks up to 90% off cached input at Anthropic, OpenAI, and Google (Anthropic 2026; OpenAI 2026; Google 2026). Batch APIs [bulk interfaces for jobs that can wait a few hours] give 50% off at most vendors. Long-context surcharges roughly double the per-token price above 200K tokens at Google, with specific tiers at Anthropic. A router that ignores all three and treats the headline per-token price as the cost of a model gets the arithmetic wrong by 2x to 5x in either direction, depending on workload. Chapter 18 walks through the detailed arithmetic. For this chapter, carry forward that the cost axis is not a fixed price but a negotiated one that depends on how you use the model.
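To make the size of those multipliers concrete, here is a rough sketch of effective input cost under caching and batch discounts. The function and its parameter names are ours, for illustration only. It assumes the discounts simply multiply together, which is a simplification: whether caching and batch pricing combine, and what a cache write costs, varies by vendor. It also ignores long-context surcharges.

```python
def effective_input_cost(input_tokens: int,
                         price_per_million: float,
                         cached_fraction: float = 0.0,
                         cache_discount: float = 0.90,
                         batch: bool = False,
                         batch_discount: float = 0.50) -> float:
    """Dollars for the input side of one request.

    ASSUMPTIONS: cached tokens are billed at (1 - cache_discount) of list
    price, batch jobs get a flat discount on top, and the two stack.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable = fresh + cached * (1.0 - cache_discount)
    cost = billable * price_per_million / 1_000_000
    if batch:
        cost *= 1.0 - batch_discount
    return cost


# 10,000 input tokens at a $3 per million list price.
headline = effective_input_cost(10_000, 3.0)
mostly_cached = effective_input_cost(10_000, 3.0, cached_fraction=0.8)
print(f"headline ${headline:.4f}, 80% cached ${mostly_cached:.4f}")
```

With 80% of the prompt served from cache, the input side costs roughly 3.6 times less than the headline figure, which is the kind of gap a router comparing list prices would miss.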

17.3 Routing strategies: heuristic, classifier, cascade, committee

Four families of router have enough published work and deployment experience behind them to merit their own section. They are not mutually exclusive. Most production systems combine two or three. We present them in order of increasing mechanical sophistication, which is roughly the order of increasing implementation cost.

17.3.1 Heuristic routing

The simplest router is a short stack of rules a human wrote. If the query is under 30 tokens and asks for a format transformation, send to Groq 8B. If it includes a code block over 500 lines, send to Sonnet. If it mentions a filename already in the session’s tool state, let the agent’s main model handle it. If the user has clicked “deep research,” send to Opus. And so on.
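A rule stack like that one might look as follows. This is a sketch under stated assumptions: the `Query` fields, the keyword pattern, the route labels, and the default route are all hypothetical, and counting whitespace-separated words is a crude stand-in for a real tokenizer. The rules fire in the order the paragraph lists them, and the first match wins, so the ordering is itself a design decision.

```python
import re
from dataclasses import dataclass

# Crude keyword test for "asks for a format transformation" (an assumption).
FORMAT_WORDS = re.compile(
    r"\b(bulleted list|reformat|convert|turn this into|rewrite as)\b",
    re.IGNORECASE,
)


@dataclass
class Query:
    text: str
    code_lines: int = 0                      # longest attached code block, in lines
    deep_research: bool = False              # user clicked "deep research"
    session_files: frozenset = frozenset()   # filenames already in tool state


def route(q: Query) -> str:
    """Return a route label; rules are checked top to bottom, first match wins."""
    approx_tokens = len(q.text.split())      # word count as a rough token proxy
    if approx_tokens < 30 and FORMAT_WORDS.search(q.text):
        return "groq-8b"
    if q.code_lines > 500:
        return "sonnet"
    if any(name in q.text for name in q.session_files):
        return "agent-main"
    if q.deep_research:
        return "opus"
    return "sonnet"                          # default route is an assumption


print(route(Query("Turn this into a bulleted list: apples, pears, plums")))
```

Because each branch is a single readable condition, a surprising decision can be traced to the rule that fired, which is the auditability the next paragraph describes.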

Heuristic routers are cheap to build, cheap to run (rule evaluation takes microseconds), and easy to audit. When a decision surprises a developer, they can point at the rule that fired and reason about it. They are also brittle. Every new product feature, every shift in the query distribution, every new model added to the zoo, becomes a rule the team has to author and keep. Over 18 months of active development the rule stack grows to a size where new rules routinely conflict with old ones and nobody remembers why the fifth rule was added.

The honest case for heuristic routing: build it first, and in some cases keep it last. A workload that is 90% short factual lookups and 10% anything else can be routed acceptably by two rules. One catches the easy case and sends it to the cheap tier. One defaults everything else to the frontier. That stack captures most of the available savings, and it is the starting point before the team convinces itself to build a classifier.

Imagine a dumb rule catching the forty-page paper. “If the user attached a PDF over 15,000 tokens, send to a long-context model.” Two lines of Python. No training data. No drift monitoring. It catches the most obvious routing decision for that specific product surface. You would be surprised how far this takes you.

A heuristic router also pairs well with any other strategy. Most production systems we have seen use heuristics as a pre-filter (a guard that catches obvious cases before they reach a more expensive router), regardless of what sits behind them.

17.3.2 Classifier routing

A classifier router replaces hand-written rules with a small learned model that predicts, from features of the query, which tier is adequate. Features can include query length, whether the query contains code, the detected language, intent labels, embeddings, or in the most flexible variants the LLM’s own output as a feature. The target variable can be a preference label (“which model’s answer did humans prefer”), an oracle-routing label (“cheapest model that matched the frontier answer”), or a judge-model label.

RouteLLM (Ong et al. 2024) is the canonical open reference. Ong and colleagues train several router variants (similarity-weighted ranking, matrix factorization, BERT classifier, causal LLM classifier) on preference data from Chatbot Arena, targeting the decision “strong model or weak model.” The reported result: about 2x cost reduction with quality maintained at the 80% threshold on MT-Bench and MMLU. The specific techniques matter less than the shape of the result. A classifier trained on a preference signal and deployed in front of a two-tier model zoo can cut frontier-model usage in half with careful eval.

Hybrid LLM (Ding et al. 2024) comes at the same problem with a different objective. Rather than training on preference labels, Ding and colleagues train the router to predict the quality gap between a small model and a large one, and send to the small model whenever the predicted gap is below a tunable threshold. The reported numbers: up to 40% fewer frontier calls at no quality loss on their eval. The quality-gap framing is more honest than preference routing, because it acknowledges the router is not picking “the better answer.” It is picking “an answer not worse than the frontier by more than threshold T.”

The appeal of classifier routing is that it adapts. A team ships a new feature, the query distribution shifts, the router is retrained on recent traffic, the savings come back. The cost is real. You are now running and maintaining a small ML system, with its own training pipeline, its own drift monitoring, its own retraining cadence. The router becomes a component with operational surface area. Most teams we have seen who went directly to classifier routing without first running a heuristic version for a quarter regretted it, because they did not yet understand their own query distribution well enough to know what the classifier should be targeting.

Two practical notes. A classifier’s features have to be available at routing time, which rules out features that require calling the LLM first. You are not going to call the expensive model to decide whether to call the expensive model. That constraint is tighter than it sounds, and it rules out several tempting signals. Second, the labels matter more than the model class. A classifier trained on preference labels learns one thing. A classifier trained on oracle-cost labels learns another. The two routers can disagree in production on the same query, and neither is obviously right.

Walk the classifier through the forty-page paper once. “Who were the authors” arrives: short, no code, contains the question word “who,” flagged extraction. The classifier emits “local” or “Haiku.” “Summarize the paper” arrives: long attachment, no code, classified “summarization.” The classifier emits “Haiku” or “Sonnet” depending on the quality-gap threshold. “Compare the argument to three others” arrives: long attachment, multiple proper-noun entities, classified “deep reasoning.” The classifier emits “Opus.” Same product, same reader, three different routes in a single session. That is what a classifier is for.

17.3.3 Cascade routing

A cascade tries cheap first and escalates on signal. Send the query to Haiku. If Haiku’s answer looks confident and well-formed, return it. Otherwise, send the same query to Sonnet and return that. The escalation signal can come from several places. The cheap model’s log-probabilities [the model’s own self-reported confidence in each word it generated]. A lightweight judge’s score. A self-consistency check [asking the same question twice and seeing whether the answers agree]. A format-validation check (did the JSON parse, did the tool call match schema). An explicit fallback signal from the cheap model, like “I don’t know.”
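A two-tier cascade using the cheapest of those signals (format validation and an explicit fallback phrase) can be sketched as below. The model calls are passed in as plain callables so the example stays self-contained; the refusal phrases and the function names are illustrative assumptions, not a vetted detector. Log-probability, judge, and self-consistency triggers would slot into `looks_shaky` in the same way.

```python
import json
from typing import Callable, List, Tuple

# ASSUMPTION: a small hand-picked list of punt phrases, for illustration only.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "i don't have access",
    "i'm not sure",
)


def looks_shaky(answer: str, expect_json: bool = False) -> bool:
    """Cheap escalation check: empty answer, explicit punt, or invalid JSON."""
    if not answer.strip():
        return True
    lowered = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return True
    if expect_json:
        try:
            json.loads(answer)
        except json.JSONDecodeError:
            return True
    return False


def cascade(query: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str],
            expect_json: bool = False) -> Tuple[str, List[str]]:
    """Try the cheap model first; escalate only if its answer looks shaky.

    Returns the answer plus the list of tiers that were actually called,
    which is what you would log to account for cost afterwards.
    """
    first = cheap(query)
    if not looks_shaky(first, expect_json):
        return first, ["cheap"]
    return strong(query), ["cheap", "strong"]
```

Returning the list of tiers called alongside the answer keeps the cost accounting honest: an escalated query is billed for both calls.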

FrugalGPT (Chen et al. 2024) introduced the cascade framing to the LLM routing literature. The paper reports up to 98% cost reduction while matching GPT-4 quality on several benchmarks, using a cascade of up to three models with a small learned scorer that decides whether each answer is reliable enough to return. The 98% number is a ceiling on specific benchmarks, not a general claim. Run the Sagan checklist on it: is it reproducible on your traffic? Who benefits from it being true? The durable contribution of the paper is not the 98%. It is the structure, cheap-first with a cheap-to-compute escalation signal, which has become the standard deployment pattern for cost-sensitive production systems.

Yue and colleagues’ Large Language Model Cascades with Mixture of Thoughts (Yue et al. 2024) extends cascades by mixing representations. The cheap tier generates intermediate thoughts in more than one format (chain-of-thought and program-of-thought), and the cascade escalates when those thoughts disagree. Reported cost: around 40% of full-frontier spend at comparable performance on reasoning benchmarks. The structural point: “confidence” does not have to be a single signal. Agreement across representations is itself a useful escalation trigger.

Think of a cascade as a spend-shaping mechanism. It pays roughly the cheap-tier cost on easy queries, roughly the cheap cost plus the frontier cost on hard queries (you pay for both, because you called both), and it requires that the escalation trigger itself be cheap. On a distribution where easy queries dominate, a cascade wins by a lot. On a distribution where most queries escalate, a cascade can end up worse than going straight to the frontier, because you are paying the cheap-tier cost and the cheap-tier latency as useless overhead on every hard query. Where the break-even falls depends on the price ratio between the tiers. Measure before you commit.
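The spend-shaping arithmetic is short enough to write down. The sketch below assumes a two-tier cascade with fixed per-query costs for each tier and for the check. Those are simplifications, since real costs vary with prompt and answer length, and the function names are ours.

```python
def expected_cascade_cost(p_escalate: float,
                          cheap_cost: float,
                          strong_cost: float,
                          check_cost: float = 0.0) -> float:
    """Expected cost per query: every query pays the cheap tier and the
    check; the escalated fraction additionally pays the strong tier."""
    return cheap_cost + check_cost + p_escalate * strong_cost


def break_even_escalation_rate(cheap_cost: float,
                               strong_cost: float,
                               check_cost: float = 0.0) -> float:
    """Escalation rate above which sending everything to the strong tier
    is cheaper than running the cascade."""
    return 1.0 - (cheap_cost + check_cost) / strong_cost


# Illustrative units: cheap tier costs 1, strong tier costs 10.
print(expected_cascade_cost(0.2, 1.0, 10.0))
print(break_even_escalation_rate(1.0, 10.0))
```

In these illustrative units the cascade costs 3 per query at a 20% escalation rate against 10 for the strong tier alone, and it stays cheaper on cost until about 90% of queries escalate. This counts cost only: every escalated query also waits for both calls, so a latency budget can rule the cascade out well before the cost break-even.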

Back to the paper. Send the summarize request to Haiku. Haiku comes back with a clean five-bullet summary. A lightweight check runs: the answer has five bullets, each under 30 words, no “I don’t know” tokens. The cascade returns Haiku’s answer. Total cost: pennies. Now send the “compare to three others” request. Haiku comes back with something that starts “I don’t have access to the three other papers you mentioned.” The check fires. The query goes to Sonnet, which does know a couple of the three from pretraining and says so explicitly. The user gets Sonnet’s answer. Total cost: Haiku + Sonnet, plus the tiny overhead of running the check.

There is a subtle trap with cascades. Confidence signals from cheap models are systematically miscalibrated. Kalai and colleagues (Kalai et al. 2025) argue, and we agree, that the training objective for most chat models rewards confident guesses over admitting ignorance. The log-probability of a cheap model’s answer is often high regardless of whether the answer is correct. An escalation signal built on raw log-probs misses many cases where the cheap model was confidently wrong. Confidently wrong is exactly the hallucination case we most wanted to catch. A self-consistency signal (ask twice, compare) is a stronger trigger than log-probs precisely because disagreement between two generations is harder to fake than overconfidence in a single generation.

17.3.4 Committee routing

A committee runs several models on the same query in parallel and aggregates. The aggregation can be majority vote on categorical outputs, averaging on scalar outputs, best-of-N with a judge model doing the selecting, or more sophisticated consensus protocols. Committee routing is the most expensive strategy (you pay for every model every time) and the one with the highest quality ceiling.
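For categorical outputs, the simplest aggregation is a majority vote with a minimum-agreement requirement. The sketch below is illustrative only: the normalization (strip and lowercase) and the `min_agreement` parameter are our assumptions, and free-text answers would need a judge model or a semantic comparison instead of string equality.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str], min_agreement: int = 2) -> Optional[str]:
    """Return the most common normalized answer, or None when fewer than
    `min_agreement` committee members agree on it."""
    if not answers:
        return None
    counts = Counter(a.strip().lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    return label if votes >= min_agreement else None


# Three models label the same sentence; two agree after normalization.
print(majority_vote(["Positive", "positive ", "negative"]))
```

Returning `None` on disagreement gives the caller an explicit signal to hold the action or hand off to a human, which matches the agreement-before-action pattern in high-stakes settings.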

The research literature on LLM committees is thinner than on the other strategies. The general machine-learning result that ensembles reduce variance, which matters when the cost of a wrong answer is high, is well established. In LLM practice, the pattern shows up in three places. In high-stakes applications (medical triage, legal review, code-review sign-off) where teams run two or three frontier models and require agreement before taking an action. In adversarial or safety-critical settings, where the same query runs through a capable model and a safety judge, and the judge can veto. And in research labs chasing benchmark scores, where the cost of running the committee is irrelevant because the budget is paying for a leaderboard number, not for production traffic.

In production systems with a budget to defend, committee routing is a tool for the critical path, not the default. The structure we see work: route the query through a cheap or cascade path first, and only invoke a committee when the stakes justify it (user flagged the answer; the tool call is about to be irreversible; the answer claims high confidence on a load-bearing fact). The rest of the time, committee is overkill. A production system that defaults to committee routing across all traffic is paying 2x to 5x the necessary cost, and when it does manage to show a quality gain in an offline eval, it is almost always because it picked a judge-friendly benchmark rather than a user-facing one.

The paper-summary workflow almost never justifies a committee. A summary is not irreversible, it is not load-bearing in the way a drug-interaction check is, and the user can re-run it in a different model if the first one looks off. Committee routing is for the moment when a legal-review tool is about to produce a final citation list, or a medical triage tool is about to recommend an action. Sparingly. With a real eval budget behind the claim that the committee helps.

17.3.5 How to choose

There is no single best strategy. The choice depends on your distribution and, honestly, on your team’s skill. A rough decision procedure, from our experience and the literature:

Skewed-easy distribution (say 80% of queries are fast-path): start with heuristic routing. Add one cascade step if the cheap tier’s errors are noticeable. A classifier is overkill for now.
Balanced distribution: a classifier router is the sensible long-term shape. Start with a heuristic version to learn your distribution, then replace the rules with a learned model once you have six months of traffic logs and can train against oracle labels.
High-variance budget (some queries are worth 10x more than others): a cascade is the right primitive. The cheap tier absorbs most traffic, the expensive tier handles the high-stakes overflow.
Critical-path, low-volume, high-stakes: committee routing is justifiable here and only here. Use sparingly. Measure the quality gain against a proper eval. Do not eyeball it.

Default recommendation. Heuristic pre-filter always. Cascade as the primary cost-shaping mechanism. Classifier where the heuristic and the cascade cannot keep up with distribution shift. Committee on critical paths only, and only with an eval budget to prove the quality gain. Teams that skip the heuristic step and go straight to a classifier or committee are buying complexity they do not yet need.

17.4 Routing benchmarks and the measurement problem

There is a benchmark literature for routers, and it is both useful and misleading in characteristic ways.

RouterBench (Hu et al. 2024) is the most developed public router benchmark as of April 2026. Hu and colleagues curated 405,000 inference outcomes across a model zoo and a task suite, with cost and quality labels, and defined a set of oracle and reference routers to compare against. The benchmark is valuable for three things. It gives a shared vocabulary (what does “router accuracy” mean, quantitatively). It provides a clean regret metric [how much worse your router did than the oracle that would have picked the right model every time]. And it catches the obvious pathology of a router that has degenerated into sending everything to one model.

What RouterBench does not do, and does not claim to do, is predict how a router will perform on your workload. The benchmark task distribution is fixed. Your product’s query distribution is different. The benchmark model zoo is fixed. You probably have different models in mind. The benchmark quality labels were generated by a specific judge. Your quality definition is probably different. A router that wins on RouterBench is not guaranteed to win on your production traffic, and a router that loses on RouterBench might do fine on your traffic, because your distribution is less adversarial than the one the benchmark was built to stress.

This is the general problem in LLM evaluation: the benchmark and the workload are not the same distribution. It hits routers especially hard, because a router sits in front of the model. Small miscalibrations in the classifier compound across every query. A router that is correct 95% of the time still sends 5% of queries to the wrong model. If those 5% happen to be concentrated on a specific hard slice of your distribution, your users see a quality regression that the router’s aggregate accuracy hid.

Here is what that looks like in practice. Imagine you ship a classifier router on the paper-summary workflow. Offline it hits 95% accuracy. In production, users stop clicking through from the “compare to other papers” section. You dig into the logs. The classifier is routing “compare” queries to Haiku about a third of the time, because compare queries look short and the classifier learned that short queries belong on Haiku. Haiku handles the easier compare queries fine and produces confident, wrong answers on the harder ones. Your 5% error rate is concentrated almost entirely on the slice of traffic that matters most for that feature. Aggregate accuracy hid the bleeding.

The operational discipline we recommend: run production routers in shadow mode [the router’s output is logged but not used; a simpler router still makes the live decision] first. When a query arrives, route it for real with the current router, and log what each candidate router-plus-model would have decided. After a few weeks, replay the shadow traffic offline and measure regret: how much cost or quality did you lose compared to a retrospective oracle (the model that, in hindsight, was the right pick)? Regret is the metric that predicts whether changing your router will help. RouterBench numbers are a starting-point sanity check, not a production signal.
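A regret computation over replayed shadow logs might look like the sketch below. The record layout is hypothetical: it assumes each logged query carries a per-model cost and a per-model "adequate" flag from your offline eval. Cost regret and quality misses are reported separately here, which is one reasonable choice among several.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ShadowRecord:
    chosen: str                 # model the router under test would have picked
    cost: Dict[str, float]      # dollars per candidate model for this query
    adequate: Dict[str, bool]   # did each candidate clear the quality bar?


def oracle_pick(rec: ShadowRecord) -> Optional[str]:
    """Retrospective oracle: cheapest model that was adequate in hindsight."""
    ok = [m for m, good in rec.adequate.items() if good]
    return min(ok, key=lambda m: rec.cost[m]) if ok else None


def regret(records: List[ShadowRecord]) -> Tuple[float, int]:
    """Return (extra dollars spent versus the oracle, count of quality misses)."""
    extra_cost = 0.0
    quality_misses = 0
    for rec in records:
        best = oracle_pick(rec)
        if best is None:
            continue  # no candidate was adequate; not the router's fault
        if rec.adequate[rec.chosen]:
            extra_cost += rec.cost[rec.chosen] - rec.cost[best]
        else:
            quality_misses += 1  # an adequate model existed but was not chosen
    return extra_cost, quality_misses
```

Grouping the same two numbers by task class, rather than summing over all traffic, is what exposes a slice-level failure that an aggregate figure would hide.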

The 2025 literature has also started to surface a failure mode called router collapse. A classifier router trained on biased data converges to sending everything to one model, usually the one with the highest average win-rate in the training set. A collapsed router scores well on average accuracy (it agrees with the oracle when the oracle picks the frequent winner) and badly on regret against the tails of the distribution. The papers describing this are early and not yet converged on a standard citation, but the phenomenon is real enough that any classifier-router pipeline should include a distribution-check on router outputs (is the router using the cheap tier the expected fraction of the time?) as a basic health metric.
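The distribution check can be as simple as comparing observed route shares against the shares you expect. The sketch below is a minimal version; the expected shares, the tolerance, and the function names are assumptions you would set from your own traffic.

```python
from collections import Counter
from typing import Dict, List


def route_shares(decisions: List[str]) -> Dict[str, float]:
    """Fraction of traffic sent to each route."""
    counts = Counter(decisions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {route: n / total for route, n in counts.items()}


def collapse_alerts(decisions: List[str],
                    expected: Dict[str, float],
                    tolerance: float = 0.15) -> List[str]:
    """Routes whose observed share drifts more than `tolerance` (absolute)
    from the expected share. A collapsed router trips this immediately."""
    shares = route_shares(decisions)
    return sorted(
        route for route, want in expected.items()
        if abs(shares.get(route, 0.0) - want) > tolerance
    )


# A router expected to use the cheap tier 70% of the time, but sending
# almost everything to one model.
collapsed = ["sonnet"] * 95 + ["haiku"] * 5
print(collapse_alerts(collapsed, {"haiku": 0.7, "sonnet": 0.3}))
```

A check like this catches collapse but not subtler misrouting within a healthy-looking split, so it complements regret measurement rather than replacing it.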

Instrument every routing decision. A production router that does not log “which candidate did I pick, and what would each alternative have cost, and at what latency?” is a router you cannot improve. Instrumentation matters more than the initial design, because the initial design will be wrong on a workload you do not yet understand, and iteration is only possible on data you kept.

One more point on measurement. Quality evaluation inside a router loop often uses LLM-as-judge scoring. LLM judges carry their own biases (position, self-preference, verbosity), documented in Zheng and colleagues (Zheng et al. 2023). A router trained against a biased judge inherits the judge’s biases. This is not a showstopper. It is a reason to recalibrate the judge periodically against human labels on a small held-out set, and to treat judge scores as a useful-but-noisy signal rather than ground truth. Chapter 16 handles the judge-validity question in more depth.

17.5 The Costa router: a case study

The Costa router is a multi-model routing layer we run in front of several production workloads: Costa OS’s built-in voice and keybind assistant, the Synoros platform’s search and summarization tiers, and a handful of agentic tools around our development environment. The design goals were specific: be aware of per-task-class quality, not just scalar quality; treat latency as a first-class constraint; expose routing decisions as instrumented events so we can measure regret against a retrospective oracle.

The architecture, as shipped today, is a small PyTorch classifier that predicts a route label from features of the incoming query, with a regex fallback when the classifier has not yet been trained or its confidence is low, and an “I don’t know” detector on the local model’s output that escalates to cloud models when the local tier punts. Concretely:

Meta pre-filter. Questions about the system itself (“what can you do”, “show me my usage”, “what time is it”) get answered by a handful of hard-coded responses before routing. This is the heuristic step. It catches a small but real slice of traffic at zero model cost.
Classifier stage. A small MLP reads hand-engineered features of the query (length; keyword detectors for action, code, web, file-search, and system-info; a character-trigram hash; entity hits against known-external and known-local vocabularies) and predicts one of seven route labels: local, local-will-escalate, haiku+web, sonnet, opus, file_search, window_manager. When the classifier’s confidence falls below 85%, the system asks the local LLM to classify the query instead, and takes whichever answer is more confident.
Local tier. Routes labeled local go to whichever Ollama model is currently loaded on the GPU. A separate heuristic chooses which local model to load based on the predicted task category (architecture, code_write, deep_knowledge, system_info, and similar) and available VRAM, preferring already-warm models when they score within 5% of the category-best model, to avoid a 2 to 6 second model-swap cost.
Cloud tier. Routes labeled haiku+web, sonnet, or opus shell out to the Claude CLI with the corresponding model. This is a cheap-to-build cloud escalation path: the CLI handles auth, streaming, and tool use for us.
Auto-escalation. When a local-tier answer comes back and matches an “I don’t know” regex (dozens of patterns: “I don’t have access”, “as an AI model”, “you would need to check”, “my knowledge cutoff”, and so on), the router re-issues the query to Claude Haiku rather than returning the punt to the user. This is a cascade step, and the escalation trigger is phrase-matching on the local model’s output rather than log-prob confidence or self-consistency.
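The control flow of the stages above can be sketched in a few lines. This is an illustration of the described flow, not the shipped code: the classifier, the local-LLM classifier, and the model runner are passed in as plain callables, the canned meta response is a placeholder, the punt patterns are only the four phrases quoted above, and `"haiku"` is a hypothetical name for the escalation target.

```python
import re

CONFIDENCE_FLOOR = 0.85      # threshold stated in the text
ESCALATION_TARGET = "haiku"  # hypothetical label for the Claude Haiku route

# Illustrative subset; the text says the real list has dozens of patterns.
PUNT_PATTERNS = re.compile(
    r"i don['’]t have access|as an ai model|you would need to check|my knowledge cutoff",
    re.IGNORECASE,
)

# Placeholder canned answer for the meta pre-filter.
META_RESPONSES = {"what can you do": "<canned capabilities text>"}


def choose_route(query, classifier, llm_classifier):
    """Return (label, confidence), consulting the local LLM when unsure."""
    label, conf = classifier(query)
    if conf >= CONFIDENCE_FLOOR:
        return label, conf
    alt_label, alt_conf = llm_classifier(query)
    # Take whichever of the two answers is more confident.
    return (alt_label, alt_conf) if alt_conf > conf else (label, conf)


def is_punt(answer):
    """True when the answer matches an 'I don't know' phrase."""
    return PUNT_PATTERNS.search(answer) is not None


def handle(query, classifier, llm_classifier, run_model):
    """Pre-filter, route, run, and escalate a local punt. Returns (route, answer)."""
    canned = META_RESPONSES.get(query.strip().lower().rstrip("?"))
    if canned is not None:
        return "meta", canned  # answered at zero model cost
    label, _ = choose_route(query, classifier, llm_classifier)
    answer = run_model(label, query)
    if label == "local" and is_punt(answer):
        return ESCALATION_TARGET, run_model(ESCALATION_TARGET, query)
    return label, answer
```

Passing the classifiers and the runner in as callables keeps the sketch testable with stubs; how each is actually implemented (PyTorch MLP, Ollama call, CLI invocation) is outside what this block shows.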

A few things the current system does not do that the chapter’s literature review discusses. There is no committee path in production. There is no self-consistency check on the cheap tier; the escalation signal is phrase detection, which we know is crude. There is no Groq or Gemini tier in the routing decision today, though a gemini provider exists for delegated-thinking subtasks. The model zoo as shipped is Ollama local plus Claude Haiku plus Claude Sonnet plus Claude Opus, not the fuller zoo the strategy sections describe.

That gap is deliberate. The literature review describes the design space. The shipped router describes what we found cost-justified for our own traffic in April 2026, which is dominated by short Costa OS system queries where local plus Haiku covers the vast majority of calls, and the gain from a more diverse zoo has not yet cleared the complexity bar. If your workload differs, the strategy sections above describe the space you have to pick from.

Now the forty-page paper, run through the shipped system. Paste it in and ask “summarize in five bullets.” The classifier reads the features: long attachment, short query, keyword “summarize,” low code signal. If the loaded local model has a context window big enough for the paper, the route goes local. If not, the route goes to Haiku. Ask “who wrote this?” The classifier reads: very short query, no code, contains “who,” identified as extraction. Route: local. Ask “write Python to extract every table.” Classifier: code keyword, action keyword, substantial length. Route: sonnet. Ask “compare this paper’s argument to Dziri et al. 2023 and Mirzadeh et al. 2025.” Classifier: multiple proper-noun entities, deep-reasoning keyword, no code. Route: opus. One document, one reader, four different routes. The router logs each decision so that regret against a retrospective oracle can be measured later on the actual traffic.
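The “classifier reads the features” step in this walk-through can be illustrated with a toy featurizer. The sketch below assumes hypothetical keyword lists and a 32-bucket character-trigram hash (the hashing trick, here via `zlib.crc32` so the buckets are stable across runs); the real feature set is larger, and its exact keywords and dimensions are not given in the text.

```python
import re
import zlib

# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "code": ("python", "function", "script"),
    "summarize": ("summarize", "summary"),
    "extraction": ("who", "when", "where"),
}
N_BUCKETS = 32  # assumed size of the trigram-hash vector


def trigram_buckets(text, n_buckets=N_BUCKETS):
    """Count character trigrams into a fixed number of hash buckets."""
    vec = [0] * n_buckets
    padded = f"  {text.lower()} "  # pad so word edges form trigrams too
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % n_buckets] += 1
    return vec


def featurize(query, attachment_chars=0):
    """Turn a query (and attachment size) into a flat feature dict."""
    words = re.findall(r"[a-z0-9_]+", query.lower())
    flags = {
        name: int(any(keyword in words for keyword in keywords))
        for name, keywords in KEYWORDS.items()
    }
    return {
        "query_words": len(words),
        "attachment_chars": attachment_chars,
        **flags,
        "trigrams": trigram_buckets(query),
    }


who = featurize("who wrote this?", attachment_chars=120_000)
tables = featurize("write Python to extract every table.", attachment_chars=120_000)
```

On these two queries the keyword flags already differ (`extraction` fires for the first, `code` for the second), which is the kind of signal a small MLP can separate into different route labels.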

A first-party router is not a research project. The Costa router exists because every production system we run has a different query distribution and a different cost-to-latency budget, and no off-the-shelf router is tuned to any of them. The payoff is not that our router beats RouteLLM on RouteLLM’s benchmark. The payoff is that we can measure regret against a retrospective oracle on our real traffic, and that the thing we ship reflects decisions we made with our eyes open. That is the posture we recommend: not “run the best public router,” but “run the simplest router you can measure on your workload, and improve what you measure.”

17.6 What to carry forward

Four claims from this chapter underwrite the next one, on the cost arithmetic that routing enables.

First, the queries are the workload, not the model. A router is a commitment to take your own distribution seriously and to stop paying frontier rates on queries that do not need them.

Second, cost, latency, and quality do not line up on a single axis, and quality is per-task-class. Any decision procedure that collapses these to one number is throwing away information your users will feel.

Third, there are four routing-strategy families (heuristic, classifier, cascade, committee), and the sensible production shape combines them. Heuristic pre-filter, cascade as the primary cost-shaping mechanism, classifier on the escalation decision, committee on critical paths only.

Fourth, a router you cannot measure is not a router. It is a configuration file. Instrumentation, shadow replay, and regret against a retrospective oracle are the disciplines that separate a routing system from a routing theory. Chapter 15 (Chapter 16) handled the observability substrate.

Back to the forty-page paper one last time. A reader who finished this chapter should now, when they paste the paper into a chat and ask for five bullet points, be able to see at least two layers of machinery they were previously blind to. They should see that the system may have already routed their query before the model even read the first page. And they should be able to guess whether their product is paying a frontier rate on what was always a Haiku question.

Anthropic. 2026. Claude API Pricing. Official pricing page, snapshot April 2026. https://platform.claude.com/docs/en/about-claude/pricing.

Artificial Analysis. 2025. AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. https://artificialanalysis.ai/evaluations/omniscience.

Artificial Analysis. 2026. Model Intelligence, Performance, and Price Snapshots. Artificialanalysis.ai, continuously updated. https://artificialanalysis.ai/.

Chen, Lingjiao, Matei Zaharia, and James Zou. 2024. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” Transactions on Machine Learning Research. https://arxiv.org/abs/2305.05176.

Chiang, Wei-Lin, Lianmin Zheng, et al. 2024. “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.” International Conference on Machine Learning (ICML 2024). https://arxiv.org/abs/2403.04132.

Ding, Dujian, Ankur Mallick, Chi Wang, et al. 2024. “Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing.” International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2404.14618.

Google. 2026. Gemini API Pricing. Official pricing page, snapshot April 2026. https://ai.google.dev/gemini-api/docs/pricing.

Groq. 2026. Groq on-Demand Pricing for Tokens-as-a-Service. Official pricing page, snapshot April 2026. https://groq.com/pricing.

Hu, Qitian Jason, Jacob Bieker, Xiuyu Li, et al. 2024. RouterBench: A Benchmark for Multi-LLM Routing System. arXiv:2403.12031. https://arxiv.org/abs/2403.12031.

Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Ying Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.

Ong, Isaac, Amjad Almahairi, Vincent Wu, et al. 2024. RouteLLM: Learning to Route LLMs with Preference Data. LMSYS, arXiv:2406.18665. https://arxiv.org/abs/2406.18665.

OpenAI. 2026. OpenAI API Pricing. Official pricing page, snapshot April 2026. https://openai.com/api/pricing/.

Rajput, Subhash, et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, Llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.

Sagan, Carl. 1995. The Demon-Haunted World: Science as a Candle in the Dark. Random House.

Yue, Murong, Jian Zhao, Min Zhang, Liang Du, and Ziyu Yao. 2024. Large Language Model Cascades with Mixture of Thoughts Representations for Cost-Efficient Reasoning. arXiv:2310.03094v3. https://arxiv.org/abs/2310.03094.

Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2306.05685.