Cost-Tiered Deployment | Working with Large Language Models | Synoros

The cheap model handles the easy questions, the expensive model handles the hard ones, and the trick is telling them apart before the bill comes due.

Chapter 17 made the strategic argument: one model is not the answer, and routing [sending each request to the model best suited to it, instead of sending every request to the same model] is a first-class production discipline. The arithmetic comes next.

Every number in this chapter is stamped April 2026. Vendor pricing is the single most perishable content in the book. The tiers, the ratios, the order-of-magnitude comparisons all age gracefully; the dollar figures age in months. Treat them as snapshots, not laws. A pricing-snapshot adjunct file ships alongside the book, and the refresh cadence is quarterly.

Here is the opinion of this chapter, stated up front, because it runs against the default posture of most teams we talk to. Pick any production team that has a running AI feature. Watch what happens when a user asks the feature a question. In most cases, that question goes to whatever sits at the top of the vendor’s ladder, Claude Sonnet 4.6 or GPT-5.4, because it was the default the developer picked in a hurry and nobody has gone back to reconsider. The question might be easy. The question might be trivial. The question might be one a model costing a fiftieth as much would have answered identically. It does not matter, because the routing choice was made once, in a configuration file, by someone who had other things to do. A team running this way is typically paying somewhere between two and ten times what a competently-tiered setup pays for the same user-visible quality. That gap is not a rounding error. At any meaningful volume, it is the difference between a margin-positive feature and a line item the finance team asks you to justify.

Keep the paper-summary task in mind, the one we have followed since Section 1.1. You have a forty-page research paper and you want it reduced to five bullet points. Under a single-model default, every copy of that summary runs through the same top-of-ladder model at the same top-of-ladder price. The first reader who is curious about a new paper, the automated pipeline that summarizes every new arXiv submission every night, and the bulk enrichment job that re-summarizes last year’s corpus against a new taxonomy: all three pay Sonnet rates. Most of those summaries would have been handled competently by something a fifth the cost. A few of them genuinely do need the frontier. Tiering is how you stop paying frontier prices for the easy summaries while still paying them for the hard ones.

18.1 The six-tier escalation ladder

The canonical ladder, as of April 2026, has six rungs. Each rung has a job. The job determines when you reach for it, and when you skip it and escalate.

Local (a 7B to 14B model [a model with 7 to 14 billion parameters, small enough to fit in a single consumer GPU’s memory] at 4-bit quantization [a compression trick that stores each of the model’s numbers in 4 bits instead of the native 16, shrinking memory use by roughly 4x], running on Apple Silicon unified memory or a consumer GPU like an RTX 4090 or Radeon 9060 XT). Zero marginal cost per token. Sub-second latency on short prompts. The ceiling is visible: stripped of frontier-scale training, compressed to roughly a quarter of its native precision, it handles the bottom half of your request distribution and nothing more. See Section 18.2 for the amortization math.
Groq, running Llama 3.3 70B at roughly 250 to 300 tokens per second (April 2026). Priced at roughly $0.59 per million input tokens and $0.79 per million output tokens (Groq 2026). The niche is latency: there is almost nothing else at frontier-adjacent quality that ships a first token in under 200 ms. You reach for Groq when the user is watching a cursor blink, not when you are running an overnight batch.
Gemini 3.1 Flash-Lite at $0.25 per million input tokens and $1.50 per million output tokens (Google 2026). The cheapest cloud model with frontier-house branding. Acceptable on well-specified structured tasks. Flaky on anything that wants judgment.
Claude Haiku 4.5 at $1 per million input tokens and $5 per million output tokens (Anthropic 2026a). The workhorse of high-volume Anthropic deployments. Noticeably smarter than Flash-Lite on conversational tasks and instruction-following. Costs four times as much on input and a little over three times as much on output.
Claude Sonnet 4.6 at $3 per million input tokens and $15 per million output tokens (Anthropic 2026a), or GPT-5.4-mini at $0.75 per million in and $4.50 per million out (OpenAI 2026). The two defaults for competent production work. Sonnet wins on long-form reasoning and agent tasks; GPT-5.4-mini wins on cost. A careful shop runs both on adversarial evals [tests designed to find where a model breaks, not tests designed to flatter it] and picks per task class.
Claude Opus 4.7 at $5 per million input tokens and $25 per million output tokens (Anthropic 2026a), or GPT-5.4 at $2.50 per million in and $15 per million out (OpenAI 2026). The frontier tier. Reach for it when the task genuinely needs frontier reasoning, not because it is the default knob on the dashboard.

Put those six rungs side by side. The cost spread from rung 1 to rung 6 is four orders of magnitude on per-token spend: zero at the bottom, twenty-five dollars per million tokens at the top, a factor of tens of thousands between. For scale, at a penny per request you can run ten thousand requests for a hundred dollars; at the frontier-tier rate on a comparable request you pay the same hundred dollars for a few hundred requests. The latency spread is two orders of magnitude: 30 milliseconds to first token on a local 4-bit 7B, two to five seconds on Opus with a long prompt. But the quality spread on a hard task is less than one order of magnitude. Roughly, the frontier model is maybe six or eight times as good as the local one on the tasks where the local one struggles most, not ten thousand times as good.

Those three numbers are the whole argument for tiering. If quality scaled as aggressively as cost does, you could just always buy the top tier and stop thinking. It does not. The cost ladder is steep. The quality ladder is a gentle slope. You are paying a ruinous premium at the top for a modest quality bump.

The cost ladder is steeper than the quality ladder. A four-order-of-magnitude cost spread compresses to roughly a one-order-of-magnitude quality spread on well-specified tasks. That asymmetry is the reason routing works. Pretend a frontier model is three times as good as a local 7B on your task class, but a thousand times as expensive per token. The arithmetic answers itself.

A word on vocabulary before we go further. Input tokens are the tokens you send to the model: system prompt, conversation history, retrieved documents, the current user turn, tool schemas, tool results. Output tokens are the tokens the model generates back. Output is priced at roughly five to six times the input rate at every frontier vendor on the April 2026 ladder above (Groq’s Llama pricing, at about 1.3 times, is the outlier). This is a property of the hardware, not a pricing whim. When the model reads your prompt (the “prefill” step), it can chew through thousands of input tokens at the same time, because the GPU can look at all of them in parallel. When the model writes its response (the “decode” step), it has to produce one token, read what it just wrote, produce the next, and so on, a strictly sequential process. On each of those decode steps the GPU has to pull the entire model’s weights out of its memory to make a decision about a single token. That shuttling of weights from memory to the compute cores is the bottleneck, and it happens once per output token. So output tokens cost more because producing them is slower, per token, than reading them is.

This matters for the paper-summary task in a way that is worth pausing on. Pasting a forty-page paper in and asking for five bullet points is a long-input, short-output workload. The model swallows roughly twenty thousand input tokens (prefill, fast, cheap per token) and produces a few hundred output tokens of summary (decode, slow, expensive per token). Compare that to the opposite shape: handing the model a one-sentence prompt and asking it to write you forty pages of original prose. Same total token count. Dramatically different bill, because almost all the tokens are on the expensive side of the ledger. The paper summary is cheap for the same reason a library checkout is cheap compared to writing a novel: reading bulk text is fast, producing bulk text is slow.
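The two workload shapes can be priced in a few lines. This is a minimal sketch, assuming Sonnet 4.6's April 2026 list prices from the ladder above; the function name `request_cost` and the 20,000/500 token split are illustrative choices, not figures from any vendor.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request; prices are in dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed list prices for Sonnet 4.6 (dollars per million tokens).
SONNET_IN, SONNET_OUT = 3.00, 15.00

# Same total token count, opposite shapes.
summary_shape = request_cost(20_000, 500, SONNET_IN, SONNET_OUT)   # read a lot, write a little
writing_shape = request_cost(500, 20_000, SONNET_IN, SONNET_OUT)   # read a little, write a lot

print(f"summary-shaped request: ${summary_shape:.4f}")
print(f"writing-shaped request: ${writing_shape:.4f}")
print(f"ratio: {writing_shape / summary_shape:.1f}x")
```

Under these assumptions the writing-shaped request costs between four and five times as much as the summary-shaped one, even though both move the same number of tokens.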

Two tiers are deliberately absent from the ladder. The first is the open-source-via-API tier that mirrors what we already have locally: Llama 3.1 8B on Together or Fireworks, Mistral 7B on one of several hosts. These price at roughly $0.05 to $0.10 per million tokens (Groq 2026), and they exist. We fold them under “Groq” in the ladder above because the selection criterion is the same (cheap, fast, small-model ceiling), their quality sits within noise of one another, and calling out three near-identical options adds more complexity than it earns. The second is the ultra-cheap nano tier: GPT-5.4-nano at $0.20 per million in and $1.25 per million out (OpenAI 2026). We skip it in the canonical ladder not because it is bad, but because at that price point, if you have the option of running locally, you should measure whether local is cheaper (it usually is past a threshold we will establish in Section 18.2), and if you do not have that option, Flash-Lite sits at a comparable price point with broader instruction-following. Your mileage will vary.

18.1.1 When to skip a rung

Every escalation step costs you latency and money. Every step you skip costs you quality on the requests where the cheaper tier was underqualified. The balance is per-workload.

Skip local when you are latency-insensitive and volume-light. The capital expenditure on hardware does not pay back; just use Flash-Lite.
Skip Groq when you do not have a hard latency budget. Its value proposition is tokens-per-second, not dollars-per-token. At comparable quality, Flash-Lite is cheaper.
Skip Flash-Lite when quality matters more than the delta to Haiku. A four-times price premium for Haiku is often worth it on anything conversational.
Skip Haiku when you are already paying for Sonnet on the hard requests. At the margin of a routed system, the Sonnet escalation path is usually cheaper overall than running Haiku as a separate tier, because the Haiku-adequate band tends to overlap heavily with the Flash-Lite-adequate band.
Skip Sonnet or Opus when you can. Most teams do the opposite.

Every rung-skip is a hypothesis about your workload’s distribution. We formalize how to test those hypotheses in Section 18.4.

18.2 Local-tier economics

The local tier is the only rung where the per-token price on the pricing page is zero. That does not mean local is free. It means the cost hides somewhere else. You paid it in one lump when you bought the machine, and now the machine is producing tokens, and the real per-token price depends on how many tokens the machine produces before it dies or becomes obsolete.

The math fits on an envelope. For a given local setup:

amortized cost per token = (hardware CAPEX + energy over lifetime) / (tokens produced over lifetime)

Three numbers drive the answer: what the hardware cost, how many tokens a day you push through it, and how long you run it before it gets replaced. Energy is a tax on top, smaller than you might expect for a single box and larger than you expect for a rack.

Here is what that looks like with real numbers. Put a $10,000 Apple Silicon box on your desk. M2 Ultra, 192 GB of unified memory [memory shared between the CPU and GPU on the same chip, which matters for LLMs because it lets the GPU work with much larger models than the 24 GB you get on a consumer card]. Rajput et al. measured a production-grade MLX setup on this hardware at roughly 230 tokens per second on Llama 70B at 4-bit quantization, and roughly 150 tokens per second on llama.cpp at the same quantization (Rajput et al. 2025). Call it 200 tokens per second as a blended figure for planning.

Now run the machine flat out, every second of every day. Two hundred tokens a second, times 86,400 seconds a day, gives you 17.28 million tokens a day. To feel what that means, remember the paper-summary task. Forty pages of research paper is somewhere around 20,000 tokens. At full throttle, the box on your desk is digesting roughly 860 such papers a day, one paper every hundred seconds, around the clock. Or in the same units we used for context windows in Section 1.1: the box is reading a full copy of Crime and Punishment every ten minutes, seven days a week, and producing a new summary of something every few seconds while it does. That is the theoretical capacity of the hardware.

Amortize the $10,000 over two years of that. You get 12.6 billion tokens out the far end, at a hardware cost of $0.0008 per thousand output tokens, or $0.80 per million. Set that next to the April 2026 cloud prices. Haiku’s output rate is $5 per million, six times more. Flash-Lite’s is $1.50 per million, almost twice as much. Even Groq’s 70B rate at $0.79 per million is within spitting distance. Run the box for four years instead of two, and the per-token figure halves. The box starts to look like a bargain.

At 100% utilization. There is the catch.

No one runs a local inference box at 100% utilization. Real workloads have business hours, bursty traffic, long idle windows when everyone is asleep or in a meeting. The fraction of time the hardware is actively producing tokens (instead of sitting idle, fans spinning, consuming electricity for nothing) is what operators call the duty cycle. A realistic duty cycle for a single-tenant local box in production sits somewhere between 1% and 15%, not 100%.

Run the arithmetic again at those numbers and the picture changes hard. At 1% duty cycle, the amortized rate climbs by a factor of 100, to roughly $80 per million tokens. That is more than three times Opus’s $25 output rate. The same $10,000 box that looked like a bargain at full throttle is now the most expensive option on the ladder. At 15%, it lands at about $5.30 per million tokens, nearly breakeven against Haiku. The duty-cycle assumption is the whole argument. Anyone who quotes local-inference economics without naming a duty cycle is quoting a best case and hoping you do not check the footnotes.
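The duty-cycle sensitivity is easy to reproduce. The sketch below applies the amortization formula from this section to the $10,000, 200-tokens-per-second, two-year example; the function name and the `energy_usd` parameter (left at zero, matching the hardware-only figures in the text) are our own illustrative choices.

```python
SECONDS_PER_DAY = 86_400


def amortized_cost_per_million(capex_usd: float, tokens_per_second: float,
                               years: float, duty_cycle: float,
                               energy_usd: float = 0.0) -> float:
    """Amortized dollars per million tokens over the hardware's lifetime.

    duty_cycle is the fraction of wall-clock time spent producing tokens.
    """
    lifetime_tokens = (tokens_per_second * SECONDS_PER_DAY * 365
                       * years * duty_cycle)
    return (capex_usd + energy_usd) / lifetime_tokens * 1_000_000


# Same box, same lifetime, three utilization assumptions.
for duty in (1.00, 0.15, 0.01):
    rate = amortized_cost_per_million(10_000, 200, 2, duty)
    print(f"duty cycle {duty:>4.0%}: ${rate:,.2f} per million tokens")
```

Swapping in your own measured tokens per second and observed duty cycle is the quickest way to see which side of the breakeven you are on.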

The practical consequence is a breakeven volume. Below some threshold of tokens per day, cloud wins. Above it, local wins. Public 2026 TCO analyses converge on roughly 5,000 to 10,000 requests per day as the breakeven range for most setups, assuming average requests of a few thousand tokens (Various 2025–2026 TCO analyses 2026). Below 5K requests per day, buying hardware is a way to spend your money. Above 10K, buying hardware is a way to save it. In between, the exact threshold depends on request size, duty cycle, and which cloud tier you are replacing.

Tie this back to the paper-summary task. Suppose you are an individual researcher who wants to summarize, say, five new papers a week for your own use. That is roughly 20 summaries a month, about 400,000 tokens a month. Buying a $10,000 box to do that is roughly the worst economic decision available; you should be using Flash-Lite or Haiku for pennies a month. Now suppose you are running a research-paper ingestion service that summarizes every arXiv submission every night, tens of thousands of papers a month. The math flips. The cloud bill at that volume climbs into serious money. A local fleet at meaningful duty cycle starts paying for itself in months, not years.

Local wins at volume, or at privacy, or at latency. Not at convenience. The per-token math looks good at the tail of the usage curve and truly bad at the head. A team considering a Foundry or CostaBox for its “AI workflow” of 200 requests a day is almost certainly making a worse-than-cloud economic decision. A team routing millions of requests a day through a local fleet is almost certainly making a better-than-cloud one. The transition happens at volumes most teams overestimate when they first build.

The honest operator framing is that local inference economics are a function of your actual usage pattern, not of your hardware. Measure first, specify second. The generic “local is cheaper” talking point is true at the right volume and false at the wrong one.

18.3 Confidence-based fallback

A tiered system only pays off if the cheap tier is adequate on the queries it handles. “Adequate” means the user or the downstream system would accept the answer. The hard problem, the one at the heart of cost-tiered deployment, is detecting adequacy in real time, before you commit to shipping the cheap answer. You do not get to ask the user. You have to decide yourself, from the cheap tier’s output alone, whether to ship it or re-run it on the frontier.

Back to the paper-summary task. Suppose Haiku just produced five bullet points about a forty-page paper. You are about to return those bullets to the user. How do you tell, in the 50 milliseconds you have, whether Haiku understood the paper or produced a plausible-looking hallucination? That is the confidence problem, and you have four practical tools for it.

Output-shape validity. The cheapest signal by far. If the task specifies a JSON shape, does the cheap output parse? If it specifies a tool call, does the tool-call payload have the right fields? Does it satisfy the schema? This catches a large share of low-tier failures for free, because weaker models produce malformed outputs at a measurably higher rate than strong ones. It costs nothing beyond the parse.
Self-consistency. Run the cheap tier twice (or k times, small k). If the answers agree, ship one of them; if they diverge, escalate. FrugalGPT’s cascade [a pipeline that runs a cheap model first and only falls back to an expensive one when the cheap answer looks unreliable] makes the same ship-or-escalate decision, with a learned scoring function rather than repeated sampling as the trigger, and demonstrates up to 98% cost reduction at GPT-4-equivalent quality on their benchmarks (Chen et al. 2024). Yue et al.’s cascade paper (Yue et al. 2024) formalizes “answer consistency” as the escalation trigger and lands in the same general cost range. The cost of self-consistency is k times the cheap tier, which usually still comes in much lower than a single frontier call.
Judge-model scoring. Ask a stronger model (or the same model on a verification prompt) whether the cheap answer is correct. Highest-quality signal of the four. Also highest cost, because you pay for a second inference on every request. The operator trap is obvious: if your judge is the frontier tier, you have re-introduced the cost you were trying to avoid. The judge should almost always sit one or two rungs cheaper than the model you would escalate to, not the escalation target itself.
Log-probability of the output. The model’s own confidence, read off the sampler [the component that picks which token to write next from the model’s probability distribution]. Intuitively, a low-probability answer means the model was uncertain. In practice, log-prob is systematically miscalibrated across models. The training objective that produces hallucination (Kalai 2025’s incentive argument, unpacked in Section 1.4) also pushes the model to sound confident on wrong answers (Kalai et al. 2025). A model rewarded for sounding sure, even when it is wrong, will produce confident-looking log-probs on its hallucinations. Log-prob is useful as one signal among several. Never as the only one.

The editorial call: if you can only afford one paid confidence signal, make it self-consistency layered on top of output-shape validity, not a judge model. Self-consistency is cheap and well-supported in the cascade literature. Shape validity costs nothing. A two-out-of-three self-consistency check on Flash-Lite, combined with schema validation, catches the vast majority of what a Sonnet escalation would catch, at a fraction of the cost.
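A two-out-of-three check with schema validation can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `generate` stands in for whatever call reaches your cheap tier, the schema check is reduced to "parses as a JSON object with the required keys", and agreement means exact equality after canonicalization. That last assumption suits structured extraction; free-text summaries would need a looser comparison.

```python
import json
from collections import Counter
from typing import Callable, Optional


def shape_valid(raw: str, required_keys: set[str]) -> bool:
    """True if raw parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def canonical(raw: str) -> str:
    """Re-serialize so key order and whitespace do not count as disagreement."""
    return json.dumps(json.loads(raw), sort_keys=True)


def cheap_answer_or_none(generate: Callable[[], str], required_keys: set[str],
                         k: int = 3, quorum: int = 2) -> Optional[str]:
    """Sample the cheap tier k times; return the majority answer, or None to escalate."""
    samples = [generate() for _ in range(k)]
    valid = [canonical(s) for s in samples if shape_valid(s, required_keys)]
    if not valid:
        return None  # nothing parsed: escalate
    answer, votes = Counter(valid).most_common(1)[0]
    return answer if votes >= quorum else None
```

A caller treats `None` as the escalation signal and sends the request to the next rung. Sampling all k answers up front keeps the sketch short; a production version would likely stop as soon as a quorum is reached.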

Before you trust any of these four signals, apply the baloney-detection skepticism Sagan taught in The Demon-Haunted World (Section 1.4) to the claims behind them: is the reported cost reduction reproducible on workloads that do not look like the paper’s benchmark? Does the cheap tier in the cascade paper have the same quality floor as the cheap tier you are about to use? Who benefits from the “98% cost reduction” headline? Cascade papers are careful, but benchmark distributions rarely match production distributions. Your mileage on FrugalGPT-style numbers should be measured on your traffic, not taken from the abstract. We return to that discipline in Section 18.6.

Never use log-prob as your only confidence signal. It is the most obvious handle, and the least reliable one. A model that is systematically miscalibrated on hallucination will be systematically miscalibrated on log-prob, because they are the same calibration failure seen from two angles. Self-consistency and shape validity are cheaper and more honest.

18.3.1 What escalation looks like in practice

In pseudocode, for a request handler with two tiers and a confidence gate:

def handle(request):
    cheap_answer = cheap_model.generate(request)

    # Layer 1: shape validity (free)
    if not shape_valid(cheap_answer, request.schema):
        return frontier_model.generate(request)

    # Layer 2: self-consistency (k=2, ~1x extra cheap call)
    second_answer = cheap_model.generate(request)
    if not answers_agree(cheap_answer, second_answer):
        return frontier_model.generate(request)

    return cheap_answer

At most two calls to the cheap tier on every request, plus one call to the frontier tier on the escalation fraction. If the cheap tier’s answer ships on 80% of requests and the other 20% escalate, an upper bound on your effective cost per request is:

2 * cheap_price + 0.2 * frontier_price

versus

1 * frontier_price

for the always-frontier baseline. The break-even is wherever the cheap tier is less than about 40% of the frontier tier’s cost, which is satisfied everywhere below Sonnet in the April 2026 ladder. For the specific pair Flash-Lite / Sonnet, cheap cost is about 10% of frontier, so the cascade costs 2 * 0.1 + 0.2 = 0.4 of the always-frontier baseline: a 2.5x cost reduction at 80% first-tier hit rate, before accounting for latency gains on the 80% that never touched Sonnet.
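The cascade arithmetic generalizes to any escalation rate and any number of cheap calls. A minimal sketch, with prices expressed as fractions of the frontier price; the function names are ours.

```python
def cascade_cost(cheap_price: float, frontier_price: float,
                 escalation_rate: float, cheap_calls: int = 2) -> float:
    """Expected cost per request: cheap_calls cheap runs plus occasional frontier runs."""
    return cheap_calls * cheap_price + escalation_rate * frontier_price


def break_even_cheap_ratio(escalation_rate: float, cheap_calls: int = 2) -> float:
    """Largest cheap/frontier price ratio at which the cascade still beats always-frontier.

    Solves cheap_calls * c + escalation_rate * f < f for c / f.
    """
    return (1 - escalation_rate) / cheap_calls


frontier = 1.0   # normalize the frontier price to 1
cheap = 0.10     # cheap tier at about 10% of frontier (roughly Flash-Lite vs Sonnet)
cost = cascade_cost(cheap, frontier, escalation_rate=0.20)

print(f"cascade cost as a fraction of baseline: {cost:.2f}")
print(f"cost reduction: {frontier / cost:.1f}x")
print(f"break-even cheap/frontier ratio: {break_even_cheap_ratio(0.20):.2f}")
```

Raising the escalation rate lowers the break-even ratio, which is why a cascade that looks comfortable at 20% escalation can stop paying off if the cheap tier misses more often than your evals predicted.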

18.4 Worked cost math: three archetypes

Abstractions are only useful if they attach to real numbers. Three archetypes cover most production Synoros-adjacent workloads. Each has a different tier architecture, and the cost-optimal answer is not the same.

18.4.1 Archetype 1: High-volume extraction

Workload. A million documents a day need structured-field extraction. You pull out the entities, classify them, format the result as JSON. Each document is short (under 2,000 input tokens), each answer is short (under 500 output tokens). Latency is not critical; accuracy on the schema fields is. Invoice triage. Incoming-email classification. Insurance-claim field extraction. In paper-summary terms, picture an arXiv ingestion pipeline that reads every new paper posted today and extracts the authors, affiliations, dataset names, and reported accuracy numbers into a structured record. At the rate arXiv publishes, that is a workload of this shape.

Naive approach. Route everything to Sonnet 4.6. At 2,000 input tokens and 500 output tokens per doc, per-doc cost is $0.006 + $0.0075 = $0.0135. A million documents a day is $13,500 a day. Annually, that is $4.9 million on a pipeline that is mostly filling in the same fields the same way a million times.

Tiered approach. Route 95% to Haiku 4.5, escalate 5% to Sonnet 4.6 on shape-validity failure or self-consistency miss. Per-Haiku-doc cost is $0.002 + $0.0025 = $0.0045. Per-Sonnet-escalation adds $0.0135. Effective per-doc cost is $0.0045 + 0.05 * $0.0135 = $0.0052. A million documents a day is $5,200 a day, or about $1.9 million a year.

Savings. About 2.6x on the total bill. Deployable today with no custom routing infrastructure, just a shape-validator and a retry. A rigorous eval pass on your specific document distribution will tell you whether the 95/5 split holds at your required quality level. In many production extraction workloads, Haiku handles 98% of the distribution and the split is even more favorable.
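The sensitivity to the escalation split is worth checking on your own numbers. The sketch below reruns the extraction arithmetic for a 5% and a 2% escalation rate, assuming the list prices quoted in the ladder and that every document pays for one Haiku pass before any escalation; the names are illustrative.

```python
# Dollars per million tokens (input, output), April 2026 list prices from the ladder.
PRICES = {"haiku": (1.00, 5.00), "sonnet": (3.00, 15.00)}

DOCS_PER_DAY = 1_000_000


def doc_cost(model: str, tokens_in: int = 2_000, tokens_out: int = 500) -> float:
    """Dollar cost of one extraction call on the given model."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def tiered_doc_cost(escalation_rate: float) -> float:
    """Every document pays Haiku once; escalated documents also pay Sonnet."""
    return doc_cost("haiku") + escalation_rate * doc_cost("sonnet")


naive_daily = doc_cost("sonnet") * DOCS_PER_DAY
for rate in (0.05, 0.02):
    tiered_daily = tiered_doc_cost(rate) * DOCS_PER_DAY
    print(f"escalation {rate:.0%}: ${tiered_daily:,.0f}/day, "
          f"{naive_daily / tiered_daily:.2f}x cheaper than all-Sonnet")
```

The sketch leaves out the second Haiku call that a self-consistency check would add, so treat its output as a lower bound on the tiered bill.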

Opinion. This is the workload where cost-tiered deployment pays off hardest and easiest. If your team is running a Sonnet-only extraction pipeline, the two-week engineering effort to cut it over to tiered Haiku/Sonnet pays for itself in the first month at any serious document volume.

18.4.2 Archetype 2: Interactive chat

Workload. A customer-facing conversational assistant. The hard constraint is the time before the user sees anything: time-to-first-token under 500 ms at p95. The medium constraint is total time to a complete answer: under 3 seconds. Each turn has 500 to 5,000 tokens of context (system prompt, short history, retrieved chunks) and generates 100 to 1,000 output tokens. The quality target is social rather than numeric: the user does not give up and try a competitor. In paper-summary terms, imagine a reading assistant sitting next to a user who is scrolling through a paper live: they highlight a section, they ask a question, and they want an answer before their eye has moved two lines further down the page. Half a second is the budget for that feeling.

Naive approach. Route everything to Sonnet 4.6 on Anthropic’s US cluster. Time-to-first-token is roughly 800 ms to 1.5 seconds at April 2026 observed latencies. That blows the 500 ms hard constraint. Cost is secondary; you have already lost on the primary UX contract before you ever see the bill.

Tiered approach. Primary path: Groq Llama 3.3 70B. First-token latency well under 200 ms, total generation under a second. Fall back to Sonnet 4.6 on confidence miss (a Haiku-scored judge rating the Groq output below threshold, or shape-validity failure on tool calls). Groq at $0.59 / $0.79 per million input/output is about 20% of Sonnet’s input rate and about 5% of its output rate. Escalation rate in measured deployments hovers around 10-15%.

Cost comparison. Assume 3,000 tokens in and 500 tokens out per turn on average. All-Sonnet: $0.009 + $0.0075 = $0.0165 per turn. Tiered with Groq primary and 12% Sonnet fallback: $0.00177 (Groq input) + $0.00395 (Groq output plus Haiku-judge overhead, rounded) + 0.12 * $0.0165 (Sonnet fallback) = about $0.0077 per turn. Roughly 2.1x cheaper, and a latency profile that ships.

Opinion. Interactive chat is where you pay for latency first, and cost savings come as a bonus. If you are running conversational Sonnet, you are almost certainly losing on user experience in a way that shows up as retention numbers rather than as a dashboard alert. Groq being cheaper is a second-order benefit. Reach for a latency-tier primary on every user-facing conversational path.

18.4.3 Archetype 3: Agentic task completion

Workload. A multi-turn agent with tool access, completing an open-ended task. Research. Code editing. Browsing-with-actions. Twenty to a hundred turns per completed task. Some turns need deep reasoning: decomposing the task at the start, interpreting a tool error, deciding between two strategies. Most turns are mechanical: summarizing a tool output, extracting a field, deciding which tool to call next. Total token usage per completed task runs from 50K up to 500K tokens across all those turns. In paper-summary terms: an agent that is handed not a single paper but a research question, and is asked to find the relevant papers, summarize each one, and synthesize a five-bullet answer across the whole corpus. That is dozens of turns, dozens of tool calls, and the quality of the final answer rides on the quality of each small decision along the way.

Naive approach. Route every turn to Opus 4.7. At an average of 200K tokens in and 50K out across the whole task (a mix of prefix-cache hits and genuine new context, see Section 18.5), total cost per task is roughly $1.00 + $1.25 = $2.25. A hundred agent tasks a day is $225 a day.

Tiered approach. Put Opus on the initial plan-generation turn and on the handful of turns a heuristic flags as genuinely hard. Put Sonnet on the middle-difficulty turns, which is most of them. Put Haiku on the lightweight “summarize the output of this tool call” or “extract the error message from this log” turns. The tier assignment is per-turn, not per-task.

A reasonable distribution on measured agent traces: 10% Opus, 60% Sonnet, 30% Haiku. At list prices, the same 250K-token task (200K in, 50K out) costs $2.25 on Opus, $1.35 on Sonnet, and $0.45 on Haiku, so the weighted cost is 0.1 * $2.25 + 0.6 * $1.35 + 0.3 * $0.45 = $1.17. About 1.9x cheaper than Opus-only, and anecdotally within noise of the Opus-only quality when the routing decisions are correct.
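The per-turn mix can be priced directly from the ladder. A minimal sketch under three simplifying assumptions of ours: list prices with no cache discount, a fixed 200K-in / 50K-out token budget per task, and each tier's share of tokens equal to its share of turns.

```python
# Dollars per million tokens (input, output), April 2026 list prices from the ladder.
PRICES = {"opus": (5.00, 25.00), "sonnet": (3.00, 15.00), "haiku": (1.00, 5.00)}

# Share of turns (and, by assumption, of tokens) routed to each tier.
MIX = {"opus": 0.10, "sonnet": 0.60, "haiku": 0.30}


def task_cost(model: str, tokens_in: int = 200_000, tokens_out: int = 50_000) -> float:
    """Dollar cost of running the whole task on a single model."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def blended_task_cost(mix: dict[str, float]) -> float:
    """Weighted cost when each tier handles its share of the task's tokens."""
    return sum(share * task_cost(model) for model, share in mix.items())


opus_only = task_cost("opus")
blended = blended_task_cost(MIX)
print(f"Opus-only: ${opus_only:.2f}  blended: ${blended:.2f}  "
      f"ratio: {opus_only / blended:.2f}x")
```

In real traces the hard turns tend to carry more tokens than the mechanical ones, so a token-weighted mix measured on your own agent logs will give a more honest number than the turn-weighted one used here.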

Opinion. Agentic workloads are the hardest to tier well because the quality of your tier-decision classifier becomes load-bearing. Get the 10/60/30 split wrong and you either blow the cost budget (too much Opus) or degrade the agent’s task success rate (too little). This is the archetype where a learned routing model pays for itself most, as Chapter 17 (Section 17.3) argues. On a hand-tuned heuristic router, expect to leave 20-30% of the possible savings on the table.

18.5 Hidden costs: caching, batching, context

The pricing pages are not the whole story. Three structural mechanisms move the effective price around by large fractions, and a team that is not using them is overpaying on the same pricing pages. These discounts are the reason two teams running identical-looking systems on the same model can pay wildly different bills. Our paper-summary task depends on each of them in a different way.

18.5.1 Prompt caching

The biggest lever available in April 2026 is prompt caching. Anthropic, OpenAI, and Google all offer it. The discount on cached input tokens is roughly 90% versus the uncached input rate (Anthropic 2026a; OpenAI 2026; Google 2026). The trick works when you can identify a stable prefix that repeats across many requests: a large system prompt, a long set of retrieved documents reused across a conversation, a shared tool schema block. Cache hits bill at the discounted rate. Cache misses bill at the normal input rate. Vendors define the cache’s time-to-live differently (minutes on OpenAI’s side, longer with explicit cache breakpoints on Anthropic’s), but the direction of the savings is universal.

Back to the paper-summary task for a concrete picture. Say you are running a reading-assistant feature where users upload a paper and then ask a series of questions about it. Without caching, every question the user asks re-sends the entire forty-page paper to the model, and you pay the full input rate for those roughly 20,000 tokens every time. With caching, the paper gets sent once at the normal rate, the next nine questions reuse it at the 90%-discount cached rate, and the difference on a user who asks ten questions in a session is roughly a fivefold reduction in input cost. The user experience is identical. The bill is not.

Put a bigger number on it. Suppose your system prompt is 4,000 tokens and it is fixed across 100,000 requests a day. Without caching, you pay $3 per million input tokens on Sonnet times 400 million system-prompt tokens a day, which is $1,200 a day just for the system prompt. With caching after the first hit, you pay roughly $120 a day for the same text. Over a year, that is the difference between $438,000 and $43,800, on one prompt fragment. That number is real money, and it is sitting behind a checkbox in most vendor SDKs.
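The system-prompt arithmetic can be reproduced in a few lines. This sketch assumes Sonnet's $3 per million input rate and a flat 90% discount on cached reads, with only the first request of the day paying full price; it ignores cache-write surcharges and time-to-live expiry, both of which vary by vendor.

```python
def daily_prefix_cost(prefix_tokens: int, requests_per_day: int,
                      input_price: float, cached: bool,
                      cached_discount: float = 0.90) -> float:
    """Daily dollars spent on a fixed prompt prefix; input_price is $/M tokens."""
    full = prefix_tokens * input_price / 1_000_000   # one uncached read of the prefix
    if not cached:
        return full * requests_per_day
    # First request fills the cache at full price; the rest read at the discount.
    return full + (requests_per_day - 1) * full * (1 - cached_discount)


uncached = daily_prefix_cost(4_000, 100_000, 3.00, cached=False)
with_cache = daily_prefix_cost(4_000, 100_000, 3.00, cached=True)

print(f"uncached: ${uncached:,.0f}/day, ${uncached * 365:,.0f}/year")
print(f"cached:   ${with_cache:,.0f}/day, ${with_cache * 365:,.0f}/year")
```

A short time-to-live or bursty traffic means more than one full-price read per day, so the cached figure is a floor, not a forecast.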

Systems that leave this on the table are usually doing one of two things wrong. Either they are not using caching at all (common, because the feature rolled out across vendors in 2024-25 and the developer ergonomics are not yet fully polished), or they are rebuilding the prompt fresh on every request and defeating the cache by accident (also common, when the prompt builder interpolates the current timestamp into the system prompt, or includes a per-request UUID, or reorders retrieved documents). Both failure modes look identical on the bill: full-rate charges with no cache-hit discount.

Prompt caching is table-stakes in April 2026. Any system that does not use it on repeated-prefix workloads is leaving 30 to 60% of its spend on the table. Audit your prompt-building code once a quarter to confirm the cacheable prefix is still stable, because the usual failure mode is someone adding a timestamp or a UUID and silently killing the cache hit rate.
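One way to make that audit mechanical is to split the prompt into a stable prefix and a volatile suffix, then fingerprint the prefix. The sketch below is an assumption-laden illustration, not any vendor's API: the system prompt text, the function names, and the choice to sort documents are all hypothetical, and sorting is only safe when document order carries no meaning for the task.

```python
import hashlib

# Hypothetical system prompt; in practice this is your real, fixed instruction block.
SYSTEM_PROMPT = "You summarize research papers into five bullet points."


def build_prompt(documents, question, request_id, timestamp):
    """Return (stable_prefix, volatile_suffix) for one request."""
    # Sort so that retrieval order cannot change the bytes of the prefix.
    ordered = sorted(documents)
    prefix = SYSTEM_PROMPT + "\n\n" + "\n\n".join(ordered)
    # Per-request values (IDs, timestamps) go after the cacheable prefix.
    suffix = f"[request {request_id} at {timestamp}]\n{question}"
    return prefix, suffix


def prefix_fingerprint(prefix):
    """Hash of the prefix; identical fingerprints mean identical prefix bytes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


prefix_a, _ = build_prompt(["paper B", "paper A"], "What is the main result?",
                           "req-1", "2026-04-01T09:00:00Z")
prefix_b, _ = build_prompt(["paper A", "paper B"], "What dataset was used?",
                           "req-2", "2026-04-01T09:05:00Z")

fingerprint_a = prefix_fingerprint(prefix_a)
fingerprint_b = prefix_fingerprint(prefix_b)
print(fingerprint_a == fingerprint_b)
```

Logging the fingerprint alongside each request makes a broken cache visible: if the number of distinct fingerprints per day jumps, something volatile has leaked into the prefix.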

18.5.2 Batch APIs

If your workload is latency-tolerant (offline eval, bulk classification, overnight embedding runs, scheduled report generation), the batch APIs offer a 50% discount across Anthropic, OpenAI, and Groq (Anthropic 2026a; OpenAI 2026; Groq 2026). You submit a batch of requests. The vendor returns results within 24 hours. The vendor’s economics work because they schedule your jobs onto excess capacity during low-demand windows. Yours work because you get Sonnet-quality output at Haiku-ish prices for work that does not need to be live.

Back to the paper-summary task. Live use cases (a user is waiting on a page) cannot use batch. But the back-office versions often can. Want to re-summarize your entire internal corpus of papers against a new taxonomy? That is a batch job; the user who eventually reads the output does not care whether the job finished at noon or at 4 a.m. Want to run an eval pass of 10,000 paper-summary examples against a new prompt version before shipping it? Batch. Any Synoros workload that runs on a schedule rather than a webhook should live on the batch API. The 50% discount is free money for latency you did not need.
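The live-versus-batch decision reduces to one question: can the consumer of the output wait for the batch turnaround? A minimal sketch, using the 50% discount and 24-hour window from the text. The Job type, its field names, and the example dollar figures are hypothetical.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50          # batch discount quoted in the text
BATCH_TURNAROUND_HOURS = 24    # results returned within 24 hours


@dataclass
class Job:
    name: str
    max_wait_hours: float   # how long the consumer of the output can wait
    live_cost_usd: float    # what the job would cost on the live API


def plan(job):
    """Return (mode, cost) where mode is 'batch' or 'live'."""
    if job.max_wait_hours >= BATCH_TURNAROUND_HOURS:
        return "batch", job.live_cost_usd * (1 - BATCH_DISCOUNT)
    return "live", job.live_cost_usd


jobs = [
    Job("user question on a page", max_wait_hours=0.01, live_cost_usd=0.02),
    Job("re-summarize corpus against new taxonomy", max_wait_hours=72, live_cost_usd=200.0),
    Job("eval pass on 10,000 examples", max_wait_hours=24, live_cost_usd=50.0),
]
for job in jobs:
    mode, cost = plan(job)
    print(f"{job.name}: {mode}, ${cost:.2f}")
```

Recording an explicit latency budget per workload is the useful part; once that field exists, the routing rule is trivial.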

18.5.3 Long context is priced at a premium

Context length is not free. Google prices Gemini 3.1 Pro at $2 / $12 per million tokens in/out under 200K context, and $4 / $18 above that threshold (Google 2026). Anthropic’s Sonnet 4.6 1M-window tier charges a similar premium above the default window (Anthropic 2026b). OpenAI’s pricing as of April 2026 has not introduced an explicit long-context tier, and in practice it charges the same headline rate regardless of window size (OpenAI 2026).

Context length is a cost lever, not a free capability. Once a prompt to Gemini Pro crosses the 200K threshold, the input rate doubles, from $2 to $4 per million tokens. To make that concrete with the paper-summary task: if a user uploads ten papers instead of one, and your pipeline naively dumps all ten into the model’s context window, you are paying the long-context tier rate on the whole combined bundle, and you are getting in return the Lost in the Middle effect we covered in Section 15.2, where the model pays attention to the beginning and end of the stuff you pasted in and skims the middle. You are paying a premium to give the model context it is partly ignoring.
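A minimal sketch of the threshold pricing, using the Gemini 3.1 Pro input rates quoted above ($2 per million tokens up to 200K, $4 above). It assumes the higher rate applies to every token in a prompt that exceeds the threshold, and it reuses the 100,000-tokens-per-paper figure from the caching example. Function and constant names are illustrative.

```python
THRESHOLD_TOKENS = 200_000
INPUT_RATE_BELOW = 2.00   # $ per million input tokens, at or under the threshold
INPUT_RATE_ABOVE = 4.00   # $ per million input tokens, over the threshold


def long_context_input_cost(prompt_tokens):
    """Input cost in dollars for a single prompt under threshold pricing.

    Assumption: the higher rate is charged on the whole prompt, not just the excess.
    """
    rate = INPUT_RATE_ABOVE if prompt_tokens > THRESHOLD_TOKENS else INPUT_RATE_BELOW
    return prompt_tokens / 1_000_000 * rate


PAPER_TOKENS = 100_000
one_bundle = long_context_input_cost(10 * PAPER_TOKENS)      # ten papers in one prompt
ten_calls = 10 * long_context_input_cost(PAPER_TOKENS)       # one paper per prompt

print(f"one 1M-token prompt: ${one_bundle:.2f}")
print(f"ten 100K-token prompts: ${ten_calls:.2f}")
```

Under these assumptions the bundled prompt costs twice as much in input tokens as ten separate prompts, before accounting for any quality loss from the overstuffed context.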

A RAG (retrieval-augmented generation) pipeline that dumps 500K tokens of retrieval into a 1M window pays the long-context premium on that request, and most of those tokens are ones the model will not effectively use. Better retrieval, tighter reranking, a narrower context window submitted to the model, and a cacheable prefix design are all cheaper than buying more context. On the paper-summary task specifically, running a two-stage pipeline (retrieve the three most relevant sections with a cheap search, then summarize only those sections with the model) nearly always beats a one-stage pipeline that pastes the whole paper in and hopes.
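The two-stage idea can be sketched with nothing more than keyword overlap as the "cheap search". This is a toy under stated assumptions: a real pipeline would likely use BM25 or embeddings for stage one, the section texts and query here are made up, and the model call in stage two is left out so that only the prompt construction is shown.

```python
import re


def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_sections(sections, query, k=3):
    """Stage one: rank sections by word overlap with the query, keep the top k."""
    query_words = words(query)
    ranked = sorted(sections, key=lambda s: len(query_words & words(s)), reverse=True)
    return ranked[:k]


def build_summary_prompt(sections, query, k=3):
    """Stage two input: only the selected sections go to the model."""
    chosen = top_sections(sections, query, k)
    return "Summarize in five bullet points:\n\n" + "\n\n".join(chosen)


sections = [
    "Introduction: motivation and background for the study.",
    "Methods: we train a transformer on protein sequences.",
    "Results: the transformer improves protein folding accuracy.",
    "Acknowledgements: thanks to our funders.",
    "Discussion: limits of protein folding accuracy estimates.",
]
chosen = top_sections(sections, "protein folding accuracy results")
prompt = build_summary_prompt(sections, "protein folding accuracy results")
print(prompt)
```

The prompt that reaches the model contains three sections instead of five; on a real forty-page paper the same selection step is what keeps the request out of the long-context tier.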

18.5.4 Token counting varies by tokenizer

A tokenizer splits text into tokens, and different vendors use different tokenizers, so the same text produces different token counts. A “one million token document” measured against Anthropic’s tokenizer is somewhere between 750,000 and 1.2 million tokens against OpenAI’s or Google’s. Cost estimates that ignore this skew can be 20% or more off in either direction, and the direction of the skew depends on which tokenizer the author happened to use to count. The mitigation is mechanical: always measure your tokens against the target vendor’s own tokenizer (Anthropic’s count_tokens endpoint, OpenAI’s tiktoken library, Google’s count_tokens API) before running the cost math.

This also affects routing math. A request that lands at 4,000 tokens on Anthropic’s tokenizer may land at 4,500 on OpenAI’s. When you compare Sonnet-vs-GPT-5.4-mini pricing, normalize to the same tokenizer (or to character count as a first approximation) before comparing dollars. For the paper-summary workload the effect is small per paper but it compounds when you multiply by a million papers a day. Doing this wrong can flip the cost-optimal choice. Doing it right is a half-day of engineering, once.
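A small worked example of how the comparison can flip. The token counts (4,000 and 4,500) come from the paragraph above; the $3.00 rate for vendor A matches the Sonnet input rate used earlier in this section, while the $2.80 rate for vendor B is a hypothetical placeholder chosen to show the effect, not a real price.

```python
def request_cost(tokens, rate_per_mtok):
    """Input cost in dollars for one request."""
    return tokens / 1_000_000 * rate_per_mtok


# The same request, counted by each vendor's own tokenizer.
TOKENS_A = 4_000
TOKENS_B = 4_500

RATE_A = 3.00   # $ per million input tokens
RATE_B = 2.80   # hypothetical placeholder rate, for illustration only

cost_a = request_cost(TOKENS_A, RATE_A)
naive_cost_b = request_cost(TOKENS_A, RATE_B)   # wrong: reuses vendor A's token count
true_cost_b = request_cost(TOKENS_B, RATE_B)    # right: vendor B's own token count

print(f"A: ${cost_a:.4f}  B (naive): ${naive_cost_b:.4f}  B (normalized): ${true_cost_b:.4f}")
```

With the naive count, vendor B looks cheaper; with each vendor's own count, vendor A is. The gap per request is a fraction of a cent, which is why it only shows up at volume.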

18.6 What to measure, and how often to measure it

Cost-tiered deployment is a moving target. The three loads that keep moving it:

Pricing. Vendor pricing changes on a quarterly-to-semiannual cadence. Anthropic reprices with every Claude generation. OpenAI reprices with every GPT generation. Google’s Gemini pricing moved twice during 2025. A cost architecture that was optimal in January 2026 is almost certainly no longer optimal in April 2026, and it will be less optimal by July. A pricing-snapshot file (pricing-snapshot-YYYY-MM-DD.md in the book repository) lives next to this chapter. Refresh it quarterly.

Model capability. The ceiling of each tier climbs over time. What Haiku 4.5 can do today is what the Sonnet tier could barely do eighteen months ago. The load-bearing consequence is that the tier depth required for a given workload erodes. Our paper-summary task is a clean illustration: two years ago it needed Sonnet. Today Haiku does it with minor quality loss. By late 2027 it will probably be handled adequately by Flash-Lite or by an open-weight 70B running locally. Revisit your routing decisions whenever a new model lands at a tier, not just when prices change.

Workload distribution. Your user base evolves, your product evolves, new features pull new query classes. The tier split that was 95/5 Haiku/Sonnet in Q1 may be 80/20 in Q3 because a new feature dropped harder queries into the distribution. Instrument your router, track the escalation rate as a first-class metric, and alert on rapid shifts.

The cadence recommendation: measure the escalation rate weekly, refresh the pricing snapshot quarterly, re-run the full tier-architecture decision annually or on any major model release. That is enough discipline to keep the cost math current without generating make-work.
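Tracking the escalation rate needs very little machinery. The sketch below assumes the router logs one boolean per request (True when the request was escalated to the higher tier) and alerts when the weekly rate moves by more than a chosen number of percentage points. The function names and the 5-point alert threshold are hypothetical; pick a threshold that matches your own traffic.

```python
def escalation_rate(escalated_flags):
    """Fraction of requests that were escalated; 0.0 for an empty log."""
    flags = list(escalated_flags)
    return sum(flags) / len(flags) if flags else 0.0


def should_alert(previous_rate, current_rate, max_shift=0.05):
    """True when the rate moved more than max_shift between two periods."""
    return abs(current_rate - previous_rate) > max_shift


# Toy logs: 5 of 100 requests escalated one week, 20 of 100 in a later week.
earlier_week = [True] * 5 + [False] * 95
later_week = [True] * 20 + [False] * 80

earlier_rate = escalation_rate(earlier_week)
later_rate = escalation_rate(later_week)
print(f"{earlier_rate:.0%} -> {later_rate:.0%}, alert: {should_alert(earlier_rate, later_rate)}")
```

A shift from a 95/5 split to an 80/20 split, as in the workload-distribution example above, trips the alert comfortably; small week-to-week noise does not.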

This chapter will be out of date in six months. That is not a flaw in the argument. It is a property of the subject matter. Every specific number we quote is stamped April 2026. The tier structure, the confidence-fallback logic, the caching/batching/context hidden-cost framing, and the three-archetype cost math will keep working long after the dollar figures drift. Treat the arithmetic as scaffolding and the numbers as load you re-weigh each quarter.

18.7 Putting the pieces together

Cost-tiered deployment is not a single decision. It is a discipline composed of four moves that reinforce each other:

Know your ladder. Six tiers, from local to frontier, each with a specific job. Pick the tiers you need, not all six.
Know your workload. Easy / medium / hard distribution, measured on real traffic, not assumed.
Route on confidence, not on hope. Output-shape validity, self-consistency, and in some cases judge-model scoring. Never log-prob alone.
Claim the hidden discounts. Prompt caching, batch APIs, careful context-length budgeting. On many workloads, these move more money than the tier-choice decisions do.

Do all four and the payoff on a typical Synoros-adjacent production workload is somewhere between 3x and 10x cost reduction versus a naive all-Sonnet baseline, at equivalent user-visible quality. Go back to the paper-summary task one last time. The reader who pastes a forty-page paper in and gets five bullet points back does not know which tier answered the question, whether the prompt was cached, whether the response came from a batch job or a live call. They get the same summary either way. The difference shows up on your side: on a seven-figure API bill, the difference between the naive and the tiered setup is a life-changing number for a small team. On a five-figure API bill, it is the difference between a feature that pays for itself and a feature that becomes an expense line. Either way, the arithmetic wants you to do the work.

Anthropic. 2026a. Claude API Pricing. Official pricing page, snapshot April 2026. https://platform.claude.com/docs/en/about-claude/pricing.

Anthropic. 2026b. Long Context Prompting and Context Management. Current product documentation. https://docs.anthropic.com/en/docs/build-with-claude/context-windows.

Chen, Lingjiao, Matei Zaharia, and James Zou. 2024. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” Transactions on Machine Learning Research. https://arxiv.org/abs/2305.05176.

Google. 2026. Gemini API Pricing. Official pricing page, snapshot April 2026. https://ai.google.dev/gemini-api/docs/pricing.

Groq. 2026. Groq on-Demand Pricing for Tokens-as-a-Service. Official pricing page, snapshot April 2026. https://groq.com/pricing.

Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Ying Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.

OpenAI. 2026. OpenAI API Pricing. Official pricing page, snapshot April 2026. https://openai.com/api/pricing/.

Rajput, Subhash et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, Llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.

Various 2025–2026 TCO analyses. 2026. Local LLM Vs Cloud API Cost Analyses (2025–2026 Industry Writeups). Collected industry writeups (SitePoint, Introl, DEV.to, etc.).

Yue, Murong, Jian Zhao, Min Zhang, Liang Du, and Ziyu Yao. 2024. Large Language Model Cascades with Mixture of Thoughts Representations for Cost-Efficient Reasoning. arXiv:2310.03094v3. https://arxiv.org/abs/2310.03094.