Part V — Bare Metal

Local vs. Cloud

You have a forty-page research paper open on your screen. You want five bullet points. You can paste the text into Claude, into Gemini, or into ChatGPT and get an answer back in two seconds. Or you can run it through a 9B open-weight model on a $349 card sitting in the desktop under your desk and get an answer back in roughly the same two seconds.

Which one is right?

The honest answer is not “local.” It is not “cloud.” It is “it depends on four things about your workload, and the hybrid pattern is eating most of the middle.”

Section 4.3 walked the physics. VRAM [the GPU’s on-card working memory, separate from the CPU’s RAM] sets a hard ceiling on which models fit at all. Memory bandwidth [how fast bytes move between that memory and the compute units] sets the decode ceiling [the rate at which the model can produce each new word once it is running]. Quantization [squeezing the model’s numbers into fewer bits so more of it fits and moves faster through the bandwidth bottleneck] is the lever that shifts both ceilings at once. Put those together and local hardware sits closer to viable than the industry narrative admits. The question the physics left open was this one: given a workload in front of you, how do you decide?

Take the paper-summary task and push on it for a minute. The forty pages are the same forty pages whether you paste them into an API or send them to a local model. But the context around the task changes everything about the answer. Are you doing this twice a week, or two hundred times an hour? Do you need the first bullet to appear in fifty milliseconds or is two seconds fine? Is the paper an arXiv preprint anyone can read, or a draft of an unreleased clinical trial? Are you the only person asking, or are two hundred analysts sharing one server? Four questions. Four axes.

One caveat up front, because it will save the reader from being sold numbers that do not apply to them. The published literature on local-versus-cloud economics is written for enterprise operators. Shi et al.’s 2025 cost-benefit paper (Shi et al. 2025) and Lenovo’s 2026 TCO [total cost of ownership] report (Lenovo Press 2026) both assume a team of engineers paid enterprise salaries and workloads that keep a cluster of H100s [Nvidia’s 2023 datacenter GPU, roughly $30,000 per card] running around the clock. That is not a solo developer with a $349 consumer card. That is not a small startup doing a few tens of requests per second. That is not a mid-sized team running a 70B model in a closet. The published numbers are not wrong. They are answering a question most readers of this book are not asking. Where we can, we say so, and we redo the arithmetic at the scale that applies.

21.1 The four axes of the decision

Here is the move that saves more time than any other move in this chapter. Before you argue about local versus cloud, write down four numbers for your workload. Not opinions. Numbers. The four numbers collapse most of the decision before you have said anything ideological.

Write them down now, for the paper-summary task as you do it:

Utilization. How many hours per day will the hardware be busy? A GPU pinned at full load twenty-four hours a day amortizes in months. A GPU that does thirty summaries at 9pm and sits idle the rest of the day amortizes slowly. The mistake here is reading peak load and planning for average. A workload that bursts to 100 requests per second for one hour a day and sits idle the other twenty-three hours has a different cost profile from one that runs steadily at ten requests per second. Cloud pricing scales with utilization because cloud is rent. Local pricing scales with duty cycle because local is ownership.

Think about how you would use the paper-summary tool. If it is a weekend habit for catching up on arXiv, you will use it twenty minutes a day and call it done. If it is bolted into a research-assistant product that a thousand users ping all day, the hardware never stops. Two completely different workloads, identical task.

Latency requirement. How long can the user wait? Specifically, first-token latency [the wait between sending the prompt and seeing the first word] and inter-token latency [the wait between each subsequent word as the response streams in]. Cloud APIs in April 2026 have a structural floor of roughly 300 to 800 milliseconds first-token for frontier models like Opus 4.6, Gemini 3.1 Pro, and GPT-5.4. Flash-tier models like Haiku 4.5 and Gemini 2.5 Flash sit in the 100 to 200 millisecond band. Those floors come from network round-trip plus prefill [the model reading through your whole prompt before it starts writing the reply] plus queueing inside the provider’s stack, and they are not going below the speed of light any time soon. Local inference on a consumer card can push first-token under 50 milliseconds for a 7B to 14B model, and the tail of the latency distribution tightens because you removed the open Internet from the critical path.

For the paper-summary task, two seconds versus fifty milliseconds is invisible to a human reading bullets. For a real-time voice agent, the same difference is the difference between a conversation and a phone tree.

Privacy and sovereignty. Can the tokens leave your perimeter at all? Some workloads are regulated: HIPAA for health data, GDPR Article 44 for EU personal data, ITAR for defense-related information. Some are contractually bound by a DPA [Data Processing Agreement, the contract that governs what a cloud vendor can and cannot do with your data] that forbids third-party sub-processors. Some are strategically sensitive: source code for unreleased products, internal legal deliberations, competitive intelligence. A cloud provider with a BAA [Business Associate Agreement, the HIPAA-specific contract that lets a vendor legally process health data] or DPA partially solves compliance. It does not solve sovereignty, which is the question of whether the bytes ever physically leave your machine. And the strongest form of this constraint, airgap [a machine with no network connection at all, physical or logical], is absolute: no network, no cloud, full stop.

The paper-summary task sits on both sides of this line. A forty-page arXiv preprint goes anywhere. A forty-page draft of your employer’s unreleased clinical trial does not. Same task shape. Different rules.

Concurrency shape. Single user at a time? Many users sharing a box? Batch jobs? Single-user decode at batch 1 [one request running through the GPU at a time] is the workload consumer hardware handles elegantly. The bandwidth-bound arithmetic from Section 4.3 makes a $349 card at 320 GB/s running a 4-bit 9B model feel responsive to one person. Multi-tenant serving at high queries-per-second amortizes the bandwidth of one GPU across many requests through continuous batching [a serving technique that lets new requests slot into the same GPU pass as existing ones, so a hundred users can share the same forward pass] (Kwon et al. 2023; Kwon 2025). That is where datacenter GPUs (H100, H200, MI300X [AMD’s 2024 datacenter counterpart to the H100]) earn their price. Their bandwidth runs three to ten times a consumer card’s, and the batching window to amortize across is much wider. If you need to serve two hundred concurrent users against a strict latency SLA [service-level agreement, the contractual bound on how slow a response can be before you are in breach], consumer hardware is the wrong answer even when the per-user workload would fit.

Two hundred analysts all asking for a paper summary at 9am Monday morning is a different workload from one analyst asking once an hour, even if each individual request is identical.

These four axes compose. High utilization plus single-user plus privacy-sensitive tilts hard toward local. Bursty plus multi-tenant plus general-purpose tilts hard toward cloud. Everything in the middle is where the hybrid pattern lives, and the hybrid pattern is where most production systems ended up. Section 21.5 picks that up.

Name the four axes before arguing about the answer. A conversation about “should we run this locally” that has not specified utilization, latency, privacy, and concurrency is not a technical argument. It is a preference argument. Write the four values down for your own workload and most of the decision writes itself.

21.2 The cost-curve crossover

Given the four axes, the cost math is clean in shape and treacherous in the numbers you plug in. The break-even is:

(amortized hardware cost + electricity + operator time) per unit of work, compared to (API price per token) times (tokens consumed).

Everything interesting is in the definitions of the terms.

21.2.1 The enterprise picture

Shi et al.’s 2025 paper (Shi et al. 2025) is the first peer-reviewed treatment of the on-premise break-even question. Their model covers an eight-GPU H100 cluster running a 70B-class model at typical enterprise utilization. The headline number, subject to their assumptions, is a break-even against commercial APIs at roughly 8,556 hours of continuous utilization. That is about twelve months of non-stop load. Beneath the number sits a long list of assumptions: enterprise-grade power and cooling, depreciation over a specific horizon, a specific ratio of input to output tokens, specific commercial API prices frozen at a point in time. Change the assumptions and the break-even moves.

Lenovo’s 2026 TCO report (Lenovo Press 2026) runs a broader scenario model with similar structure and similar assumption sensitivity. The vendor-ness of the source shows, Lenovo sells on-premise hardware and has an interest in the answer, but the methodology is disclosed and the numbers are checkable. The honest read: for a real enterprise with a real datacenter, on-premise inference pays back in roughly a year of sustained load, after which the hardware keeps producing tokens at marginal cost [the incremental cost of generating one more token, once the hardware is paid for]. Enterprise operators running this math today are not wrong.

21.2.2 The solo-operator picture

The enterprise picture does not describe most readers of this book.

If you are one developer, or a team of three, or a pre-revenue startup deciding whether to build against an API or buy a consumer card, Shi and Lenovo are answering a different question. The headcount assumption alone breaks for you. A solo operator does not amortize their own engineering time at $800,000 a year, and the ROCm [AMD’s GPU-compute software stack, the open counterpart to Nvidia’s CUDA] or CUDA [Nvidia’s GPU-compute software stack, the industry default] debugging session that burns a Saturday afternoon costs you differently than it costs a cloud provider with a platform team.

Let us run the arithmetic for a solo operator on the paper-summary task, concretely. Take a $349 consumer card, the AMD RX 9060 XT we use as the working example throughout this book (Section 23.4 has the wider landscape). Add $200 to $400 for power supply, cooling, and electricity amortization. Call the all-in number $600, paid once. Now run the paper-summary task and see how fast that $600 pays itself back.

Forty pages of paper is roughly 20,000 input tokens. A five-bullet summary is a few hundred output tokens. Call it 25,000 tokens round-trip per summary.

At April 2026 flash-tier prices, Gemini 2.5 Flash runs $0.15 per million input tokens and $0.60 per million output tokens (Google 2026). One paper summary on Flash costs roughly four-tenths of a cent. A hundred summaries a day costs forty cents. A thousand summaries a day costs four dollars.

Watch what the $600 card does to those numbers:

  • One summary a week, fifty-two a year: Flash costs you twenty cents a year. The card never pays back. Ever. Cloud wins forever.
  • Ten summaries a day: Flash costs four cents a day, fifteen dollars a year. The card takes forty years. Cloud wins.
  • A hundred summaries a day: Flash costs forty cents a day, a hundred fifty dollars a year. The card takes four years. Cloud still wins for most people.
  • A thousand summaries a day, because you bolted the task into a product: Flash costs four dollars a day, fifteen hundred dollars a year. The card pays back in five months. Local wins, cleanly.

That is the curve. For light personal use, cloud is cheaper than local and it always will be. For heavy daily work, especially work you have bolted into a product, the card pays back in a few months and keeps producing tokens for free afterwards. The dividing line is not whether local is viable; it is how hard you push the hardware. Appendix 3 carries the worked numbers across nine paths (local-on-owned-card, local-on-rented-A5000, four cloud tiers from Gemini 2.5 Flash through Claude Opus 4.7) for the standard paper-summary task.

21.2.3 The variable nobody wants to price

One term in the break-even calculation is almost always omitted and almost always dominant: operator time.

A $1,500 GPU amortizes in eighteen months of hobby use. An engineer spending a weekend debugging ROCm kernel crashes amortizes much slower, especially when that engineer is you, and especially when the workload cares about reliability. The cloud API hides this cost completely. Someone else is paid to debug the kernel, and you will never meet them. The local setup exposes it. You are the person paid to debug it, and the pay is not great.

This is not a knock on local. The book’s thesis is that the local path is viable and under-served, and we spend Section 4.4 through Section 23.4 on the specifics. But the operator-time term is real, and the honest version of “local pays back” is: local pays back when your workload is substantial enough to make the hardware sweat, your hardware stack is stable enough that you are not debugging it weekly, and the capabilities you need fit in the 7B to 70B open-weight range. All three conditions matter. If any one of them is missing, the cloud math wins, because the cloud is paying an engineer somewhere to solve the problem you would otherwise be solving on Saturday.

Three operator scales, three different math problems. A solo developer, a small team of 10 to 50 engineers, and an enterprise with a platform team are not running the same break-even calculation. Published TCO studies, with one or two exceptions, implicitly model the third. If you are in the first or second bucket, treat those studies as a rough upper bound on the cost of being wrong, not as your own spreadsheet.

21.3 Latency and the real-time constraint

Cost is one reason to pick local. Latency is the other, and for some workloads it is the only reason that matters, because no amount of cloud budget makes a round-trip to a datacenter faster than the speed of light.

Here is the strangeness to sit with before we look at numbers. The top frontier models in 2026 can do things the local models cannot. But they live in a data center in Oregon or Virginia. To ask one anything, your prompt has to travel from your laptop through your router, through your ISP, across the continent, through load balancers, into a queue, onto a GPU, through prefill, out the other side, and all the way back. Light is fast but it is not instant. The round trip has a floor. For the paper-summary task, the floor is invisible. For a voice agent waiting for you to finish a sentence, the floor is the whole problem.

21.3.1 The cloud floor

Cloud-API latency has a structural floor made of network round-trip (typically 20 to 80 milliseconds depending on geography), TLS and HTTP overhead (a few milliseconds per hop), queueing inside the provider’s serving stack (variable, bounded below by scheduling granularity), prefill over your prompt (scales with prompt length, compute-bound per Section 4.3), and the first decode step. For frontier models in April 2026, Artificial Analysis’s latency tracker [an independent service that measures and publishes provider-by-provider latency and price numbers, refreshed weekly] (Artificial Analysis 2026) puts the median first-token latency in the 300 to 800 millisecond range, with the tail reaching into seconds when providers are under load. Flash-tier models sit in the 100 to 200 millisecond band at the median, with a tighter tail.

Those are the good numbers. During the periodic provider outages, capacity incidents, and regional-failover events that any operator running against cloud APIs for more than a year has seen, the tail blows out by orders of magnitude or the service returns errors. Cloud APIs are not unreliable in absolute terms. They are unreliable as a dependency you can build hard latency guarantees against. Sagan’s first check in the Baloney Detection Kit is “look for independent confirmation of the facts” (Sagan 1995). The providers publish their own latency numbers; Artificial Analysis measures from the outside and publishes different ones. Believe the outside measurement.

21.3.2 The local floor

Local inference on a consumer card removes network latency entirely. The floor becomes prefill plus the first decode step, governed by the physics from Section 4.3. For a 7B to 14B model at 4-bit on a card like the 9060 XT, first-token latency sits in the 30 to 100 millisecond range for typical prompt lengths. Inter-token latency is set by the bandwidth-bound decode rate, roughly 30 to 70 tokens per second for this model class on this card, so each subsequent token arrives every 15 to 35 milliseconds.

Those numbers are comparable to cloud on first-token and better on tail reliability. Rajput et al.’s 2025 Apple Silicon measurement (Rajput et al. 2025) puts the numbers even better on the unified-memory [CPU and GPU share the same RAM, so there is no copy step between them] path: MLX [Apple’s array and machine-learning framework for Apple Silicon] on an M2 Ultra reaches around 230 tokens per second on small models, faster than many cloud flash-tier APIs deliver in practice.

Run the paper-summary task through a local model at those numbers. Twenty thousand input tokens take roughly two seconds to prefill. The first bullet arrives in another fifty milliseconds. The five-bullet summary streams out over the next three to five seconds. Two hundred milliseconds faster than cloud on first-token is invisible. But what matters is the floor: the local path does not have a tail. It does not spike to six seconds during a capacity incident in us-east-1. It does not fail because a load balancer fell over in a region you have never heard of.

21.3.3 Workloads where latency is the answer

Three classes of workload care about latency more than any other axis.

Real-time voice. A push-to-talk or open-microphone voice agent needs sub-200-millisecond response for the interaction to feel conversational. At 300 to 800 millisecond cloud first-token, every exchange has an awkward pause. At sub-100-millisecond local first-token, the conversation flows. This is not a cost optimization. It is the difference between a voice agent that feels like a person and one that feels like a phone tree.

Coding auto-complete and live assistance. When the user is typing and expecting completions, every 100 milliseconds of latency translates directly to perceived sluggishness. Copilot-style products run against cloud APIs because the capability-to-latency ratio has been favorable enough. But the local version, Qwen 3.5 Coder or DeepSeek-Coder running on a local 16 to 24 GB card, beats the cloud version on latency cleanly, and the capability gap for routine code completion has mostly closed.

Embedded or edge deployment. Hardware with intermittent or zero network connectivity (field devices, airgapped facilities, moving vehicles) cannot use cloud APIs at all. The latency consideration merges with the sovereignty one in Section 21.4.

None of this says cloud latency is “bad.” For a workload where a two-second wait is fine, cloud is fine. The paper-summary task is exactly that workload, and nobody should buy a GPU for it alone. For a workload where 100 milliseconds feels sluggish, local is often the only viable answer.

21.4 Privacy, airgap, and regulatory constraints

Some workloads cannot send tokens to a third party. Not “preferably should not.” Cannot.

Take the paper-summary task one more time and change one variable: the paper. Same task. Same five bullets. Same tokens, roughly. But now the paper is the draft of an unreleased clinical trial your employer is filing with the FDA next month. Or a pending patent application. Or a set of internal M&A deliberations a court would eventually subpoena. The task is identical. The rules are not. Pasting those forty pages into a cloud API is a career-ending mistake in one world and a ten-second convenience in the other.

The categories where the tokens cannot leave:

Regulated data. HIPAA-covered health information. GDPR-covered EU personal data, particularly where Article 44 restricts transfers to third countries. ITAR and EAR export-controlled technical data. Attorney-client-privileged legal documents. Information covered by a contract-specific DPA that forbids sub-processors. Cloud providers address the compliance layer through BAAs, DPAs, regional residency options, and customer-managed keys. Those address the legal question partially. They do not address the technical question of whether the data stays on your hardware.

Sovereign or classified. Defense contractors, intelligence agencies, national health systems, critical-infrastructure operators. Third-party inference is either forbidden outright or allowed only within a narrow set of accredited providers. The question is not cost or latency. The question is whether inference is permitted at all.

Commercially sensitive without a regulation. Internal source code for unreleased products. M&A deliberations. Competitive intelligence. Personnel matters. Legal may not forbid sending this to a cloud API, but a security review will, and a security review that has not happened is a mistake waiting to happen. Even with best provider logging and retention controls, the provider’s employees, subprocessors, and incident-response processes are all trust surfaces you did not have when the bytes stayed on your machine.

Airgap. The strongest form. No network. Ever. Medical imaging workstations. Industrial control systems. Intelligence-community facilities. Some financial trading environments. The inference happens locally because there is no “cloud” to reach.

For these workloads, the question is not “is cloud cheaper.” The question is “is cloud legal and auditable.” Often the answer is no, and the local-versus-cloud decision becomes local-or-nothing. The Synoros platform takes the strict version of this as a first-class operating model. Some tenants have data that cannot leave the perimeter, period, and the system is designed around that constraint rather than treating it as an exception. We do not survey which cloud providers meet which compliance regimes here. That is a moving target no book can track accurately, and the strongest signal is to read your own regulator’s guidance and the provider’s current sub-processor list, not a chapter written a year ago.

One nuance. “Local” for privacy purposes means the inference runs on hardware you control. That does not mean the model weights have to be private, and it does not mean you have to give up capable models. Qwen 3.5, Llama 4, DeepSeek V3, and Mistral Small 3 are all open-weight models you can download, run on your own hardware, and point at your own data without a single token leaving the perimeter. This is the single biggest reason open weights matter strategically. They open the privacy-constrained workload class that closed APIs cannot serve, at any price.

For some workloads, “is local viable” is not the right question. The question is “is cloud legal for this workload at all.” If the answer is no, the local economics and latency stop mattering. You are running local, and the rest of Part V is about how to do it well.

21.5 The hybrid pattern

Most real production systems in 2025 and 2026 are neither purely local nor purely cloud. They are hybrid, and the hybrid pattern has converged to a recognizable shape.

The shape is this. A router [a small program that sits in front of the model layer and decides which backend handles each incoming request] classifies every incoming request by some combination of sensitivity, required capability, latency tolerance, and cost. Low-complexity, high-volume, or privacy-sensitive requests go to a local model. Requests that require frontier-model capability (hard reasoning, long-horizon planning, out-of-domain knowledge, rare languages) go to a cloud API. A fallback path routes cloud-bound requests to local if the cloud is unavailable, and the reverse if the local hardware is saturated.

Here is what that looks like for the paper-summary task inside a real product. A user uploads a forty-page paper and asks for five bullets. The router inspects the request: short prompt, short requested output, no tool calls, not flagged as sensitive. It sends the job to a local 9B model because the local model can handle it, because local is faster on first-token, and because it costs the product nothing per request. Five seconds later, the same user asks: “Now compare this paper’s claims against the three studies it cites.” The router inspects the new request: needs retrieval over the web, needs careful multi-step comparison, needs tool calls. It sends this one to Claude Opus 4.6 because Opus is better at the comparison and the tool-calling overhead is worth the per-request cost. The user sees one assistant. Two different models did the work. The user does not know and does not need to know.

The implementation anchor on the local side is an OpenAI-compatible serving endpoint [a web API that accepts the same request shape as OpenAI’s servers, so any client written to talk to OpenAI works against it unchanged]. vLLM (Kwon 2025) exposes this by default, llama.cpp’s server exposes it, Ollama exposes it, MLX-based serving exposes it. On the cloud side, the frontier providers all ship OpenAI-compatible endpoints or can be adapted with a few lines of glue. A router like litellm, costa-router, or a hand-rolled dispatcher needs to decide only one thing per request: which backend serves it. Because the protocol is the same on both sides, the dispatcher is small and the swap is trivial.

MCP, the Model Context Protocol [a standard that describes the tools a model can call and the shape of their inputs and outputs, adopted across the major frontier providers through late 2025], makes the tool-calling layer routable across backends too. A tool defined once can be called by a local model or a cloud model without reimplementing it on each side. This is the piece that matters for agent-shaped workloads, where the model is not just generating text but calling out to tools, and the tool layer needs to work identically across the local-cloud split.

The argument for the hybrid pattern is not ideological. It captures the best of both sides for a workload with mixed characteristics, which is what most real workloads are. Most requests in a production system are easy, high-volume, and either private or cheap enough to run locally. A small fraction are hard enough to need a frontier model and worth paying for. Routing those fractions separately costs less per token on average than cloud-only, delivers lower latency on average than cloud-only, and falls back cleanly when any single backend has a bad day.

Three caveats on the pattern.

First, it is practitioner consensus, not peer-reviewed consensus. No published paper we know of formally characterizes the hybrid pattern’s economics at the scale we describe. Treat the framing as “what production systems are converging to in 2026” rather than “what the literature has proven optimal.” Sagan’s check: who benefits from the claim being true? We do. That is a reason to hold the claim with some skepticism. Check it against your own system when you measure.

Second, the router adds latency. A classification step in front of the model layer, even a cheap one, is another 5 to 50 milliseconds. For the paper-summary task this is invisible. For real-time voice it can matter. Implementations that care about latency either run the classifier cheaply (regex, tiny embedding model) or do speculative routing, which means sending the request to both backends in parallel and taking whichever returns first. Useful rarely.

Third, the complexity cost is real. Two model backends are more code to maintain than one, and the routing decision is another place bugs hide. The hybrid pattern earns its complexity for medium-to-large systems. For a single-user solo-operator setup, it is often simpler to pick one side and stick with it.

The dominant production pattern in 2026 is hybrid: local for volume and privacy, cloud for frontier capability, and an OpenAI-compatible router between them. This is what most teams with real workloads have converged to, independently, because it is what falls out of taking the four axes seriously.

21.6 When local loses: the honest concession

This book is partial to the local path, and we have said so. The partiality is not unconditional. There are workloads where local loses today, and a pluralist treatment has to name them instead of quietly walking past them.

Frontier capability. As of April 2026, the best open-weight models, Qwen 3.5 at 70B or larger, Llama 4 at frontier size, DeepSeek V3, Mistral Large, are genuinely close to frontier closed models on most general-purpose tasks. “Close” is not “equal.” The AA-Omniscience benchmark (Artificial Analysis 2025) puts Gemini 3.1 Pro at a top score of 33 out of 100 on a knowledge-reliability test that penalizes wrong answers, and open models below 9B parameters score well below the frontier tier. Run the paper-summary task on the local 9B and you get five good bullets. Run “compare this paper’s novel claim to the three closest papers in the literature it does not cite” on the local 9B and it confabulates references that do not exist. Run the same comparison on Opus 4.6 or Gemini 3.1 Pro and you get real references, usually correct. The honest version of “local is viable” is viable for a large and growing fraction of tasks, not all tasks.

Multi-tenant serving at high volume. Two hundred concurrent users with strict P95 latency SLAs [P95 meaning the slowest 5% of requests must still meet the latency bound] is where datacenter hardware earns its price. A cluster of H100s or MI300X with continuous batching and PagedAttention [a memory-management technique that lets a serving stack pack more concurrent requests onto one GPU than naive memory allocation allows] will beat an equivalent dollar-spend of consumer cards. The bandwidth ratio and the batching amortization are both on the datacenter side. Consumer cards at 320 to 1,800 GB/s of bandwidth are strong for single-user and small-batch workloads. They are not the right shape for high-QPS [queries per second] multi-tenant serving. This is not a statement about open versus closed weights. A 70B open-weight model served to hundreds of users at once benefits from datacenter hardware just as much as a closed model does. It is a statement about where the economics of dense matmul [matrix multiplication, the single operation that dominates LLM compute] fall out on different hardware.

Rapidly-changing workloads. Cloud elasticity is real. If your request rate swings between zero and peak by 100x in a day and you cannot predict the peak, building for peak locally wastes money most of the time, and building for median locally drops requests at peak. Cloud APIs rent you capacity on demand and do the elasticity work for you. Workloads where you cannot characterize the load shape are cloud workloads until you can.

Unknown load shape. The special case of the previous point. If you do not yet know what your workload looks like, because the product is new, or the users are new, or the feature is untested, start on cloud and measure. You defer the hardware commitment until you know what hardware you need. Migrating hot paths to local after measurement is a straightforward move once the shape is clear. Committing to local before you know the shape is how you end up with a shelf of underused GPUs.

Frontier-specific capabilities. Some frontier-model features, native long-context (1M+ tokens on Gemini 2.5 Pro), sophisticated tool-use agents trained on proprietary agent data, multi-modal capabilities at the edge of what the state of the art can do, do not have open-weight equivalents yet. If your workload depends on one of these specifically, local is not an option for that specific piece, and the hybrid pattern from Section 21.5 is the compromise: local for everything else, cloud for the frontier-specific piece.

None of these is a reason to dismiss local wholesale. They are reasons to characterize your workload honestly before picking a side. A book that sold local as a universal answer would be lying. This one is not going to.

21.7 The decision in one page

Pulling the threads together, the fast version of the decision procedure looks like this.

Write down the four axes for your workload. Utilization (hours per day of real load). Latency target (first-token and inter-token tolerance). Privacy constraint (can tokens leave your perimeter). Concurrency shape (single-user, multi-tenant, batch). Four numbers. The four numbers narrow the field dramatically.

If the privacy constraint is strict (HIPAA, GDPR Article 44, ITAR, airgap, strategic sensitivity without a compliance-approved cloud path), you are running local. Skip the rest of the procedure and go to Section 23.4 to size the hardware. The paper-summary task on a draft clinical trial lands here, same as the paper-summary task on an unreleased patent.

If the latency target is tight (sub-200-millisecond first-token for voice, coding assist, real-time interaction), local has a decisive advantage you cannot buy around. Unless the capability requirement forces cloud, run local.

If the capability requirement is frontier-specific (hardest reasoning, rare languages, long context, a specific closed-model feature with no open equivalent), and the volume is low enough that API spend is tolerable, run cloud. For volume, layer the hybrid pattern on top: local for the volume, cloud for the frontier-specific pieces.

If the concurrency shape is high-QPS multi-tenant with strict SLAs, and you do not have a datacenter, run cloud. Reconsider when your volume grows to the point where operating your own datacenter is in scope, which for most operators is never.

If none of the above forces a side, the cost calculation decides, and the cost calculation is workload-specific enough that the only honest answer is to measure. For a solo operator with a consumer card, measure against your real volume at April 2026 prices, the way we did for the paper-summary task above. For a small team, measure against your volume plus your operator-time budget. For an enterprise, the Shi and Lenovo papers are the starting point, subject to the caveat that their assumptions are load-bearing and probably not yours.

The hybrid pattern absorbs the gray zone. If you cannot decide, start cloud (lowest friction, no commitment), measure for a quarter, then migrate the hot paths to local when the shape is clear. This is how most production systems we have seen got here, including Synoros’s.

The local-versus-cloud decision is a workload characterization exercise, not an ideology. If you have written down the four axes and the answer is still not obvious, the hybrid pattern is almost always the right compromise, and measurement is almost always the missing input. Measure, route, reconsider. That is the whole game.

The through-line of Part V is that local inference in 2026 is a real option for a large and honestly-bounded class of workloads, the tooling has matured to the point where operator effort is manageable, and the economics favor ownership for anyone with sustained load. None of this means cloud is wrong. It means the choice is not binary and the crossover points moved.

Artificial Analysis. 2025. AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. https://artificialanalysis.ai/evaluations/omniscience.
Artificial Analysis. 2026. LLM Pricing and Latency Tracker. Updated continuously. https://artificialanalysis.ai/.
Google. 2026. Gemini API Pricing. Current product documentation. https://ai.google.dev/pricing.
Kwon, Woosuk. 2025. vLLM: An Efficient Inference Engine for Large Language Models. EECS-2025-192. UC Berkeley EECS. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.pdf.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.
Lenovo Press. 2026. On-Premise Vs Cloud: Generative AI Total Cost of Ownership (2026 Edition). Lenovo Press report LP2368. https://lenovopress.lenovo.com/lp2368.pdf.
Rajput, Subhash et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, Llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.
Sagan, Carl. 1995. The Demon-Haunted World: Science as a Candle in the Dark. Random House.
Shi, Jiawei et al. 2025. A Cost-Benefit Analysis of on-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services. arXiv:2509.18101. https://arxiv.org/abs/2509.18101.