Part I — The Reality

The Physical Model

An LLM is math, and that math has to run on a piece of metal somewhere, and the metal turns out to decide almost everything you care about.

The previous chapter (Section 3.1) told you what an LLM does. It reads a sequence of tokens, it outputs a probability distribution over what comes next, and its behavior is set by a pile of learned numbers called weights. That is the cognitive half of the object. This chapter is the physical half. It is about the metal the math runs on, and how the shape of that metal pushes back on everything else.

The math itself is simple to describe. Most of it is matrix multiplication [grid-by-grid number crunching, multiply one rectangle of numbers by another, sum each cell], with a handful of elementwise operations [operations that touch each number on its own, no cross-talk between them] thrown in between. Repeat the whole stack layer after layer, for every token you produce. “Inference” [running the model once on a prompt to produce output] is the name for doing that. When you press enter on a prompt, several trillion floating-point operations happen on a piece of hardware somewhere, a datacenter GPU, a consumer card under your desk, an Apple Silicon chip in a laptop, or a custom dataflow accelerator [a chip designed so data flows through dedicated hardware stages instead of being shuttled back and forth to general-purpose registers] built by a company you have not heard of.

Most of what operators get wrong about inference traces back to not having a clear picture of the physical side. How much VRAM [video memory, the fast local memory sitting next to a GPU’s compute units] a model “really” needs. Why a smaller model can feel slower than a larger one. Why quantization [storing the model’s weights in fewer bits per number, trading a little accuracy for a lot of memory] works. Why local inference keeps getting more viable while the industry narrative says it is impossible. Once you know which operations are cheap, which are expensive, and what “expensive” means on modern hardware, the mysteries collapse. Quantization stops being magic. VRAM sizing stops being folklore. The choice between local and cloud stops being a vibe and starts being a calculation.

Come back, for one paragraph, to the forty-page paper you want summarized. That task has been the running example since Chapter 1. Prompting changes how you ask. Retrieval lets the model pull the paper’s references out of the open web. Evals compare models on the same summary. All of that happens above the metal. This chapter is below it. The question now is a simple one. When you paste the paper in and ask for five bullet points, what actually happens on the silicon, and why does it take the shape it does?

4.1 GPUs, and why

Transformer inference is almost nothing but dense matrix multiplication. Every layer in the model computes three projections per token (the query, the key, and the value). That is three matrix multiplies per attention head group. The feed-forward network in each block is two more large matmuls with a nonlinearity in between. The final step, which turns the model’s internal state into scores for every possible next word, is one more matmul against a matrix shaped [hidden, vocab_size], where the second dimension is about 100,000 wide. Everything else the model does, the layer norms, the softmax, the residual adds, the nonlinearity itself, rounds to zero in the compute budget. A transformer is a matmul machine with a few seasonings.

That matters because GPUs are matmul machines in a way CPUs are not.

A modern CPU, when everything goes right, sustains a few hundred GFLOPS [billion floating-point operations per second] on dense FP32 [32-bit floating-point numbers, the traditional default for scientific computing]. A modern GPU of the same vintage, an RTX 4090 or a datacenter H100, does tens of TFLOPS [trillion floating-point operations per second] on the same operation. With tensor cores engaged at lower precision, it does hundreds. The gap is not 2x. It is often 50x to 100x. GPUs hit over 90% of their theoretical peak on a well-shaped dense matmul. CPUs rarely clear 20% of theirs on the same kernel, because most of a CPU’s silicon is spent on branch prediction, out-of-order execution, and large caches built for latency, not throughput. A CPU is a surgeon. A GPU is a conveyor belt. The math an LLM runs is conveyor-belt math.

The transformer was designed for this hardware profile. Its predecessor, the recurrent neural network [a network where each step’s output feeds into the next step’s input, chaining forward one token at a time], had a hard sequential dependency built into its shape: the hidden state at step t needed the hidden state at step t-1, which pinned everything to one core and made GPUs mostly useless for generation. The 2017 Attention Is All You Need paper (Vaswani et al. 2017) swapped that recurrence for self-attention, which is a batched matrix multiplication over all positions at once. That single architectural choice is a large part of why transformers could train to scales RNNs never reached. The math mapped cleanly onto GPU compute, and GPUs kept getting bigger.

Tensor cores took this further. Starting with NVIDIA’s Volta chips in 2017 and growing steadily denser each generation, GPUs added dedicated hardware for low-precision matrix multiplication: first FP16, then BF16, then INT8, now FP8 and FP4 on Blackwell. Each precision step roughly doubles throughput and halves memory traffic for the same operation, at some measurable but manageable accuracy cost. We come back to this in Section 4.4 and to the full treatment in Chapter 22. The tensor-core path is how a frontier model served in a datacenter pushes hundreds of tokens per second per GPU. It runs the hot matmuls at FP8 or lower while keeping a few sensitive operations in higher precision.

Back to the paper. When you paste your forty pages in, every token in that paper turns into a long list of numbers, those numbers flow into a stack of matrix multiplications, and the shape of the hardware the multiplications run on is what sets how long you wait for the first bullet to appear. A CPU could do the same math in principle. You would be waiting long enough to read the paper yourself.

The framework underneath all of this is the roofline model (Williams et al. 2009). A roofline plot charts the performance a kernel can reach against its arithmetic intensity [the ratio of floating-point operations performed to bytes moved through memory]. Operations with high arithmetic intensity do a lot of math per byte read. Large matmuls are in this category. Their ceiling is the flat part of the roof, set by FLOPS. Operations with low arithmetic intensity do little math per byte read. Elementwise ops and attention during token-by-token generation live here. Their ceiling is the slanted part of the roof, set by memory bandwidth. Where a kernel sits on the roofline tells you what making it faster would require. Compute-bound kernels want more FLOPS or lower precision. Memory-bound kernels want more bandwidth or less data movement. We come back to this lens several times in the chapter.
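The roofline itself is one line of arithmetic: the ceiling is whichever is lower, the compute roof or intensity times bandwidth. Here is a minimal sketch, assuming an illustrative card with 25 TFLOPS of FP16 and 320 GB/s of bandwidth (the same figures Section 4.3.2 uses). The function names are mine, not from any library.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline ceiling in FLOP/s for a kernel with the given FLOP/byte intensity."""
    # Slanted roof: intensity * bandwidth. Flat roof: peak_flops.
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOP/byte) at which the slanted roof meets the flat roof."""
    return peak_flops / bandwidth


# Illustrative card, not a measurement: 25 TFLOPS FP16, 320 GB/s.
PEAK_FLOPS = 25e12
BANDWIDTH = 320e9

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
for intensity in (1, 10, 100, 1000):
    ceiling = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>5} FLOP/byte: {ceiling / 1e12:6.2f} TFLOPS ({regime})")
```

On these numbers the ridge sits near 78 FLOP/byte. A kernel below that intensity cannot use the card's full FLOPS no matter how well it is written.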

NVIDIA’s own treatment of inference optimization (NVIDIA 2024) and Baseten’s operator-level walkthrough (Baseten 2024) cover the practical details with first-party measurements. Both are worth reading alongside the academic sources.

GPUs are not the only option, and this book will not pretend otherwise. Apple Silicon’s unified-memory architecture changes the sizing math in ways we come back to in Section 4.5. Groq, Cerebras, and SambaNova ship dataflow accelerators that beat GPUs on specific serving workloads. Groq’s LPU [Language Processing Unit, Groq’s purpose-built chip for transformer inference, heavy on bandwidth and scheduling determinism] in particular trades batch throughput for per-token latency and wins decisively at low batch sizes. AMD’s MI300X reached parity with comparable NVIDIA parts on several LLM benchmarks through 2025. AWS Trainium and Google TPUs round out the serious non-NVIDIA options. The argument here is narrower than “GPUs are the right hardware.” It is that dense matmul throughput dominates inference, and the hardware in front of you should be chosen on how well it delivers that throughput at your precision and memory budget. For most operators in 2026, that still means a GPU. For some, it does not.

4.2 VRAM: where the weights live

A model’s weights have to sit in the device’s local memory to be fast. On a discrete GPU, that local memory is VRAM. On Apple Silicon, it is unified memory shared with the CPU. On a datacenter accelerator, it is HBM [High Bandwidth Memory, stacked DRAM chips sitting millimeters from the compute die, connected by thousands of short wires instead of one narrow trace on a circuit board]. A load from local memory takes a few nanoseconds. A load from system RAM across PCIe [Peripheral Component Interconnect Express, the bus that connects a GPU card to the rest of the computer] takes a few microseconds. That is roughly a thousand to one. No clever software trick papers over a thousand-to-one gap, and every serious inference stack treats VRAM as the first-class constraint.

The basic accounting is straightforward. A model with N parameters at b bits per parameter occupies roughly N × b / 8 bytes. A 9-billion-parameter model at FP16, where each weight is two bytes, takes 18 GB. The same model at 4-bit quantization (Q4_K_M, which we unpack in Section 4.4) lands at about 4.5 GB plus a small overhead for the quantization metadata. 18 GB does not fit on a 16 GB consumer card. 4.5 GB does. That is why quantization is not a nice-to-have. It is the difference between the model fitting on the card under your desk and the model being unrunnable.
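The N × b / 8 rule is easy to keep as a helper. This sketch covers only the raw weights. It ignores quantization metadata, the KV cache, and everything else in the budget, so the "fits" check here is the weights-only comparison from the paragraph above, not a full sizing.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Raw weight storage in GB (1 GB = 1e9 bytes): N * b / 8 bytes."""
    return num_params * bits_per_param / 8 / 1e9


CARD_GB = 16  # the consumer card used as the running example

for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    size = weight_footprint_gb(9e9, bits)
    verdict = "fits" if size <= CARD_GB else "does not fit"
    print(f"9B at {label}: {size:.1f} GB, {verdict} in {CARD_GB} GB (weights only)")
```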

Weights are only one line item in the VRAM budget. The full picture has five.

  • Weights, static, dominant for small batches, loaded once at startup.
  • KV cache, dynamic, per-request, grows with context length.
  • Activations, transient forward-pass buffers, scale with batch size.
  • Workspace, scratch memory for the matmul and attention kernels.
  • Fragmentation headroom, because real memory allocators leave 5 to 10% unused.

The KV cache [key-value cache, a buffer that stores the attention inputs from every token the model has already seen, so the next token does not have to recompute them] is where most operators get surprised. Here is the shape of the problem. During generation, every attention layer remembers the key and value projections of every token that came before, so the next token can attend to them without redoing the work. That remembered data is real memory, per request, and it grows linearly with how long the context is.

Work through it with numbers. A 9B model has 32 layers and an attention head dimension of 128. Run a single request at 8K context in 16-bit KV, and you use roughly 2 × 32 × 8192 × 128 × num_kv_heads × 2 bytes. The leading 2 counts keys and values. The trailing 2 is bytes per 16-bit value. For a group-query-attention [an attention variant where several query heads share a single key and value head, which cuts KV cache size without much accuracy loss] model with 8 KV heads, that works out to about 1 GB of KV cache for one request at 8K. Double the context, double the cache. Run eight concurrent requests at 8K and the total is 8 GB of memory gone before you have done anything but remember. The KV cache is the dominant VRAM cost for any multi-tenant serving workload with real context lengths, and it is why serving research from 2023 onward has been almost entirely about managing it better (Kwon et al. 2023; Kwon 2025).

Here is what this means for the paper summary. Your forty-page paper is maybe 30,000 tokens. Feed it in and the KV cache for that one request climbs to roughly 4 GB on a 9B model, before the model has produced a single bullet. Run the same paper twice at once, for two users, and you are at 8 GB just for the things the model is remembering. That is the number the serving engineer cares about. Not the weights. The weights were loaded once at startup. The KV cache is what scales with the number of people asking.
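The KV-cache arithmetic from the last two paragraphs, as a sketch you can rerun with your own model's shape. The 32 layers, 8 KV heads, and head dimension of 128 are the illustrative 9B configuration from the text. The function and argument names are mine.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2, concurrent_requests=1):
    """KV cache size in bytes for a grouped-query-attention model."""
    # Leading 2: one key vector and one value vector per KV head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests


MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

one_request_8k = kv_cache_bytes(context_tokens=8192, **MODEL)
eight_requests_8k = kv_cache_bytes(context_tokens=8192, concurrent_requests=8, **MODEL)
paper_30k = kv_cache_bytes(context_tokens=30_000, **MODEL)

for label, size in (("1 request at 8K", one_request_8k),
                    ("8 requests at 8K", eight_requests_8k),
                    ("30,000-token paper", paper_30k)):
    print(f"{label}: {size / 1e9:.2f} GB")
```

Per token the cache costs 131,072 bytes on this configuration. That number is the one to multiply by context length and concurrency.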

PagedAttention (Kwon et al. 2023) is the single most important result here. Before it, inference engines allocated the KV cache contiguously per request at the worst-case context length, which meant 60 to 80% of the cache memory sat reserved but unused for most requests. PagedAttention borrows the page-table idea from how operating systems manage virtual memory: split the KV cache into fixed-size blocks, allocate blocks lazily as the sequence grows, and keep a page table that maps logical positions to physical blocks. Fragmentation drops to near zero. Effective throughput per GB of VRAM goes up several-fold. Every serious serving stack in 2026 (vLLM, TensorRT-LLM, SGLang, MLC) runs some variant of this.
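To make the page-table idea concrete, here is a toy block allocator. It is a sketch of the bookkeeping only, not vLLM's implementation: the class, its method names, and the block size are all hypothetical, and no actual key or value tensors are stored.

```python
class PagedKVAllocator:
    """Toy page table: maps each request's token positions to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def physical_slot(self, request_id, position):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[request_id][position // self.block_size]
        return block, position % self.block_size

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("request-a")
print(len(allocator.block_tables["request-a"]), "blocks held for 20 tokens")
```

A 20-token request holds two 16-token blocks instead of a worst-case contiguous reservation. The waste is bounded by one partly filled block per request.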

Offloading is the escape hatch when the model does not fit. FlexGen (Sheng et al. 2023) was the canonical treatment. If weights, KV cache, and activations together exceed VRAM, evict parts of them to CPU RAM and stage them back across PCIe as needed. The scheduler is clever. The physics are brutal. PCIe 4.0 x16 peaks at about 32 GB/s. HBM on an H100 is 3 TB/s. The gap between those two numbers is a factor of a hundred. Anything that crosses PCIe becomes the bottleneck. Offloading makes models possible. It does not make them fast. Any use case beyond “I want to try a 70B model on a 24 GB card at single-digit tokens per second” needs the model to fit in device memory.
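A back-of-envelope sketch of why offloading lands at single-digit tokens per second. Every input is an assumption for illustration: a 70B model at 4-bit (35 GB of weights), 20 GB of it resident on a 24 GB card with 1,000 GB/s of VRAM bandwidth, the other 15 GB crossing PCIe at 32 GB/s on every token. It also assumes the two transfers happen back to back with no overlap, which real schedulers improve on.

```python
def tokens_per_second(resident_gb, offloaded_gb, vram_gb_s, pcie_gb_s=32.0):
    """Decode rate if every token must read all weights once, with no overlap."""
    seconds_per_token = resident_gb / vram_gb_s + offloaded_gb / pcie_gb_s
    return 1.0 / seconds_per_token


# Hypothetical split of a 35 GB model across a 24 GB card and CPU RAM.
partly_offloaded = tokens_per_second(resident_gb=20, offloaded_gb=15, vram_gb_s=1000)
# Counterfactual: the same weights fully resident on a card with enough memory.
fully_resident = tokens_per_second(resident_gb=35, offloaded_gb=0, vram_gb_s=1000)

print(f"partly offloaded: {partly_offloaded:.1f} tok/s")
print(f"fully resident:   {fully_resident:.1f} tok/s")
```

The 15 GB that crosses PCIe accounts for almost all of the per-token time, even though it is the smaller share of the weights.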

Working sizing math for a 16 GB consumer card running a 9B model at Q4_K_M. Weights: about 4.5 GB. Workspace and activations at batch 1: about 0.5 GB. Framework and driver overhead: about 1 GB. Remaining for KV cache: about 10 GB. At a 9B model’s rough KV footprint, that leaves headroom for roughly 64K tokens of context at single-batch, or four concurrent requests at 16K. See Chapter 22 for the full sweep across quantization levels.
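The same budget as a sketch you can edit. The line items are the ones above. The per-token KV cost assumes the 32-layer, 8-KV-head, 128-dimension configuration from earlier in this section, and the 10% fragmentation headroom is my assumption, taken from the top of the 5 to 10% range.

```python
CARD_GB = 16.0
FIXED_COSTS_GB = {
    "weights (9B at Q4_K_M)": 4.5,
    "workspace and activations": 0.5,
    "framework and driver": 1.0,
}
# Keys and values, 32 layers, 8 KV heads, head dim 128, 2 bytes each.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def kv_token_capacity(card_gb, fixed_costs_gb, kv_bytes_per_token, headroom_fraction=0.0):
    """Tokens of KV cache that fit after fixed costs and allocator headroom."""
    kv_gb = card_gb - sum(fixed_costs_gb.values()) - headroom_fraction * card_gb
    return int(kv_gb * 1e9 // kv_bytes_per_token)


raw_tokens = kv_token_capacity(CARD_GB, FIXED_COSTS_GB, KV_BYTES_PER_TOKEN)
usable_tokens = kv_token_capacity(CARD_GB, FIXED_COSTS_GB, KV_BYTES_PER_TOKEN,
                                  headroom_fraction=0.10)

print(f"raw KV capacity:       {raw_tokens:,} tokens")
print(f"with 10% headroom:     {usable_tokens:,} tokens")
print(f"per request at 4 users: {usable_tokens // 4:,} tokens")
```

With the headroom held back, the usable figure lands close to the 64K quoted above.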

Apple Silicon breaks this accounting in a specific way. Unified memory means the CPU and GPU share the same physical pool, with no PCIe hop between them. An M2 Ultra with 192 GB of unified memory holds a 70B model at 4-bit with room left for context. 192 GB is about six times what the largest consumer GPU ships with. A same-era NVIDIA workstation would need two H100s and a dedicated power circuit to match what fits in the Mac on your desk. The cost is bandwidth. Apple’s M2 Ultra runs at around 800 GB/s of memory bandwidth, roughly a quarter of an H100’s. The model fits, but each token comes out slower. We return to what this means practically in Section 4.5.

4.3 Memory bandwidth as the real bottleneck

Here is the claim that reframes everything an operator thinks about inference performance.

During autoregressive decode, tokens come out at the rate the GPU can stream the model weights through its compute units. Not at the rate the compute units can multiply. Memory bandwidth is the ceiling. FLOPS are a number on a spec sheet that stops mattering once bandwidth saturates.

This is unintuitive. The marketing copy for every GPU talks about FLOPS. The benchmark leaderboards report FLOPS. The word “TFLOPS” appears twelve times in the H100 datasheet. And yet when you measure what actually happens during a decode step on a modern serving stack, the GPU’s compute units sit idle more than half the time, waiting for weights to arrive from HBM. Let me walk through why.

4.3.1 Arithmetic intensity and the roofline

A matrix multiplication of an [M, K] matrix by a [K, N] matrix performs 2 × M × N × K floating-point operations and moves roughly M × K + K × N + M × N matrix elements through memory (setting aside cache reuse for a moment). Multiply that element count by the bytes per element (two at FP16) to get bytes moved. The ratio of FLOPs to bytes moved is the operation’s arithmetic intensity. High intensity means lots of math per byte moved. Low intensity means the opposite. A compute-bound kernel sits in the high-intensity zone. A memory-bound kernel sits in the low-intensity zone. Where on the roofline a kernel lives tells you what would make it faster.

For transformer inference, the intensity depends heavily on which step you are in.

During prefill [the initial forward pass that processes the whole prompt at once, before any new tokens get generated], the batch dimension is the length of the prompt. A 1,000-token prompt produces matrices with M = 1000, and the matmuls come out large and roughly square. Arithmetic intensity is high. The kernel is compute-bound. FLOPS matter. This is why prefill on a well-tuned stack is fast and scales roughly with prompt length.

During decode [the token-by-token generation step that produces each new token after prefill finishes], the batch dimension is, effectively, one per request. Every matmul degenerates into a matrix-vector product: the model’s weight matrix is large and roughly square, but the input multiplying into it is a single [1, hidden] vector per request. The full weight matrix has to be streamed from HBM for every single token generated. Arithmetic intensity collapses. The kernel is memory-bound. FLOPS sit on the sidelines while the memory subsystem does all the work.
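Plug the two phases into the intensity formula from 4.3.1 and the split shows up in the numbers. This is a sketch for a single square weight matrix, assuming a hidden size of 4096 (my choice for illustration), FP16 elements, and no cache reuse.

```python
def matmul_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for an [m, k] x [k, n] matmul, ignoring cache reuse."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


HIDDEN = 4096  # hypothetical hidden size

prefill = matmul_intensity(m=1000, k=HIDDEN, n=HIDDEN)  # 1,000-token prompt
decode = matmul_intensity(m=1, k=HIDDEN, n=HIDDEN)      # one token, one request

print(f"prefill: {prefill:.1f} FLOP/byte")
print(f"decode:  {decode:.3f} FLOP/byte")
```

Prefill comes out in the hundreds of FLOPs per byte. Decode comes out at about one, because with M = 1 the weight matrix dominates both the FLOP count and the byte count, and the ratio collapses to 2 divided by the bytes per element.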

This is where the paper summary splits in half. Prefill reads the paper. Decode writes the bullets. Prefill is FLOPS-bound and scales with paper length. Decode is bandwidth-bound and scales only with how long the summary itself is. That is why, on a bandwidth-limited card, a long paper costs time up front but the summary itself comes out at the same rate it would for a short paper. The two phases are paying two different taxes. When you wait a little for the model to “read” and then watch bullets stream out at a steady rate, you are watching prefill finish and decode start, and you are watching the hardware shift from one side of the roofline to the other.

Decode and prefill have opposite physics. A GPU with huge FLOPS and modest bandwidth, hypothetically, would scream at prefill and crawl at decode. A GPU with modest FLOPS and huge bandwidth, which is roughly the shape of Cerebras’s WSE-3 or Groq’s LPU, is the mirror image. Real GPUs balance the two, but the balance is never exactly right, and operators who ignore the split end up surprised.

4.3.2 Why a 9060 XT will have the bandwidth-bound shape

An RX 9060 XT has around 320 GB/s of memory bandwidth and around 25 TFLOPS of FP16 throughput. Take a 9B model at FP16, which gives you 18 GB of weights. During decode, each token requires a full sweep over those weights. The theoretical ceiling is 320 / 18 ≈ 17.8 tok/s. That is the most tokens per second the card can produce when every token needs the full set of weights streamed out of VRAM. No kernel cleverness will get you past roughly 18 tok/s on that card at FP16. The physics will not allow it. Quantize to 4-bit, and the weight footprint drops to 4.5 GB. The bandwidth ceiling jumps to 320 / 4.5 ≈ 71 tok/s. Four times the speed for the same model, because you cut the amount of data the memory subsystem has to shove through the pipe by a factor of four. That is the operational motivation for running consumer models quantized.
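The ceiling is a single division, which makes it easy to sweep across bit widths. A sketch using the same 320 GB/s figure. It counts weight traffic only and ignores KV-cache reads, so real numbers land below it.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, num_params_billion, bits_per_param):
    """Upper bound on single-stream decode: bandwidth / bytes of weights per token."""
    weight_gb = num_params_billion * bits_per_param / 8
    return bandwidth_gb_s / weight_gb


fp16_ceiling = decode_ceiling_tok_s(320, 9, 16)
q4_ceiling = decode_ceiling_tok_s(320, 9, 4)

for label, ceiling in (("FP16", fp16_ceiling), ("4-bit", q4_ceiling)):
    print(f"9B at {label}: at most {ceiling:.1f} tok/s")
```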

The same argument, different numbers, applies to every card. For an H100 at 3 TB/s serving a 70B model at FP16 (140 GB of weights, sharded across two cards), the per-card decode ceiling is 3000 / 70 ≈ 43 tok/s. With FP8 weights, it roughly doubles. Batch eight concurrent requests and each sweep over the weights costs the same but now yields eight tokens, one per request, so the weight traffic is amortized across more users. That is why datacenter serving cares so much about batch size and continuous batching. All of this follows from the single observation that decode is bandwidth-bound.
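Batching in the same arithmetic, as a sketch. It assumes the weight sweep is the only cost, so it ignores per-request KV-cache reads and the point where a large enough batch becomes compute-bound. Treat the aggregate figure as an optimistic bound.

```python
def aggregate_tok_s(bandwidth_gb_s, weight_gb_per_card, batch_size=1):
    """Tokens per second across all requests when one weight sweep serves the batch."""
    sweeps_per_second = bandwidth_gb_s / weight_gb_per_card
    return sweeps_per_second * batch_size  # one token per request per sweep


single_stream = aggregate_tok_s(3000, 70)             # 70B at FP16, two cards
batch_of_eight = aggregate_tok_s(3000, 70, batch_size=8)
fp8_single = aggregate_tok_s(3000, 35)                # same model at FP8

print(f"single stream:  {single_stream:.0f} tok/s")
print(f"batch of eight: {batch_of_eight:.0f} tok/s in aggregate")
print(f"FP8, single:    {fp8_single:.0f} tok/s")
```

Each user still sees roughly the single-stream rate. What batching changes is how many users share the cost of one sweep.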

4.3.3 The 2025 literature tightened this

Three 2025 papers make this picture sharper than it was even two years ago.

Mind the Memory Gap (Anonymous 2025b) measured, on real hardware running real serving workloads, how often attention kernels stall waiting for memory. Even at large batch sizes, where the arithmetic intensity should recover because multiple queries attend to the same KV cache, more than half of attention kernel cycles get spent waiting on DRAM. The paper is a rebuttal to the informal assumption that batching alone makes attention compute-bound. It does not, at current hardware ratios.

Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need (Anonymous 2025a) runs a limit study. The question: what are the physical constraints on inference throughput, formally? The answer turns out to be those four. Bandwidth for weight streaming. Compute for large-batch attention. Synchronization for cross-GPU communication, which matters at frontier-scale serving. Capacity for the KV cache. Every serving optimization in the wild falls into tightening one of these four, and the paper is the clearest statement I know of why.

Rethinking LLM Inference Bottlenecks (Anonymous 2025c) adds a crucial pluralist note. Multi-head Latent Attention [MLA, an attention variant introduced by DeepSeek that compresses the KV cache into a low-rank latent space instead of storing every key and value outright], adopted by several 2025 frontier models, has a second-order effect most operators missed. The latent projection reintroduces enough arithmetic density that attention under MLA is no longer strictly bandwidth-bound. It shifts toward compute-bound. Meanwhile, the more common MHA and GQA variants, which Llama, Qwen, Mistral, and most other open models use, remain bandwidth-bound during decode. The physics depends on the architecture variant. The blanket statement “decode is memory-bound” is true for the models most operators run today. It is not a timeless law. It is a fact about the most common attention shapes at 2026’s hardware ratios.

4.3.4 FlashAttention: the same insight, applied to attention

FlashAttention (Dao et al. 2022) is the cleanest demonstration of the bandwidth-dominance idea. The original 2022 paper showed that attention was not, as everyone had assumed, compute-bound. It was bound by the cost of writing the N × N attention matrix out to HBM and reading it back. Tile the computation. Keep each tile of the attention matrix in SRAM [Static RAM, the small, fast on-chip memory that sits right next to the compute units, orders of magnitude faster than HBM]. You get exactly the same result with 2x to 4x less memory traffic and correspondingly better wall-clock time. The paper’s framing, “IO-awareness,” reframed the whole field. FlashAttention-2 (Dao 2023) and FlashAttention-3 (Shah et al. 2024) extend the idea with better work partitioning, Hopper-specific hardware features, and lower precision (FP8), each time picking up another factor of 1.5x to 2x. The trajectory of attention research since 2022 has been, essentially, “find more creative ways to avoid moving bytes.”
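The trick that makes tiling exact is a running softmax: process keys and values one block at a time, and rescale the partial result whenever a new block raises the running maximum. The sketch below shows that rescaling for a single query in pure Python. It is not the CUDA kernel and says nothing about SRAM. It only demonstrates that the blockwise answer matches the all-at-once answer without ever holding the full row of attention weights.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Materialize every score, softmax, then take the weighted sum of values."""
    scale = 1 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size):
    """Same result, visiting keys and values one block at a time."""
    scale = 1 / math.sqrt(len(q))
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_block]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_block))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5, -0.25, 2.0]
keys = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0], [0.3, 0.3, 0.3, 0.3],
        [-0.5, 0.8, 0.1, 1.2], [2.0, 0.0, -1.0, 0.5], [0.0, 0.0, 1.0, -1.0]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, -1.0, 0.0],
          [0.5, 0.5, 0.5], [-2.0, 1.0, 1.0], [0.0, 4.0, -1.0]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```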

For operators, a pair of heuristics falls out of all of this.

  • Decode latency is set by bandwidth. If you want faster per-token generation on a fixed model, either quantize (cut bytes per weight) or buy a card with more bandwidth. FLOPS upgrades alone barely move the needle for single-stream decode.
  • Prefill throughput is set by FLOPS. If you care about time-to-first-token on long prompts, a compute-heavier card helps. This is why, at large context sizes, prefill and decode are increasingly disaggregated onto different hardware. The serving stack runs prefill on compute-dense GPUs and streams the KV cache to bandwidth-dense GPUs for decode.

This is the single most useful heuristic in the chapter. Before comparing two GPUs for LLM work, look at both their FLOPS and their memory bandwidth. Whichever of the two is scarcer, relative to the workload shape you care about, will be your ceiling. Pierre Lienhart’s Dissecting Model Performance (Lienhart 2024) walks through the arithmetic-intensity math with actual numbers and is the best single article I have read on the operator-level implications.

4.4 Quantization, introduced

Given what the previous section established, the motivation for quantization stops being mysterious. Decode is bandwidth-bound. The ceiling on tokens per second is bandwidth divided by weight footprint. So if you can store the weights in 8 bits instead of 16, the footprint halves and the decode ceiling doubles. Store them in 4 bits and the ceiling quadruples. You pay some accuracy for each halving, and that accuracy cost is what the field has spent a decade learning to minimize.

Come back to the paper one more time. Summarize it on a consumer card at FP16 and you are looking at maybe 17 or 18 tokens per second. Summarize the same paper with the same model at 4-bit and you are looking at maybe 71 tokens per second. A comfortable reading pace is about five words per second. 17 tokens per second is roughly 13 or 14 words per second, already more than twice that pace. 71 tokens per second is faster than you can skim. The accuracy difference between the two summaries, on a task like this one, is usually small enough that a careful reader could not tell you which model wrote which. That is the quantization trade in one sentence. You are buying a fourfold speedup in generation for a cost that, on most tasks, you cannot feel.

The landmarks in the field are worth naming.

8-bit is close to lossless for inference on most models, and has been since LLM.int8 (Dettmers et al. 2022) showed how to handle the activation outliers that break naive INT8 quantization. “Close to lossless” is a real, measurable claim: on standard benchmarks, 8-bit post-training quantized models score within noise of their FP16 counterparts. 8-bit is the default for a lot of production serving when memory is tight but accuracy matters.
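To see why outliers are the problem, here is the naive scheme that LLM.int8 improves on: scale everything by the largest absolute value and round to an integer in [-127, 127]. This sketch is plain absmax quantization, not the LLM.int8 method, and the two weight lists are made-up numbers chosen to show the effect.

```python
def quantize_absmax_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127 if peak > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def max_abs_error(weights):
    codes, scale = quantize_absmax_int8(weights)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


well_behaved = [0.02, -0.5, 0.31, 0.9]   # hypothetical values, similar magnitudes
with_outlier = [0.02, -0.5, 0.31, 60.0]  # same values plus one large outlier

print(f"no outlier:   worst error {max_abs_error(well_behaved):.4f}")
print(f"with outlier: worst error {max_abs_error(with_outlier):.4f}")
```

One large value stretches the shared scale, so every small value lands on a much coarser grid. The worst-case error on the ordinary weights grows by well over an order of magnitude.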

4-bit is the current consumer-hardware sweet spot. GPTQ (Frantar et al. 2023) was the breakthrough. It is a post-training quantization method that uses second-order information (Hessian-weighted error) to decide which weights to round how, and it achieves 3-to-4-bit quantization on a 70B model at a cost of a few percentage points on downstream tasks. Modern variants (AWQ, GGUF’s Q4_K_M, bitsandbytes NF4) refine the technique, but the core insight traces back to GPTQ. For 7B to 14B class models, Q4_K_M is what most operators run locally. The accuracy loss is genuinely small. The memory win is genuinely large.

2 to 3 bit is where the field gets interesting. Naively rounding weights to 2 bits destroys accuracy, and the reason is worth pausing on. Transformer weights do not distribute evenly. A few weights carry very large magnitudes. Most sit close to zero. When you round to two bits, you have four possible values, and the few big weights get flattened into the same bucket as everything else. The model loses precisely the signals that were doing the most work. Rotation-based methods fix this by applying a random or learned orthogonal rotation before quantization, which redistributes the outliers across the weight matrix so no single bucket has to absorb all of them. QuaRot (Ashkboos et al. 2024) uses a fixed Hadamard rotation. SpinQuant (Liu et al. 2024) learns the rotation via Cayley-constrained optimization. TurboQuant (Zandieh et al. 2026), Google Research’s 2025 paper accepted to ICLR 2026, extends this to online vector quantization of the KV cache and demonstrates 3-bit KV with essentially zero accuracy loss on Gemma and Mistral. Quantization-aware training (EfficientQAT (Chen et al. 2025) and related work) pushes further, reaching 2-bit Llama-70B on a single A100 with under three accuracy points lost.
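The rotation idea fits in a few lines. The sketch below rotates one toy vector with a fixed, normalized Hadamard matrix, quantizes to four levels, and rotates back. It is an illustration of the principle, not QuaRot's or SpinQuant's implementation: the 2-bit scheme, the vector, and the helper names are all mine.

```python
import math


def hadamard(n):
    """Normalized Sylvester Hadamard matrix (orthogonal); n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    norm = 1 / math.sqrt(n)
    return [[x * norm for x in row] for row in h]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def quantize_2bit(vector):
    """Round to four evenly spaced levels spanning [-peak, +peak]."""
    peak = max(abs(x) for x in vector)
    if peak == 0:
        return [0.0] * len(vector)
    step = 2 * peak / 3
    return [-peak + step * round((x + peak) / step) for x in vector]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# One large weight, seven small ones (made-up numbers).
weights = [8.0, 0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3]

direct_error = squared_error(weights, quantize_2bit(weights))

h = hadamard(len(weights))
rotated = matvec(h, weights)                     # outlier spread across all entries
restored = matvec(transpose(h), quantize_2bit(rotated))
rotated_error = squared_error(weights, restored)

print(f"largest magnitude before rotation: {max(abs(x) for x in weights):.2f}")
print(f"largest magnitude after rotation:  {max(abs(x) for x in rotated):.2f}")
print(f"squared error, direct 2-bit:       {direct_error:.2f}")
print(f"squared error, rotate then 2-bit:  {rotated_error:.2f}")
```

Because the rotation is orthogonal, it can be undone exactly, and the error measured after rotating back equals the error made in the rotated space. Spreading the outlier shrinks the range the four levels have to cover, which is where the gain comes from.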

The shape of the accuracy-versus-bits curve is the thing to hold in your head. It is not linear. FP16 to 8-bit is close to free. 8-bit to 4-bit costs a little. 4-bit to 3-bit costs more and needs rotation tricks. 3-bit to 2-bit requires quantization-aware training and the loss is real. Every halving of bit width halves the memory footprint and doubles bandwidth-bound tokens per second, at an accuracy cost that grows non-linearly. Somewhere on that curve is the right operating point for your workload.

4.5 Why local inference is more viable than people think

The industry narrative through 2023 was that serious LLM inference required datacenter hardware. That narrative has aged badly. By April 2026, a 16 GB consumer card runs a 7B to 14B parameter model at 4-bit with room for 32K of context and responsive tokens-per-second for a single user. A $4,000 Mac Studio runs a 70B model at usable speed. The physics of local inference shifted, and most operators have not fully updated.

Four things changed.

Quantization matured. The progression from GPTQ in 2023 through QuaRot and SpinQuant in 2024 to TurboQuant and EfficientQAT in 2025 compressed the effective memory footprint of a given-quality model by roughly 4x in three years. A model that required 16 GB at FP16 now fits in 4 GB at 4-bit with a fraction of a percent of accuracy loss. Everything downstream of memory-fit (bandwidth-bound tokens per second, batch size, context length) improved proportionally.

Open models closed most of the capability gap. As of April 2026, the best open-weight models on general benchmarks sit within striking distance of frontier closed models on a large fraction of real tasks. Qwen 3.5, Llama 4, DeepSeek V3, Mistral Small 3: the 7B to 70B class of open weights is no longer “good for open-source.” It competes with paid APIs on most knowledge and reasoning tasks, and beats them on some coding subsets. This matters because local inference is only interesting if the models worth running are available to run.

Consumer hardware kept getting more bandwidth. The RTX 4090 shipped at 1 TB/s in 2022. The 5090 shipped at 1.8 TB/s in 2025. Apple Silicon’s unified-memory bandwidth climbed from 400 GB/s on the M1 Max to 800 GB/s on the M2 Ultra. AMD’s RDNA 4 cards reach 300+ GB/s at the midrange. Every one of those numbers raises the bandwidth-bound tokens-per-second ceiling for any quantized model that fits.

Apple Silicon rewrote the economics. This is the biggest structural shift. A unified-memory architecture with 192 GB of shared RAM and 800 GB/s of bandwidth (the M2 Ultra configuration) holds a 70B model at 4-bit with 100+ GB of context and workspace headroom. No consumer GPU ships with enough VRAM to do the same at any price. It is slower per token than a same-generation H100, but dramatically faster per dollar for single-user workloads. Rajput et al. (2025) measured this carefully. MLX on M2 Ultra reaches around 230 tokens per second on a small model. llama.cpp on the same hardware reaches around 150 tokens per second. Both are single-user numbers in the range that makes an LLM feel responsive.

Those numbers deserve a human-scale anchor. A comfortable reading pace for an adult is about five words per second. 150 tokens per second is roughly 120 words per second, or about 25x faster than you can read. You will never be waiting on the model. The model will always be waiting on you.

Return to the paper one last time. On a $1,500 consumer card, the same forty-page paper that took two-and-change seconds to summarize on a cloud API in Part IV takes, call it, six seconds locally. Nobody times either one with a stopwatch. The local machine is already in the “fast enough to feel instant” regime for single-user summarization, and every bandwidth and quantization improvement since 2023 has made that regime bigger, not smaller.

The honest version of “local inference is viable” needs the workload class specified. It is viable for:

  • Single-user inference at reasonable latency (batch 1, conversational).
  • Latency-tolerant batch processing (document ingestion, summarization, embedding generation).
  • Context-bounded workloads (most tasks under 32K, some under 128K with care).
  • Privacy-critical or air-gapped deployments where sending tokens to a third party is off the table.

It is not yet viable, or at least not obviously cheaper than cloud, for:

  • Multi-tenant production serving at high QPS with strict latency SLAs.
  • Frontier-model workloads requiring 400B+ parameters or extremely long contexts at low latency.
  • Workloads where the per-token API price beats the amortized hardware cost at your utilization.

Cost hinges on utilization. A $2,000 GPU running at 100% amortizes quickly. The same GPU running two hours a day does not. The break-even against API pricing is a function of how hard you push the hardware, and for most individual operators, it sits closer than the industry narrative admits. We come back to the full TCO argument, with first-party Foundry numbers, in Chapter 21.

“Viable” is workload-specific. If you are a startup doing 100 requests per second of user-facing chat at a strict 500ms P95, go buy API credits. If you are a single developer who runs an LLM 50 times a day, writes code with it, summarizes documents, and occasionally processes a thousand-item batch, a $1,500 consumer card will pay for itself in eighteen months and the model will be available when the API is down. The distinction between those two use cases is the whole argument. The rest is arithmetic.
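The break-even arithmetic can be sketched as follows. Every number in the example is a hypothetical placeholder, not a measurement: plug in your own decode rate, hours of use, API price, and power cost. The function names are illustrative.

```python
from typing import Optional


def monthly_tokens(tokens_per_second: float, hours_per_day: float,
                   days_per_month: int = 30) -> float:
    """Tokens generated per month at a given utilization."""
    return tokens_per_second * hours_per_day * 3600 * days_per_month


def break_even_months(hardware_cost: float, tokens_per_month: float,
                      api_price_per_million: float,
                      monthly_power_cost: float = 0.0) -> Optional[float]:
    """Months until avoided API spend covers the hardware, or None if it never does."""
    api_spend = tokens_per_month / 1e6 * api_price_per_million
    net_saving = api_spend - monthly_power_cost
    if net_saving <= 0:
        return None  # local never pays off at this utilization
    return hardware_cost / net_saving


if __name__ == "__main__":
    # Hypothetical: 50 tok/s for 2 hours a day, $5 per million API tokens, $4/month power.
    tokens = monthly_tokens(50, 2)
    months = break_even_months(1500, tokens, 5.0, 4.0)
    print(f"{months:.1f} months")  # about 30 months
```

Doubling the hours of use roughly halves the payback period, which is the sense in which cost hinges on utilization.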

4.6 What this means for how you think about models

The physical constraints shape everything downstream. Picking a model is not a matter of looking at benchmark leaderboards and grabbing the top name. It is a matter of checking, in order, whether the model fits on your hardware, whether the decode bandwidth is enough for your latency target, whether the quantization level preserves the accuracy you need, and whether the cost per token at your utilization beats the alternatives. The benchmark score matters, but it is never the first question.

Imagine you are choosing between two models for the paper-summary task. Model A scores two points higher on a public benchmark. Model B is a billion parameters smaller and runs at 4-bit on the card under your desk. If you have the hardware for Model A at 4-bit and your accuracy threshold clears, pick Model A. If Model A does not fit and running it means offloading to CPU RAM across PCIe at a hundredfold bandwidth hit, pick Model B. If you are on Apple Silicon and Model A fits but decode is two tokens per second, pick Model B. The benchmark is the last tiebreaker, not the first filter.
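One way to encode that ordering is as a chain of filters, with the benchmark score used only to rank whatever survives. This is a minimal sketch: the `Candidate` fields, the thresholds, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    memory_needed_gb: float         # weights + KV cache + activations + workspace
    decode_tokens_per_sec: float    # measured on your hardware
    accuracy: float                 # on your own task, at the chosen quantization
    cost_per_million_tokens: float  # at your utilization
    benchmark_score: float          # public leaderboard number


def pick_model(candidates: List[Candidate], memory_budget_gb: float,
               min_tokens_per_sec: float, min_accuracy: float,
               max_cost_per_million: float) -> Optional[Candidate]:
    """Apply the physical filters first; use the benchmark only as a tiebreaker."""
    viable = [
        c for c in candidates
        if c.memory_needed_gb <= memory_budget_gb
        and c.decode_tokens_per_sec >= min_tokens_per_sec
        and c.accuracy >= min_accuracy
        and c.cost_per_million_tokens <= max_cost_per_million
    ]
    return max(viable, key=lambda c: c.benchmark_score, default=None)


if __name__ == "__main__":
    model_a = Candidate("A", 26.0, 20.0, 0.91, 2.0, 72.0)
    model_b = Candidate("B", 18.0, 35.0, 0.90, 1.5, 70.0)
    choice = pick_model([model_a, model_b], memory_budget_gb=24.0,
                        min_tokens_per_sec=10.0, min_accuracy=0.88,
                        max_cost_per_million=3.0)
    print(choice.name)  # B, because A does not fit in 24 GB
```

With a 32 GB budget the same call returns Model A, since both candidates clear every filter and A has the higher score.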

Carry these heuristics forward.

  • VRAM is the first question when picking a model. If the weights plus KV cache plus activations plus workspace do not fit in device memory, everything else is moot. Start with the memory budget.
  • Bandwidth sets decode latency. FLOPS set prefill latency. Long prompts with short responses are FLOPS-bound. Short prompts with long responses are bandwidth-bound. Most production workloads sit somewhere in between. Most consumer workloads are bandwidth-bound enough that you should optimize for memory throughput, not raw compute.
  • Quantization is lossy compression, and 4-bit is the current safe default. You trade a small number of accuracy points for roughly 4x less memory and 4x less memory traffic than 16-bit weights. For consumer-class workloads, that trade is almost always correct. For safety-critical or evaluation workloads, run at higher precision and verify.
  • Local hardware sits closer to viable than the industry narrative suggests, for the right workload class. Define the class honestly, measure against your actual usage, and do the arithmetic. “Local is a toy” was true in 2022. It is not true in 2026.
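The first three heuristics reduce to two back-of-envelope checks. The sketch below assumes batch-1 decode streams every weight from memory once per token and ignores KV-cache reads, so the decode figure is an upper bound, not a prediction. The hardware numbers are hypothetical and the function names are illustrative.

```python
def fits_in_memory(device_gb: float, weights_gb: float, kv_cache_gb: float,
                   activations_gb: float, workspace_gb: float) -> bool:
    """The memory budget check: everything must fit on the device."""
    return weights_gb + kv_cache_gb + activations_gb + workspace_gb <= device_gb


def decode_tokens_per_sec_bound(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling on batch-1 decode speed (assumes one full weight read per token)."""
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    # Hypothetical 24 GB card with 1000 GB/s of memory bandwidth, 8B-parameter model.
    fp16_weights_gb = 16.0  # 8B params at 16 bits
    int4_weights_gb = 4.0   # 8B params at 4 bits

    print(fits_in_memory(24, fp16_weights_gb, 6, 2, 2))        # False: 26 GB needed
    print(fits_in_memory(24, int4_weights_gb, 6, 2, 2))        # True: 14 GB needed
    print(decode_tokens_per_sec_bound(1000, fp16_weights_gb))  # 62.5
    print(decode_tokens_per_sec_bound(1000, int4_weights_gb))  # 250.0
```

The same 4x that lets the model fit also raises the decode ceiling by 4x, which is why quantization helps on both axes at once.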

One honest admission before we move on. The field does not yet have a clean theory for why some quantization schemes lose so little accuracy and others break the model. We know rotation helps. We know outliers matter. We know quantization-aware training beats post-training quantization at low bit widths. We do not have a first-principles account of which bit widths are safe for which architectures on which tasks. The answer today is still mostly empirical: run the benchmarks on the quantized model, see what drops, pick the level that clears your threshold.
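That empirical procedure can be written down directly. This is a sketch under stated assumptions: you have already measured a benchmark score at each bit width, and the scores in the example are made up for illustration.

```python
from typing import Dict, Optional


def lowest_safe_bits(scores_by_bits: Dict[int, float], baseline_score: float,
                     max_drop: float) -> Optional[int]:
    """Return the smallest bit width whose measured score stays within max_drop of baseline."""
    safe = [bits for bits, score in scores_by_bits.items()
            if baseline_score - score <= max_drop]
    return min(safe) if safe else None


if __name__ == "__main__":
    # Hypothetical measurements: bit width -> benchmark score on your task.
    measured = {8: 70.1, 4: 69.5, 3: 65.0, 2: 40.0}
    print(lowest_safe_bits(measured, baseline_score=70.3, max_drop=1.0))  # 4
```

The function knows nothing about why 3-bit fell off a cliff in this made-up data. It only reports what the measurements show, which is the current state of the art.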

Anonymous. 2025a. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity Are All You Need. arXiv:2507.14397. https://arxiv.org/abs/2507.14397.
Anonymous. 2025b. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. arXiv:2503.08311. https://arxiv.org/abs/2503.08311.
Anonymous. 2025c. Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts. arXiv:2507.15465. https://arxiv.org/abs/2507.15465.
Ashkboos, Saleh, Amirkeivan Mohtashami, Maximilian L. Croci, et al. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. arXiv:2404.00456. https://arxiv.org/abs/2404.00456.
Baseten. 2024. A Guide to LLM Inference and Performance. Baseten engineering blog. https://www.baseten.co/blog/llm-transformer-inference-guide/.
Chen, Mengzhao et al. 2025. “EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). https://arxiv.org/abs/2407.11062.
Dao, Tri. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691. https://arxiv.org/abs/2307.08691.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2205.14135.
Dettmers, Tim, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. “LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2208.07339.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.17323.
Kwon, Woosuk. 2025. vLLM: An Efficient Inference Engine for Large Language Models. EECS-2025-192. UC Berkeley EECS. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.pdf.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.
Lienhart, Pierre. 2024. LLM Inference Series: 5. Dissecting Model Performance. Personal engineering blog (AWS). https://medium.com/@plienhar/llm-inference-series-5-dissecting-model-performance-6144aa93168f.
Liu, Zechun, Changsheng Zhao, et al. 2024. SpinQuant: LLM Quantization with Learned Rotations. Meta FAIR, arXiv:2405.16406. https://arxiv.org/abs/2405.16406.
NVIDIA. 2024. Mastering LLM Techniques: Inference Optimization. NVIDIA Developer Blog. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/.
Rajput, Subhash et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, Llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.
Shah, Jay, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, et al. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. arXiv:2407.08608. https://arxiv.org/abs/2407.08608.
Sheng, Ying, Lianmin Zheng, Binhang Yuan, et al. 2023. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” International Conference on Machine Learning (ICML 2023). https://arxiv.org/abs/2303.06865.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762.
Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4). https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2026. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” International Conference on Learning Representations (ICLR 2026, Accepted). https://arxiv.org/abs/2504.19874.