Part V — Bare Metal

VRAM Math and Picking a GPU

A GPU fits your model if three numbers add up to less than the sticker on the box.

Chapter 4 (Section 4.2) sketched the VRAM [video RAM, the memory that lives on the GPU card itself] budget as four line items: weights, KV cache, activations, workspace. That level of detail suits a reader who wants to understand why memory matters. It does not suit the reader with $350 or $2,000 or $40,000 in hand who has to pick a card tonight. This chapter is for that reader.

Same running example as the rest of the book. Your forty-page research paper, pasted into a model, summarized into five bullets. In the cloud, VRAM is someone else’s problem and you pay for tokens. On your own hardware, VRAM is the problem, and the arithmetic below tells you whether the summary will run at all. Summarizing that paper on a 9B model at 4-bit on a $349 consumer card is a different question from the same summary on a rented H100 that costs $3 an hour. Both are reasonable answers to “how do I summarize this paper.” They sit at opposite ends of a budget range spanning roughly two orders of magnitude, and the formula below covers both ends.

A warning about the numbers that follow. Every price in this chapter is a snapshot from April 2026. GPU pricing moves on tariff cycles, crypto cycles, supply constraints, and generational releases. The methodology will outlast the numbers; the numbers will not. Rerun the arithmetic against current prices before you buy.

23.1 The VRAM formula

Here is the whole chapter in one equation:

\[ \text{VRAM} = W + K + A + S + H \]

W is the weight footprint, the size of the model itself sitting in memory. K is the KV cache [the key and value projections the attention layer caches so the next token does not have to recompute them], which grows with how long your paper is and how many copies of the summary you run at once. A is activations, the scratch space the forward pass needs while it computes. S is workspace, the bookkeeping the inference runtime allocates for itself. H is headroom, the slack you leave on the table so nothing OOMs [runs out of memory, which crashes the process and loses your work] under pressure.

Each term deserves a sentence.

Weights. A model with N parameters at b bits per parameter takes \(N \times b / 8\) bytes on disk and in VRAM once loaded. Qwen 3.5 9B at FP16 [16-bit floating point, the standard full-precision format for inference] takes 9 billion * 2 bytes = 18 GB. At 4-bit quantization [shrinking each weight to 4 bits instead of 16, the topic of Chapter 22], using the Q4_K_M format we landed on earlier, the same model collapses to roughly 4.5 GB before overhead. The per-block scale factors and zero points that make up quantization metadata usually add another 5 to 10 percent, which operators routinely forget when they look only at the headline bit-width. So the 9B model that “takes 4.5 GB” takes closer to 4.9 GB once the bookkeeping lands.

To anchor that number: the 9B weights at 4-bit fit inside the RAM that shipped with a decent laptop in 2015. The same 9B at FP16 takes four times that. The same 9B at FP32 [32-bit floating point, the precision used during training] takes eight times that.
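The weight arithmetic is short enough to script. The sketch below is illustrative only: the function name weight_gb and the metadata_overhead parameter are made up for this example, and the 10 percent overhead is the upper end of the range quoted above, not a measured value.

```python
def weight_gb(n_params: float, bits_per_weight: float, metadata_overhead: float = 0.0) -> float:
    """Weight footprint in decimal GB: N * b / 8 bytes, scaled by optional metadata overhead."""
    raw_bytes = n_params * bits_per_weight / 8
    return raw_bytes * (1.0 + metadata_overhead) / 1e9


n = 9e9  # the 9B running example
for label, bits in [("4-bit", 4), ("FP16", 16), ("FP32", 32)]:
    print(f"{label:>5}: {weight_gb(n, bits):5.1f} GB")

# Same 4-bit model with an assumed 10 percent for scales and zero points.
print(f"4-bit plus metadata: {weight_gb(n, 4, 0.10):.2f} GB")
```

Swapping in a different parameter count or bit-width gives the W term for any other model.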

KV cache. The attention layer caches the key and value projections of every prior token so the next token’s attention can read them without recomputing. The cache grows linearly with sequence length and linearly with concurrent requests. For a multi-head-attention model, the per-token footprint is:

\[ \text{KV}_{\text{per token}} = 2 \times L \times H_{\text{kv}} \times d_{\text{head}} \times p \]

L is the number of transformer blocks stacked in the model. \(H_{\text{kv}}\) is the number of key-value heads (the “KV head count,” not the full attention head count on grouped-query and multi-latent variants). \(d_{\text{head}}\) is the per-head dimension. p is the bytes per cached element. The leading 2 accounts for storing both K and V. Multiply the whole expression by sequence length to get the per-request cache. Multiply that by concurrency to get the serving total.
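The per-token expression translates directly into code. This is a minimal sketch: the function names are hypothetical, and the shape used in the example (32 blocks, 8 KV heads, head dimension 128, FP16 cache) is an illustrative input, not something read from a real config.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2) -> float:
    """2 (K and V) x layers x KV heads x head dimension x bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(per_token: float, seq_len: int, concurrency: int = 1) -> float:
    """Per-request cache is per-token bytes times sequence length;
    the serving total scales again with concurrent requests."""
    return per_token * seq_len * concurrency


# Illustrative shape: 32 blocks, 8 KV heads, head dimension 128, FP16 cache.
per_token = kv_bytes_per_token(32, 8, 128, 2)
one_request = kv_cache_bytes(per_token, 32 * 1024)
four_requests = kv_cache_bytes(per_token, 32 * 1024, concurrency=4)

print(f"per token:     {per_token:,.0f} bytes")
print(f"one request:   {one_request / 1024**3:.1f} GiB at 32K")
print(f"four requests: {four_requests / 1024**3:.1f} GiB at 32K")
```

The two multiplications are kept as separate functions because sequence length and concurrency are the two knobs an operator actually turns.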

This line in the budget surprises new operators more than any other. We walk through it concretely in the next section.

Activations. The transient buffers the forward pass allocates while it computes attention and feed-forward layers. Two modes of the model have different activation profiles. During prefill [the one-shot pass that processes the whole prompt in parallel before generation starts], activations scale with batch times sequence length times hidden dimension and can dominate the budget on long prompts. During decode [the token-by-token generation phase that follows prefill, one word at a time], only a single new token moves through per step and activations collapse to a small per-step buffer. For a single user summarizing a single paper, activations during decode sit between 200 MB and 1 GB and act as a flat line in the budget. For prefill on a forty-page paper, they spike up and then release.

FlashAttention (Dao et al. 2022; Dao 2023) changed the activation story during prefill. The trick is to keep the attention matrix in on-chip SRAM [the tiny, fast memory that lives on the GPU die itself, measured in megabytes rather than gigabytes] instead of writing it out to VRAM. The attention matrix never fully materializes in main memory, so the activation term during prefill does not blow up with sequence length the way it used to. This is why the 2026 VRAM picture differs sharply from the 2022 picture: the same model at the same context takes measurably less activation memory on a modern inference stack.

Workspace. The inference runtime’s scratch buffers. cuBLAS, cuDNN, and their AMD equivalents stage tiles of matrices for the GEMM [general matrix multiply, the single operation that dominates inference compute] kernels. The attention implementation tiles its own intermediates. The paged allocator keeps free-block bookkeeping. llama.cpp and ollama usually spend 500 MB to 1.5 GB here. vLLM with PagedAttention (Kwon et al. 2023) tends to run higher, 1.5 GB to 3 GB, because the paged block pool preallocates to reduce fragmentation during bursty traffic. The exact number is backend-specific and worth looking up for your stack.

Headroom. Real allocators fail before they hit literal capacity. CUDA’s memory pool fragments under mixed-size allocations. ROCm’s HIP allocator does the same with different failure modes. Leave at least 5 to 10 percent of total VRAM unused on top of everything else (the sizing rule and worked examples in this chapter use a more conservative 15 percent), or plan to spend your afternoon chasing out-of-memory errors that only surface after forty minutes of sustained generation.

Here is the short form, the version you memorize for the back of a napkin:

\[ \text{VRAM}_{\text{needed}} \approx W + (\text{context} \times \text{per-token KV}) + 2\text{ GB} \]

The “+ 2 GB” folds activations, workspace, and headroom together into a single term roughly correct for a single-user consumer-class workload, the case where you are summarizing your paper on your own desktop. Multiply the KV term by concurrency if you are serving the summary as an API to many users. That is the number you compare against the sticker on the GPU box.
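The napkin version fits in one function. This is a sketch under the same assumptions as the short form: the name napkin_vram_gb is hypothetical, the fixed 2 GB term is the chapter's single-user rule of thumb, and per-token KV is passed in MB with 1 GB treated as 1024 MB.

```python
def napkin_vram_gb(weights_gb: float, context_tokens: int, kv_mb_per_token: float,
                   concurrency: int = 1, fixed_gb: float = 2.0) -> float:
    """Short-form estimate: weights + (context x per-token KV x concurrency) + fixed term."""
    kv_gb = context_tokens * kv_mb_per_token * concurrency / 1024
    return weights_gb + kv_gb + fixed_gb


# Example inputs: 4.9 GB of weights, 32K context, 0.125 MB of KV per token.
single = napkin_vram_gb(4.9, 32 * 1024, 0.125)
two_users = napkin_vram_gb(4.9, 32 * 1024, 0.125, concurrency=2)
print(f"single request: {single:.1f} GB")
print(f"two concurrent: {two_users:.1f} GB")
```

Compare the result against the card's VRAM, and remember that the fixed term undercounts on serving stacks that preallocate a block pool.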

Never fill the VRAM. The card that “fits” your model with 200 MB to spare will spend its life crashing on the thousand-and-first token of context or on the eighth concurrent summary request. Size the card to hold the model, your target context, and 15 to 20 percent free. Buying a card because it precisely fits the model you run today is the classic mistake of someone who has never watched an inference process die at 3 a.m.

Every architectural input to the formula comes straight from the model’s config file. For any Hugging Face model, config.json gives you num_hidden_layers (that is L), num_key_value_heads (\(H_{\text{kv}}\)), and head_dim (or compute it from hidden_size / num_attention_heads). The total parameter count is not a config.json field; read it from the model card. EleutherAI’s Transformer Math 101 (Anthony et al. 2023) and its companion calc_transformer_mem.py tool (EleutherAI 2024) remain the canonical references for this accounting. LMCache’s online calculator (LMCache project 2025) lets you check your spreadsheet against a second implementation. Picking the right card is arithmetic, and the arithmetic is public.
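Pulling those fields out of a config takes a few lines. The sketch below parses an inline JSON fragment rather than a real file; the fragment is illustrative, its hidden_size is an assumption chosen so that hidden_size / num_attention_heads gives a 128-wide head, and the helper name kv_shape_from_config is made up for this example.

```python
import json

# Illustrative config fragment, not copied from any published model.
SAMPLE_CONFIG = """
{
  "num_hidden_layers": 80,
  "num_attention_heads": 64,
  "num_key_value_heads": 8,
  "hidden_size": 8192
}
"""


def kv_shape_from_config(cfg: dict) -> tuple:
    """Return (layers, kv_heads, head_dim) for the per-token KV formula."""
    layers = cfg["num_hidden_layers"]
    q_heads = cfg["num_attention_heads"]
    # Configs without a KV head count use one KV head per query head.
    kv_heads = cfg.get("num_key_value_heads", q_heads)
    # head_dim is sometimes explicit; otherwise derive it from hidden_size.
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // q_heads
    return layers, kv_heads, head_dim


layers, kv_heads, head_dim = kv_shape_from_config(json.loads(SAMPLE_CONFIG))
per_token_bytes = 2 * layers * kv_heads * head_dim * 2  # FP16 cache, 2 bytes per element
print(layers, kv_heads, head_dim, per_token_bytes)
```

For a real model, replace the inline string with the contents of the downloaded config.json.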

23.2 KV cache sizing: the context multiplier

Here is the number that surprises everyone.

Take Llama 3.1 70B, one of the larger open-weight models you might consider for serious paper-summary work. Eighty transformer blocks stacked on top of each other, eight KV heads (grouped-query attention groups 64 query heads down to 8 KV heads, which we explain below), head dimension 128, FP16 KV storage. Per token, the KV cache takes:

\[ 2 \times 80 \times 8 \times 128 \times 2 \text{ bytes} = 327{,}680 \text{ bytes} \approx 0.31 \text{ MB} \]

A third of a megabyte per token sounds small. Walk it up.

At 8K context (a medium blog post): 2.5 GB of cache. At 32K (our forty-page paper plus the prompt plus the running draft plus some slack): 10 GB. At 128K (a short novel): 40 GB.

Pause on that. The 70B weights at FP16 occupy 140 GB. Holding a single 70B conversation at a context the length of a short novel, roughly the length of The Great Gatsby, adds KV storage equivalent to half the contents of an H100. The cache on a long-context conversation is not a footnote; it is an appreciable fraction of the model itself. Eight concurrent users summarizing their own papers at 32K context turn the KV cache into the dominant line item in your memory budget, larger than the weights you were so careful to quantize. Almost every serving paper published since 2023 has been about managing the KV cache better, not about serving the weights better (Kwon et al. 2023; Kwon 2025).

Now rerun the math on a hypothetical 70B without grouped-query attention, the architecture that was standard before 2023. All 64 attention heads carrying their own K and V:

\[ 2 \times 80 \times 64 \times 128 \times 2 \text{ bytes} = 2{,}621{,}440 \text{ bytes} \approx 2.5 \text{ MB per token} \]

At 32K context: 80 GB of cache. More than the same model’s weights at 8-bit quantization. This is the regime that made long context economically infeasible before 2023, and it is why grouped-query attention (GQA; Ainslie et al. 2023) is not a footnote. GQA shrinks the KV cache by the ratio of query heads to KV heads. The saving is 8x on the Llama 3.1 70B above (64 query heads sharing 8 KV heads) and 5x on the Qwen 3.5 9B sized in Section 23.6 (5 query heads per KV head). That factor is the difference between “you can advertise long context as a feature” and “long context crashes your allocator after the fourth page of the paper.”
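The grouped-query saving is easy to check numerically. The sketch below compares the 8-KV-head and 64-KV-head variants of the 80-block shape used in this section at 32K context; the helper name kv_gib is hypothetical, and the arithmetic uses binary units without intermediate rounding.

```python
GIB = 1024 ** 3


def kv_gib(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
           bytes_per_elem: float = 2) -> float:
    """KV cache for one request in GiB, from the per-token formula."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len / GIB


context = 32 * 1024
gqa = kv_gib(80, 8, 128, context)    # 8 KV heads (grouped-query attention)
mha = kv_gib(80, 64, 128, context)   # 64 KV heads (every head keeps its own K and V)

print(f"GQA cache at 32K: {gqa:.0f} GiB")
print(f"MHA cache at 32K: {mha:.0f} GiB")
print(f"saving: {mha / gqa:.0f}x")
```

The ratio of the two results is exactly the ratio of query heads to KV heads, which is why the head counts are the first thing to read off a config.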

DeepSeek-V2 (DeepSeek-AI 2024) pushed the compression further. Their trick is called Multi-head Latent Attention (MLA). It compresses K and V jointly into a low-rank latent space [a smaller set of numbers that carry the same information, reconstructed when the attention layer needs them] and restores them at attention time. The compression runs about 10x over vanilla multi-head attention. MLA has a second-order effect we flagged in Section 4.3: attention under MLA packs enough arithmetic density to shift back toward compute-bound [bottlenecked by the GPU’s math units rather than by the speed of its memory] (Anonymous 2025). The KV cache on an MLA model shrinks enough that the old budget assumptions break in the operator’s favor.

The 2025 production answer for KV cache at long context: quantize the cache itself. FP8 [8-bit floating point, half the size of FP16] cuts it in half with near-zero accuracy loss on most workloads. INT4 cuts it by four with manageable loss. TurboQuant (Zandieh et al. 2026) reaches 3 bits per channel with no measurable loss, 2.5 bits with marginal degradation. We cover this fully in Chapter 24. For this chapter’s purposes, two things matter. First, check whether your serving stack supports KV quantization before you size the GPU. If it does, you can halve or quarter the K term in the formula. Second, if you run llama.cpp or ollama today on consumer hardware, KV quantization ships but does not default on. The -ctk q8_0 -ctv q8_0 flags (or their equivalents) earn their keep on any long-context workload.
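The effect of cache quantization on the K term can be sketched by changing the bytes per cached element. The table of element sizes below is idealized (2, 1, and 0.5 bytes); real quantized cache formats carry a small per-block scale overhead that this ignores, and the names BYTES_PER_ELEM and kv_gib are made up for the example.

```python
GIB = 1024 ** 3

# Idealized element sizes; real formats add per-block scale metadata on top.
BYTES_PER_ELEM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def kv_gib(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
           cache_dtype: str) -> float:
    """KV cache for one request in GiB at the given cache precision."""
    p = BYTES_PER_ELEM[cache_dtype]
    return 2 * n_layers * n_kv_heads * head_dim * p * seq_len / GIB


# The 80-block, 8-KV-head, 128-dim shape from this section at 128K context.
for dtype in BYTES_PER_ELEM:
    print(f"{dtype:>5}: {kv_gib(80, 8, 128, 128 * 1024, dtype):.0f} GiB")
```

Only the K term moves; weights, activations, and workspace are untouched by the cache precision.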

The KV cache is the context tax, and it is charged per request. Weights are paid once at load time. Cache is paid per token per concurrent request. A 16 GB card that holds a 9B model at 4-bit with headroom for 8K of context can hold the same model at 64K only by giving up on concurrency, and vice versa. The same card cannot summarize your paper at full length and answer another user’s question simultaneously. Before you buy a card, sketch the cache math at your target context and your target concurrency, not at the best-case single-request lightweight chat.
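The trade between context and concurrency can be made concrete with a small calculation. This is a sketch under stated assumptions: 4.9 GB of weights, 0.125 MB of KV per token, 1.5 GB for activations and workspace, and 15 percent headroom on a 16 GB card. The function name max_concurrency is hypothetical, and 1 GB is treated as 1024 MB.

```python
import math


def max_concurrency(card_gb: float, weights_gb: float, kv_mb_per_token: float,
                    context_tokens: int, fixed_gb: float = 1.5,
                    headroom_frac: float = 0.15) -> int:
    """How many simultaneous requests fit at a given context length."""
    free_gb = card_gb * (1 - headroom_frac) - weights_gb - fixed_gb
    per_request_gb = kv_mb_per_token * context_tokens / 1024
    return max(0, math.floor(free_gb / per_request_gb))


for ctx_k in (8, 16, 32):
    n = max_concurrency(16, 4.9, 0.125, ctx_k * 1024)
    print(f"{ctx_k:>2}K context: {n} concurrent request(s)")
```

Each doubling of context roughly halves the number of requests the same card can hold, which is the context tax in one line of output per setting.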

The runtime-overhead term, the gap between formula prediction and what llama.cpp actually allocates, lands between 0.5 and 2 GB on the cards I have measured against. Plan for it explicitly: if the formula says 14.5 GB, treat your 16 GB card as full.

23.3 What fits on what: the GPU-to-model sizing guide

The table below is the payoff of the formula. Each cell reads as “this model class, at this quantization, with this much context, fits comfortably with headroom on a card of this VRAM tier.” “Comfortably” means the formula leaves 15 percent slack. “Spills” means the model loads but the KV cache crowds the budget hard enough that usable context falls under 4K, or the runtime starts paging weights to system RAM (the RAM cliff, Section 22.5).

Read this table with the paper-summary task in mind. The forty-page paper we have been following runs about 20,000 tokens, and a safe target context is 32K to give the model room for the prompt, the paper, and its running draft of the summary. Find that cell on your budget row and you know whether your planned card can summarize the paper.

The model classes in the table are the grouped-query-attention dense models that dominate 2025 to 2026 open weights: Qwen 3, Llama 3 or Llama 4, Mistral Small, and comparable families. Mixture-of-experts models [MoE, architectures that route each token through only a subset of the model’s weights on each step] with small active sets, like Mixtral-style architectures, occupy more VRAM than the active parameter count suggests. The weights for all the experts have to sit in memory even though only a few are used per token. Size MoE models against their total parameter count, not their active count.

| VRAM | Typical cards | Comfortable (Q4 unless noted) | Maximum dense | Long context notes |
| --- | --- | --- | --- | --- |
| 8 GB | RTX 3060 8GB, 4060, 9060 LP | 7B at 8K | 7B at 4-bit | 14B spills; 8K to 16K on 7B |
| 16 GB | RX 9060 XT, RTX 4060 Ti 16GB, RTX 4080 Super | 13B at 8K, 9B at 32K | 14B at 4-bit or 9B at Q5 | Consumer sweet spot. 32K on 7B to 9B realistic |
| 24 GB | RTX 3090, 4090, RX 7900 XTX | 20B at 16K, 14B at 64K | 30B at Q4 with tight context | 70B needs offload (pays RAM cliff) |
| 32 GB | RTX 5090, Apple M4 Pro class | 30B at 16K | 30B at Q5 to Q6 | 128K on 13B realistic |
| 48 GB | RTX A6000, L40S, dual-4090 (no NVLink on 4090) | 70B at 8K | 70B at Q4 with working context | Single-node 30B production serving |
| 80 GB | H100, A100 80GB | 70B at 32K, 120B at Q4 | 70B at Q8 | Frontier-research single card |
| 141 GB | H200 | 70B at Q8 with 64K, 120B at Q8 | 120B at Q6 to Q8 | Production serving; 400B class needs multiple cards |
| 192 GB | MI300X, Apple M2/M3 Ultra | 70B at FP16 with large batch | 120B at Q8 with room | MoE-friendly; shared-memory on Apple changes math |

Two sharp edges on this table deserve naming in plain language.

First, consumer NVLink is gone. NVLink [a direct high-bandwidth cable that lets two NVIDIA GPUs talk to each other at hundreds of GB/s, bypassing the motherboard] shipped on the RTX 3090 and was quietly dropped on the 4090 and everything after. If you try to combine two consumer cards to reach a 70B model, the two cards have to talk over PCIe [the general-purpose bus every GPU plugs into, shared with every other device in the machine] instead of NVLink, and PCIe caps out at about 32 GB/s one-directional on PCIe 4.0 x16. Compare that to the 600 to 900 GB/s you get between two datacenter cards on an NVLink-equipped board. Twenty to thirty times slower.

Here is what that means for the paper-summary task in practice. Pipeline parallelism [splitting the model across two cards so each card runs a different chunk of the layers] tolerates slow interconnect: each card finishes its chunk of the layers and hands off a small amount of data to the next card. Tensor parallelism [splitting each layer across two cards so they do the same work in parallel on half the data] does not tolerate slow interconnect at all: the two cards have to exchange partial results inside every single attention step. On consumer PCIe, tensor parallelism adds enough communication overhead that a two-card setup runs slower than a single card that fits the whole model. A single card that fits beats two cards that have to talk over PCIe, almost every time (Meta Engineering 2025; AMD ROCm Team 2025). If you cannot afford a single card that fits your model, the answer is usually to pick a smaller model, not to buy two cards.
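The interconnect gap can be put in numbers with the bandwidth figures quoted above. The sketch is illustrative: the 64 MB payload is a hypothetical amount of data exchanged between cards, not a measured value, and the helper name transfer_ms is made up for this example.

```python
PCIE4_X16_GB_S = 32.0            # one direction, figure quoted in the text
NVLINK_GB_S = (600.0, 900.0)     # datacenter range quoted in the text

# How many times slower PCIe is than each end of the NVLink range.
slowdowns = [nv / PCIE4_X16_GB_S for nv in NVLINK_GB_S]


def transfer_ms(payload_mb: float, link_gb_s: float) -> float:
    """Time to move a payload across a link, ignoring latency and protocol overhead."""
    return payload_mb / 1000 / link_gb_s * 1000


payload_mb = 64  # hypothetical per-step exchange between two cards
print(f"PCIe is {slowdowns[0]:.1f}x to {slowdowns[1]:.1f}x slower than NVLink")
print(f"PCIe 4.0 x16: {transfer_ms(payload_mb, PCIE4_X16_GB_S):.2f} ms")
print(f"NVLink (900): {transfer_ms(payload_mb, NVLINK_GB_S[1]):.3f} ms")
```

A delay that is paid once per layer handoff is tolerable; the same delay paid inside every attention step is what sinks tensor parallelism on PCIe.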

Second, Apple Silicon breaks the table. An M2 Ultra with 192 GB of unified memory [RAM shared between CPU and GPU, with no separate VRAM pool, so the model sees all of system RAM as its own memory] holds a 70B model at 4-bit with 100+ GB of remaining context and workspace headroom. No discrete consumer GPU at any price matches that capacity. The catch is bandwidth: Apple’s unified memory runs around 800 GB/s on the M2 Ultra, roughly a quarter of an H100’s HBM3 [high-bandwidth memory, the stacked-die memory datacenter GPUs use for the fastest possible throughput] bandwidth. The model fits; decode runs slower per token. Rajput et al. (2025) measured this on real workloads. MLX on M2 Ultra reaches 230 tok/s on small models. llama.cpp reaches 150 tok/s. 70B-class models run at low-teens tok/s, still fast enough to read along with as the summary streams in. Apple Silicon is the correct answer for operators who want to hold a 70B model at home and do not need datacenter latency. It is the wrong answer for multi-tenant serving or for workloads that care about first-token latency on long prompts.

23.4 Consumer card landscape, April 2026

The consumer GPU market for LLM work in April 2026 is pluralist in a way it was not in 2023. NVIDIA still has the best software story. AMD’s price-per-gigabyte puts it in serious contention for the budget tier. Apple Silicon answers a specific class of operator neither discrete vendor reaches. What follows is the honest comparison without picking a winner. We walk each card through the paper-summary lens: can it hold Qwen 3.5 9B at Q4 with 32K context and produce a five-bullet summary of your paper in under a minute, and what does that cost?

NVIDIA RTX 5090 (Blackwell, January 2025, $1,999 MSRP). 32 GB GDDR7 [the latest generation of graphics memory, a step up from GDDR6 in bandwidth and power efficiency], 1.79 TB/s memory bandwidth, 104 TFLOPS [tera floating-point operations per second, a measure of raw compute throughput] at FP16. For scale on the bandwidth number: 1.79 TB/s means the GPU can read or write every byte of its own 32 GB of memory about 56 times per second. The 4090’s memory ran at roughly 1 TB/s, so the 5090 is 78 percent faster at reading its own memory. For the bandwidth-bound decode workloads we covered in Section 4.3 (every token the paper-summary generates is a read of the entire model, then a small amount of math), that bandwidth uplift translates roughly one-for-one into per-token throughput on any model that fits. 32 GB VRAM covers 30B dense at Q5 or Q6, 70B at aggressive quantization with short context, and long-context 13B class work. Street price has wandered between the $1,999 MSRP and $2,400 to $2,800 in supply-constrained periods since launch. Check the market before you budget (NVIDIA 2025).
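If every decoded token reads the whole model once, memory bandwidth sets a ceiling on tokens per second. The sketch below computes that ceiling for a 4.9 GB model. It is an upper bound under a simplifying assumption (weights only, no KV cache reads, no kernel overhead), so measured throughput will land below it; the function name is hypothetical and the 4090 figure uses the rounded 1 TB/s from the text.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, bytes_read_per_token_gb: float) -> float:
    """Upper bound on decode speed when each token requires one full read of the weights."""
    return bandwidth_gb_s / bytes_read_per_token_gb


model_gb = 4.9  # 9B at 4-bit including quantization metadata
cards = {"RTX 5090": 1790.0, "RTX 4090": 1000.0}  # GB/s, figures from the text

for name, bw in cards.items():
    print(f"{name}: at most {decode_ceiling_tok_s(bw, model_gb):.0f} tok/s")
```

Because the model size cancels out, the ratio between two cards' ceilings is just the ratio of their bandwidths.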

NVIDIA RTX 4090 (Ada, 2022, $1,599 MSRP). 24 GB GDDR6X, about 1 TB/s bandwidth. Still the workhorse for rigs purchased in 2023 to 2024, and still available new or lightly used in April 2026 at $1,400 to $1,800. For 13B to 20B dense work at Q4 or Q5, it remains the reference card. If you already own one, keep it. If you are buying fresh, the 5090 is the better choice unless you find a used 4090 at a genuine discount.

AMD Radeon RX 9060 XT 16GB (RDNA4, June 2025, $349 MSRP). 16 GB GDDR6, 320 GB/s memory bandwidth, 25 TFLOPS FP16 (AMD 2025; TechSpot 2025; Tom’s Hardware 2025). The budget local-LLM card, if you accept the Vulkan or ROCm-workaround ecosystem. At $349, the card costs about $22 per gigabyte of VRAM, roughly a third of the $62 to $67 per gigabyte of the RTX 5090 and 4090 at the MSRPs quoted above, and the 16 GB lands squarely in consumer-sweet-spot territory: 7B to 13B at Q4 to Q5 with 16K to 32K of context sit comfortably. For the paper-summary task, this is the rational consumer answer. You spend less on the card than two years of moderate API usage, and you get summaries on your own hardware with no per-token cost.

The catch is software. ROCm’s support for RDNA4 consumer cards ran uneven through 2025 and early 2026. The reliable path is Vulkan through llama.cpp, which works today and scales with the quantization level. I run this card as my benchmark rig, and the first-party measurements throughout Part V come from it.

AMD Radeon RX 7900 XTX (RDNA3, 2022, 24 GB, about 960 GB/s, around $850 used in April 2026). The 24 GB consumer AMD option. Better ROCm support than RDNA4 as of early 2026, because the card has been in the wild longer and the upstream patches have landed. For operators who need 24 GB on an AMD stack, the 7900 XTX beats newer RDNA4 cards with smaller software footprints.

Apple Silicon M3 and M4 Max or Ultra (unified memory, 64 GB to 192 GB). A Mac Studio with M4 Max at 128 GB starts around $4,000 in April 2026. The M2 Ultra configuration with 192 GB sits closer to $7,000. The unified-memory architecture reframes the whole VRAM question: there is no VRAM, just RAM, shared between GPU and CPU. If the model fits in system memory, it fits. Bandwidth caps decode, not capacity. For single-user work that needs to hold a 70B model, no discrete consumer option matches the memory-fit per dollar. Rajput et al. (2025) provide the production-grade measurement: MLX is the fastest backend on Apple, llama.cpp sits close behind, and both run fast enough for conversational interaction with 70B-class models. Your forty-page paper, summarized by a 70B model, on a machine that sits silently on your desk: this is the only consumer hardware that does that today.

The matrix the vendor pages do not surface:

| Vendor | Software story | $/GB | Sweet spot |
| --- | --- | --- | --- |
| NVIDIA | Best. CUDA is the default; vLLM, TensorRT-LLM, and every paper target it first. | Worst. You pay for the software. | 13B to 30B production work, multi-tenant serving, research replication. |
| AMD | Middle. ROCm is competitive on datacenter; consumer RDNA4 support has gaps. Vulkan via llama.cpp is the community escape hatch and works well. | Best. About 3x better $/GB than the RTX 4090 and 5090 at MSRP. | Budget-constrained consumer local LLM, single-user work. |
| Apple | Growing fast. MLX is now first-class for Apple Silicon; llama.cpp and ollama work well. Python ML ecosystem still mostly targets CUDA, so bench-to-production requires care. | Middle. Unified memory means no discrete-card comparison is apples-to-apples. | 70B+ single-user local inference where discrete GPUs cannot fit the weights. |
| Datacenter (H100, H200, MI300X) | Best, by definition. Production-grade software, tested under load, every vendor library supports them. | Not meaningful at individual scale. Rent, don’t own, unless you are operating continuously. | Multi-tenant serving, frontier-scale models, continuous high-utilization workloads. |

One thing to notice about this table. The four rows are four different shapes of the same trade, made of the same parts: memory capacity, memory bandwidth, compute throughput, and software maturity. NVIDIA buyers pay the most and get the best software. AMD buyers pay the least and accept the roughest consumer software path. Apple buys its way out of the trade entirely by sharing the CPU’s RAM with the GPU, which moves the whole question to a different axis. Datacenter cards win on every axis but lose on price to anyone not running them continuously. Same four ingredients, four different mixes, four different operators served.

The honest framing for April 2026: NVIDIA if you have the budget and want the software to work the first time. AMD if you are a budget-constrained developer willing to run llama.cpp on Vulkan. Apple if you need to hold a 70B model at home. Datacenter rentals if your utilization justifies the hourly rate. None of these is universally right. The axis that matters most is how much of your time you are willing to spend on the software stack, weighed against paying a premium to have it Just Work. I run a 9060 XT because the $/GB math was irresistible and I was willing to do the ROCm-workaround legwork. An operator who bills $150 an hour should buy the NVIDIA card and get back to their actual work.

Two inputs I did not put in the table. One is power draw, which matters at the datacenter tier but rarely binds a single-developer machine unless the PSU sits at its limit. The other is thermal, which starts to matter the moment you try to put two cards in one case. Both are real and both are project-specific. Check your case, your PSU, and your room cooling before you finalize a two-card build.

A word on the vendor claims themselves, since we have just recited a lot of marketing copy. NVIDIA says the 5090 is 78 percent faster than the 4090 on the bandwidth number. That claim is reproducible on a bandwidth microbenchmark, and it does track decode throughput on bandwidth-bound workloads. But “78 percent faster” on the bandwidth benchmark is not “78 percent faster on your paper-summary task.” Other bottlenecks (prefill, sampling, batching overhead) cap the end-to-end speedup. A simpler description, which vendor pages do not lead with, is that the 5090 is roughly 1.5 to 1.7x faster than the 4090 on realistic single-user decode workloads, closer to 1.8x on ideal benchmarks. Sagan’s baloney-detection rule applies here (Sagan 1995): is the number reproducible, does a simpler explanation fit, and who benefits from the bigger version of the claim? The smaller version is honest; the larger version is the one you see in the presentation.

23.5 Datacenter cards: H100, H200, MI300X

Datacenter cards are not usually a consumer purchase, but they rent by the hour from every major cloud. The rental economics are what most operators care about when deciding whether to put the paper-summary pipeline on their own hardware or on somebody else’s. Four cards dominate the April 2026 landscape.

NVIDIA H100 (2022, 80 GB HBM3, 3.35 TB/s). The workhorse. Every serving stack targets it. Every benchmark reports against it. Cloud hourly rates in April 2026 sit around $2 to $4 per hour on-demand, lower on reserved. For single-node 70B serving at Q8 [8-bit quantization, higher quality than Q4 and half the size of FP16] or 120B serving at Q4, the H100 remains the default. If you are replicating a paper or deploying a model a lab ships, the H100 is the instrument of record.

To anchor the bandwidth number: 3.35 TB/s is roughly ten times the 320 GB/s of the RX 9060 XT and a little under twice the RTX 5090’s 1.79 TB/s. The H100 can read its own 80 GB of HBM about 42 times per second. A 70B model at FP16 sits in 140 GB, which overflows the H100, which is why you run it at Q8 and land under 80 GB.

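NVIDIA H200 (2024 refresh, 141 GB HBM3e, 4.8 TB/s). Same Hopper [the architecture generation before Blackwell, introduced in 2022] architecture as the H100. More memory, faster memory. NVIDIA’s own measurements report 2x inference throughput on Llama 2 70B against the H100 (NVIDIA 2026). The bandwidth uplift alone is 4.8 / 3.35, about 1.4x, so bandwidth-bound decode accounts for only part of that gain; the rest presumably comes from the larger memory, which leaves room for bigger batches. 141 GB of VRAM holds a 70B model at Q8 with generous context headroom. A 70B at FP16 needs 140 GB for the weights alone, which leaves no room for cache on a single card, and a 405B at Q4 needs roughly 200 GB, which means more than one H200. Hourly rental sits in the $3 to $5 range on the major clouds.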

AMD Instinct MI300X (2023, 192 GB HBM3, 5.3 TB/s). AMD’s challenger, and a serious one on paper. The memory capacity edge over the H200 is real, and it matters for large MoE models and extra-long context workloads. ROCm support for the datacenter line has held up longer than for consumer RDNA, and the production serving stack (vLLM on ROCm, TGI, SGLang) has closed most of the gap by early 2026. Operators who specifically need the memory capacity find the MI300X competitive. Operators who want the default path still pick NVIDIA (AMD 2026).

NVIDIA Blackwell B200 and GB200 (2025). NVIDIA’s 2025 to 2026 datacenter flagship. Priced and allocated for hyperscalers, rented via the largest cloud providers at premium rates. FP4 [4-bit floating point, the most aggressive quantization format in current production hardware] tensor cores push per-watt throughput above Hopper, and the NVLink fabric on GB200 systems makes multi-GPU tensor parallelism cheaper. For an individual operator or a small SaaS, Blackwell remains largely inaccessible in April 2026. Note its existence; do not pretend you can rent one casually.

Here is the operational point about datacenter cards, brought back to the paper-summary task. The rental economics win decisively at low utilization. Take an H100 with a $40,000 sticker, used eight hours a week at $3 an hour rental: that amortizes to $1,248 a year of cloud spend against the $40,000 outlay, plus depreciation, cooling, and the rest. The crossover where owning beats renting sits at continuous or near-continuous utilization. Shi et al. (2025) worked the full total cost of ownership for an 8x H100 cluster and landed the break-even at roughly 8,556 hours of continuous use. That is close to a year of round-the-clock operation. If you are summarizing three papers a week, the math says rent. If you are running a paper-summary service that processes a hundred papers an hour, the math says buy. We return to this arithmetic in Chapter 21.
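The low-utilization side of that comparison is a two-line calculation. The sketch below is deliberately naive: it counts only the purchase price against the hourly rate and ignores power, cooling, depreciation, and resale, so it is not the total-cost-of-ownership model cited above and its break-even will not match that study's figure. The function names are hypothetical.

```python
def annual_rental_cost(hours_per_week: float, rate_per_hour: float,
                       weeks_per_year: int = 52) -> float:
    """Yearly cloud spend at a steady weekly usage."""
    return hours_per_week * weeks_per_year * rate_per_hour


def naive_breakeven_hours(purchase_price: float, rate_per_hour: float) -> float:
    """Rented hours whose cost equals the sticker price (purchase price only)."""
    return purchase_price / rate_per_hour


yearly = annual_rental_cost(8, 3.0)
hours = naive_breakeven_hours(40_000, 3.0)
years_at_8h_per_week = hours / (8 * 52)

print(f"rental at 8 h/week: ${yearly:,.0f} per year")
print(f"naive break-even:   {hours:,.0f} rented hours")
print(f"at 8 h/week that is {years_at_8h_per_week:.0f} years")
```

At a few hours a week the break-even sits decades out, long past the useful life of the card.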

23.6 Worked example: sizing a single card for Qwen 3.5 9B

Time to run the formula on an actual configuration. Back to the paper-summary task one more time, with real numbers attached.

You have a forty-page research paper you want summarized into five bullets. Forty pages of dense academic text runs roughly 20,000 tokens. Add the prompt, the running draft of the summary, and a little slack, and you are looking at a target context window of 32K to be safe. Qwen 3.5 9B is the model we will size for: 9 billion parameters, 32 transformer blocks, 8 key-value heads (grouped-query attention groups 5 query heads per KV group), head dimension 128, hidden dimension 5120.

Run the formula line by line.

Weights at Q4_K_M. 9 billion * 4 bits / 8 = 4.5 GB. Add roughly 10 percent for quantization metadata. Call it 4.9 GB.

Per-token KV cache at FP16. 2 * 32 * 8 * 128 * 2 bytes = 131,072 bytes = 0.125 MB per token.

At 32K context, single request. 32,768 tokens * 0.125 MB = 4 GB of KV cache.

Activations and workspace. Call it 1.5 GB for llama.cpp with the default flags.

Headroom. 15 percent of a 16 GB card is 2.4 GB.

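Total: 4.9 + 4 + 1.5 + 2.4 = 12.8 GB. The 9060 XT has 16 GB, so the paper summary fits with 3.2 GB to spare. The $349 card summarizes your forty-page paper with room left over. You could push context to 48K (6 GB of cache, 14.8 GB total) before the budget breaks. At 64K the total reaches 16.8 GB, which fits only by eating the headroom or quantizing the KV cache. At 128K the cache alone is 16 GB and you are out of room.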
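The line-by-line budget can be wrapped in one function. This is a sketch using the figures from the walkthrough: 4.9 GB of weights, 32 blocks, 8 KV heads, head dimension 128, FP16 cache, 1.5 GB for activations and workspace, and 15 percent headroom on a 16 GB card. The function name vram_budget_gb and its parameter names are made up for this example, and the cache is computed in binary units as in the text.

```python
GIB = 1024 ** 3


def vram_budget_gb(weights_gb: float, n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, card_gb: float, kv_bytes_per_elem: float = 2.0,
                   runtime_gb: float = 1.5, headroom_frac: float = 0.15,
                   concurrency: int = 1) -> dict:
    """Itemized VRAM budget: weights + KV cache + runtime + headroom."""
    kv_gb = (2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem
             * context_tokens * concurrency / GIB)
    headroom_gb = card_gb * headroom_frac
    total = weights_gb + kv_gb + runtime_gb + headroom_gb
    return {"weights": weights_gb, "kv": kv_gb, "runtime": runtime_gb,
            "headroom": headroom_gb, "total": total, "fits": total <= card_gb}


at_32k = vram_budget_gb(4.9, 32, 8, 128, 32 * 1024, card_gb=16)
at_128k = vram_budget_gb(4.9, 32, 8, 128, 128 * 1024, card_gb=16)

print(f"32K:  total {at_32k['total']:.1f} GB, fits: {at_32k['fits']}")
print(f"128K: total {at_128k['total']:.1f} GB, fits: {at_128k['fits']}")
```

Returning the itemized terms rather than a single number makes it obvious which line is responsible when a configuration fails to fit.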

Now try Qwen 3.5 14B at Q4_K_M on the same card, a bigger model for a summary pass that cares more about quality. Weights: 14 billion * 4 bits / 8 * 1.1 = 7.7 GB. Per-token KV on the 14B (40 layers, 8 KV heads, head dim 128, FP16): 2 * 40 * 8 * 128 * 2 = 163,840 bytes = 0.16 MB per token. At 16K context: 2.5 GB. Activations and workspace: 1.5 GB. Headroom: 2.4 GB. Total: 7.7 + 2.5 + 1.5 + 2.4 = 14.1 GB. The 14B with 16K context fits on the 16 GB card with about 1.9 GB slack. Enough for a half-paper, or a tightly-prompted summary, but not the full forty-page dump at once. Pushing context to 32K spills: 7.7 + 5.0 + 1.5 + 2.4 = 16.6 GB, which overruns the 16 GB card once headroom is counted and invites an out-of-memory crash mid-run. Quantizing the KV cache to FP8 halves the KV term and returns the total to 14.1 GB, which fits.
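The same budget can be inverted to ask how much context a card can hold. The sketch below assumes the 14B figures from this example (7.7 GB of weights, 163,840 bytes of KV per token at FP16, half that at FP8), 1.5 GB of runtime overhead, and 15 percent headroom. The function name max_context_tokens is hypothetical. The cache is computed in binary units without rounding the per-token figure, so results can differ from hand-rounded totals by about 0.1 GB.

```python
GIB = 1024 ** 3


def max_context_tokens(card_gb: float, weights_gb: float, kv_bytes_per_token: float,
                       runtime_gb: float = 1.5, headroom_frac: float = 0.15,
                       step: int = 1024) -> int:
    """Largest context (rounded down to a multiple of step) that fits for one request."""
    free_gb = card_gb * (1 - headroom_frac) - weights_gb - runtime_gb
    if free_gb <= 0:
        return 0
    tokens = int(free_gb * GIB // kv_bytes_per_token)
    return tokens - tokens % step


fp16_ctx = max_context_tokens(16, 7.7, 163_840)
fp8_ctx = max_context_tokens(16, 7.7, 163_840 / 2)

print(f"FP16 cache: up to {fp16_ctx // 1024}K tokens")
print(f"FP8 cache:  up to {fp8_ctx // 1024}K tokens")
```

Under these assumptions the 32K target falls between the two results, which matches the conclusion that the 14B needs a quantized cache to take the whole paper on a 16 GB card.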

This is the arithmetic you run before you choose a model, before you choose a quantization, and before you choose a context length. Three inputs, one equation, a clear answer.

The formula’s job is to tell you no. Most of its value is in ruling out configurations that will not fit, before you download 18 GB of weights to find out. “The 70B will fit on my 24 GB card” is the canonical statement made by someone who has not run the numbers and is about to learn what the RAM cliff feels like. Run the arithmetic first, check it against a calculator like LMCache’s (LMCache project 2025), and only then download the weights.
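
The cheapest "no" comes from the weight term alone. Here is a minimal sketch of that first check, using the same 4-bit, 10-percent-metadata estimate as the worked example; the function name and the list of cases are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float,
               metadata_overhead: float = 0.10) -> float:
    """Weight footprint in GB: params * bits / 8, plus quantization metadata."""
    return params_billion * bits_per_weight / 8 * (1 + metadata_overhead)


# (model size in billions of parameters, card size in GB)
cases = [(9, 16), (14, 16), (70, 24)]

for params_b, card_gb in cases:
    w = weights_gb(params_b, bits_per_weight=4)
    if w > card_gb:
        verdict = "ruled out on weights alone"
    else:
        verdict = "weights fit; now run the full sum"
    print(f"{params_b}B at 4-bit on {card_gb} GB: W = {w:.2f} GB, {verdict}")
```

A 70B model at 4 bits needs roughly 38.5 GB before a single token of KV cache is allocated, so the 24 GB card is out before the rest of the formula is even consulted.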

23.7 What to buy in April 2026

The practical recommendation, compressed, with the paper-summary task as the benchmark for each tier.

  • Entry: $350 to $400. AMD RX 9060 XT 16GB. You get 7B to 13B local inference at Q4 to Q5 with context to 32K. Your forty-page paper summarizes at about 150 tokens per second on Vulkan, which means a five-bullet summary arrives in under a minute. The software stack will cost you some hours on first setup. The hardware will cost you less than a month of API credits at moderate usage. If you are learning local inference or running an LLM occasionally for a single-developer workflow, this is the rational pick.

  • Solid middle: $1,500 to $2,000. NVIDIA RTX 5090 if you can find one at MSRP, RTX 4090 if you cannot and the used market cooperates. 24 to 32 GB of VRAM covers 20B to 30B dense models comfortably and handles long-context 13B workloads. The same paper summarizes on a 30B model that catches nuances the 9B misses, at roughly the same wall-clock time. CUDA software stack works the first time. This is the card for the operator who wants local inference to Just Work and is not trying to stretch every dollar.

  • 70B at home: $4,000 to $7,000. Mac Studio with M4 Max 128 GB or M2 Ultra 192 GB. The unified-memory architecture is the only way to hold a 70B model at home without datacenter hardware, and the tok/s stays conversational. The paper-summary at 70B arrives more slowly, closer to 15 seconds per paragraph than one second per paragraph, but it arrives, and the quality jump over 9B or 30B is the point. This is a genuine category, not a consolation prize.

  • Multi-tenant or production serving: rent, don’t buy. H100 at $2 to $4 per hour, H200 at $3 to $5 per hour on the major clouds. Owned hardware breaks even against rental only at near-continuous utilization, and almost no small operator runs that hot. If your paper-summary service sees five papers an hour, the cloud bill is a rounding error against the capital cost of an owned H100. Rent until the math flips, then buy.

  • Frontier scale: you are not reading this book for advice. If you deploy 405B-class models continuously, you have a procurement team and a sovereign power budget, and this chapter is not your reference. MI300X and H200 are the two serious options, but the procurement and deployment story at that scale is outside what one operator can write usefully.
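
The rent-versus-buy tier above reduces to one division. The sketch below shows it under loud assumptions: the purchase price is a hypothetical placeholder rather than a quoted figure, the rental rate is the midpoint of the H100 range given in the list, and power, hosting, and resale value are ignored.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def breakeven_months(purchase_usd: float, rental_usd_per_hour: float,
                     utilization: float) -> float:
    """Months until cumulative rental spend equals the purchase price.

    utilization is the fraction of wall-clock hours the card is busy.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    rented_hours_per_month = HOURS_PER_MONTH * utilization
    return purchase_usd / (rental_usd_per_hour * rented_hours_per_month)


PURCHASE_USD = 30_000      # hypothetical placeholder, not a quoted price
RENTAL_USD_PER_HOUR = 3.0  # midpoint of the $2 to $4 H100 range in the text

for utilization in (1.0, 0.25, 0.05):
    months = breakeven_months(PURCHASE_USD, RENTAL_USD_PER_HOUR, utilization)
    print(f"{utilization:>4.0%} busy: break-even after {months:.0f} months")
```

Plug in a current purchase quote and your own measured utilization. The shape of the result is what matters: break-even time scales inversely with utilization, so a card that is busy a few percent of the time takes decades to pay for itself.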

One honest caveat on all of this. Every number in this chapter is an April 2026 snapshot. The GPU market moves aggressively. NVIDIA’s Blackwell consumer refresh is expected in late 2026. AMD has RDNA5 on its roadmap. Apple’s M4 Ultra is widely rumored. The rental market shifts with datacenter capacity additions. Rerun the arithmetic against current prices before every purchase. The formula does not age. The numbers do.

23.8 Multi-card and the limits of the single-card formula

The sizing math here is a tool, not a conclusion. When you have more than one of these cards and have to split a model across them, interconnect becomes a line item the single-card formula does not include, and the bandwidth between cards (NVLink, PCIe, or whatever fabric you have) starts pinning the decode rate the same way memory bandwidth pins single-card decode.

The pluralism in this chapter is not decorative. Consumer NVIDIA, consumer AMD, Apple Silicon, and rented datacenter cards are each the right answer for a different operator. Reading this chapter as “here is the one GPU to buy” misses the point. The right card is the one whose memory tier holds your model with headroom, whose bandwidth supports your latency target, whose software stack you are willing to maintain, and whose price fits your amortization window. The formula tells you which memory tier you need. The rest is a decision about your time, your budget, and your tolerance for debugging kernel errors at 2 a.m. None of that is universal. All of it is arithmetic.

Close the chapter with the paper-summary task in your hands. You know the model (Qwen 3.5 9B or something in its class). You know the context (32K for the forty-page paper). You know the quantization (Q4_K_M as the default we landed on). You know the three numbers the card has to hold: weights, cache, and a couple of gigabytes of everything else. You know which tier of card meets the number. You know how much that tier costs in April 2026, and you know which tier wastes money for your use case. The next time someone tells you “you need an H100 to run a real model at home,” you will be able to check their arithmetic against yours.

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). https://arxiv.org/abs/2305.13245.
AMD. 2025. Radeon RX 9060 XT Product Documentation. Current product documentation. https://www.amd.com/en/products/graphics/desktops/radeon/9000-series/amd-radeon-rx-9060xt.html.
AMD. 2026. Instinct MI300X Accelerator Product Documentation. Current product documentation. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html.
AMD ROCm Team. 2025. Analyzing the Impact of Tensor Parallelism Configurations on LLM Inference Performance. ROCm Blogs. https://rocm.blogs.amd.com/artificial-intelligence/tensor-parallelism/README.html.
Anonymous. 2025. Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts. arXiv:2507.15465. https://arxiv.org/abs/2507.15465.
Anthony, Quentin, Stella Biderman, and Hailey Schoelkopf. 2023. Transformer Math 101. EleutherAI Blog. https://blog.eleuther.ai/transformer-math/.
Dao, Tri. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691. https://arxiv.org/abs/2307.08691.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2205.14135.
DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434. https://arxiv.org/abs/2405.04434.
EleutherAI. 2024. calc_transformer_mem.py — Transformer Memory Calculator. GitHub, EleutherAI/cookbook. https://github.com/EleutherAI/cookbook/tree/main/calc.
Kwon, Woosuk. 2025. vLLM: An Efficient Inference Engine for Large Language Models. EECS-2025-192. UC Berkeley EECS. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.pdf.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.
LMCache project. 2025. KV Cache Size Calculator. LMCache documentation. https://lmcache.ai/kv_cache_calculator.html.
Meta Engineering. 2025. Scaling LLM Inference: Innovations in Tensor Parallelism, Context Parallelism, and Expert Parallelism. Engineering at Meta blog. https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/.
NVIDIA. 2025. GeForce RTX 5090 Product Documentation. Current product documentation. https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/.
NVIDIA. 2026. H200 Tensor Core GPU Product Documentation. Current product documentation. https://www.nvidia.com/en-us/data-center/h200/.
Rajput, Subhash et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, Llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.
Sagan, Carl. 1995. The Demon-Haunted World: Science as a Candle in the Dark. Random House.
Shi, Jiawei et al. 2025. A Cost-Benefit Analysis of on-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services. arXiv:2509.18101. https://arxiv.org/abs/2509.18101.
TechSpot. 2025. AMD Radeon RX 9060 XT 16GB Review. TechSpot review. https://www.techspot.com/review/2996-amd-radeon-9060-xt/.
Tom’s Hardware. 2025. AMD Radeon RX 9060 XT 16GB Review: Plenty of Performance with 16GB. Tom’s Hardware review. https://www.tomshardware.com/pc-components/gpus/amd-radeon-rx-9060-xt-16gb-review.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2026. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” International Conference on Learning Representations (ICLR 2026, Accepted). https://arxiv.org/abs/2504.19874.