Part V — Bare Metal

Quantization in Depth

Quantization is the art of squeezing a model’s numbers into smaller boxes without breaking what those numbers say.

Here is the concrete stakes version, because an abstract version will slide off the page. You have the same forty-page research paper open that we have carried through the book. You want the five bullet points. This time you want them on your own hardware, no cloud, no API bill. You pick a model that looks right. You try to run it. It does not fit. Or it fits, runs for ten minutes, and gives you five bullet points that arrive long after you stopped caring.

The difference between a version of this that works and a version that does not is, almost always, quantization. Not which model you picked. Not how many parameters. The question of how many bits each of the model’s numbers gets written in when it lands on your card. That is the lever this chapter is about.

Chapter 3 made the case for doing it at all: store each weight in 8 bits instead of 16, or 4 instead of 8, and the speed at which the card can read those weights off memory rises in step, because there are fewer bits to read. That was enough motivation to reach for a quantized copy. It was not enough to pick the right one. Which quantized copy, for which model, on which card, doing which kind of work, to which accuracy budget, is a separate question, and it is the one that decides whether your summary arrives in thirty seconds or ten minutes.

The short version: the quantization landscape in April 2026 is a menu, not a ranking. Q4_K_M is the default most operators should reach for, and the word “default” is doing work. For named workloads, AWQ, GPTQ, QuaRot, SpinQuant, TurboQuant, EfficientQAT, or QServe is the better call. No one method has swallowed the others, and telling you otherwise would sell you a future that is not arriving.

This chapter also absorbs the backbone of the Synoros research article on VRAM, quality, and the RAM performance cliff (Costantino 2026). The article’s tables and rules of thumb are the operator truth most local-LLM users need. The book expands them with the academic literature underneath, the K-quant internals the article skipped, and the rotation-based methods covered in depth in Chapter 24.

22.1 What quantization is, and what it isn’t

Imagine writing down a long list of numbers with many decimal places. Now imagine rewriting the same list using whole numbers only, paired with a single shared rule that says “to get the real value back, multiply by 0.0032.” You have lost some precision on each number. You have saved a lot of paper. As long as the numbers you write down are close enough to the originals that whoever reads them gets the same answer they would have gotten before, the trade was worth it.

That is quantization. The model has hundreds of billions of numbers in it. We are writing them down in fewer bits each, paired with a shared scale, and hoping the answers do not change.

More carefully: quantization represents the model’s weights, and sometimes the intermediate numbers the model produces as it computes [called activations], and sometimes the attention state the model carries forward [the KV cache, explained in 3.3], as low-bit-width integers paired with a shared scale factor. A block of 32 or 64 weights shares one small floating-point scale, and each weight inside the block becomes a small integer that the decoder multiplies by that scale to recover an approximation of the original value. When the model runs, the instant a weight is needed for a multiplication [the matmul, the matrix multiply that is most of what a layer does], it gets decoded on the fly, multiplied against the activation, and the result accumulates at higher precision. The savings are real. The accuracy cost, with a good method, is small.
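To make the shared-scale idea concrete, here is a minimal sketch in plain Python. It assumes symmetric rounding against the largest magnitude in each block, with a block size of 32 and signed 4-bit integers. The function names and the test weights are made up for illustration; real formats pack the bits far more carefully than this.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block.

    Returns the shared float scale and one small signed integer per weight.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit integers
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    ints = [round(w / scale) for w in weights]
    return scale, ints


def dequantize_block(scale, ints):
    """Decode on the fly: the real value is approximately integer * scale."""
    return [scale * q for q in ints]


def quantize_tensor(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks, each with its own scale."""
    return [quantize_block(weights[i:i + block_size], bits)
            for i in range(0, len(weights), block_size)]


def dequantize_tensor(blocks):
    out = []
    for scale, ints in blocks:
        out.extend(dequantize_block(scale, ints))
    return out


# Hypothetical weights: 64 small values, so two blocks of 32.
weights = [0.05 * ((i * 37) % 19 - 9) / 9 for i in range(64)]
blocks = quantize_tensor(weights)
restored = dequantize_tensor(blocks)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme no weight can land more than half a scale step from its original value, which is the sense in which the numbers written down stay close enough to the originals.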

Three things quantization is not. The confusions cost operators time.

  • Pruning removes weights (sets them to zero and exploits the sparsity to skip the multiplications those weights would have done). Quantization keeps every weight; it just writes each one with fewer bits. You can stack pruning and quantization, but they are separate levers.
  • Distillation trains a new, smaller model to imitate a larger one. The output is a fresh set of weights at a smaller parameter count. Quantization keeps the same parameter count and reencodes each weight. Distilled models can be further quantized, and usually are.
  • Low-rank factorization represents a weight matrix, or in the case of methods like LoRA a weight update, as the product of two smaller matrices, a structural change to what gets stored. Quantization does not change the shape of the weight matrix; it changes how many bits each entry takes.

A few pieces of vocabulary you will meet throughout the chapter.

Perplexity is the standard way to score quality after quantization. Think of it as how surprised the model is by the next word in a test passage, averaged across all the words of the test text. Lower is better, because less surprise means better prediction. A 1% perplexity bump from quantization is invisible outside careful A/B comparison. A 5% bump you feel. A 20% bump means something broke. Papers report perplexity on fixed test sets, usually WikiText-2 [a roughly two-million-token sample of Wikipedia articles, used as a shared yardstick across papers] or C4 [a cleaned web-crawl corpus of well over a hundred billion tokens, a slice of which is used the same way], so that numbers across methods can be compared cleanly. Downstream benchmarks (MMLU for general knowledge, HumanEval for code, GSM8K for grade-school math) matter more for real use but wobble more from run to run. Most of the literature reports both.
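The “averaged surprise” has a compact definition: perplexity is the exponential of the mean negative log-probability the model assigned to each true next token. The sketch below assumes you already have those per-token natural-log probabilities from some evaluation run; the helper names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity bump of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# A model that gives every true token probability 1/4 has perplexity 4:
# it is as uncertain as a fair choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

A baseline of 5.00 and a quantized score of 5.25 gives `relative_increase(5.0, 5.25)` of 0.05, the 5% bump described above as one you feel.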

Post-training quantization, or PTQ, takes an already-trained model and compresses its weights afterward, without further training. The producer ships the full-precision weights. Someone downstream, or the producer’s own quantization pipeline, runs the model over a few hundred or few thousand example texts [the calibration samples, used to measure the typical range and distribution of the numbers inside the model] and writes out a new file with each weight packed in fewer bits. PTQ is cheap, runs in hours on normal hardware, and is how nearly every GGUF file on Hugging Face was produced.

Quantization-aware training, or QAT, does something more expensive. It interleaves the quantization rounding into the training loop so the model learns to tolerate it. QAT costs GPU-days to GPU-weeks on serious hardware and produces better low-bit results than PTQ can. For most operators, PTQ is the only method inside budget. The distinction matters mostly because the 2-bit literature has been shifting toward QAT as PTQ hit its quality ceiling below 3 bits.

The shape of the accuracy cost as you drop bits matters more than any of the definitions above. Go from 16 bits per weight to 8 and almost nothing happens to the model, thanks to a 2022 result that solved the activation-outlier problem at 8 bits (Dettmers et al. 2022). Go from 8 to 4 and the cost becomes measurable but manageable. Go from 4 to 3 and the cost starts needing cleverness: outlier handling, rotation methods, careful calibration. Go below 3 and historically you fell off a cliff that only QAT could cross. The curve is not linear. The first halvings are nearly free. The last ones are not. Every good decision in this chapter starts with knowing which segment of that curve your workload lives on.

Back to the forty-page paper. If you quantize the model you use for the summary from 16 bits to 8, your card has half as many bytes to read per token, your summary arrives in roughly half the time, and you will not be able to tell the difference between the two outputs without squinting. If you go from 8 to 4, your summary arrives roughly twice as fast again, and the output is slightly rougher but still correct on the things that matter. If you go from 4 to 2 with a naive method, your summary comes back fluent-sounding but wrong in ways you may not catch. The question is how far down the curve you can go before your summary stops being the summary you wanted. The rest of this chapter is about answering that for your specific model and your specific card.

Bits-per-weight is the axis everything else hangs on. Memory footprint shrinks in step with it. Decode speed rises in step with it. Accuracy drops faster than in step with it below 4 bits. Picking a quantization level is, at heart, picking a point on that trade curve. Everything else is tooling.

22.2 Quantization levels from FP16 through Q2

Now the map of the terrain. Think of this as a menu the model producer hands you, with a rough file size and a rough quality label on each dish. You pick based on what fits your card and what your summary can tolerate.

The numbers below come from three places, stacked: the llama.cpp community’s methodology-disclosed WikiText-2 measurements (llama.cpp contributors 2023b), bartowski’s published quantization sweeps across many models (Bartowski 2026), and the original quantization papers where those papers measured the same format. One honest flag up front: the percentage-quality numbers are operator heuristics. They are not peer-reviewed point estimates. They come from community measurement with disclosed methodology, and they vary by model. Treat them as the right order of magnitude, not as precision claims.

  • FP16 (16 bits per weight, 100% memory, 100% quality). The baseline. The model producer’s reference copy. Each weight sits in memory as a 16-bit floating-point number [the way modern hardware stores 16-bit real numbers: one bit says whether the number is positive or negative, five bits say roughly how big it is, ten bits say its fine detail]. Everything else on this list is a compression of this.

  • BF16 (16 bits, same footprint). Same total size as FP16 with the bits split differently: more room to represent very small or very large magnitudes, less room for fine detail. Datacenters prefer it during training. For the memory axis you are buying a card to live on, treat it as interchangeable with FP16.

  • FP8 (8 bits, ~50% memory, ~99.5% quality on supported hardware). The 2025 datacenter serving default on Hopper and Blackwell. Two formats exist (one for weights, one for gradients during training), the math is done by dedicated hardware on the card, and the quality loss is tiny. Not typically available on consumer cards. You will meet FP8 if you rent time on a datacenter GPU; on the card in your machine, you will not.

  • Q8_0 / INT8 (8 bits, ~50% memory, ~99% quality). Close to lossless on almost every model worth running, thanks to LLM.int8() solving the outlier problem (Dettmers et al. 2022). Q8_0 is llama.cpp’s straightforward 8-bit format. It is the maximum-quality quantization level for an operator who wants small memory savings without having to think about method choice.

  • Q6_K (~6 bits, ~38% memory, ~97% quality). A K-quant family member with medium granularity. Measurable quality loss but small. Useful when Q8 does not quite fit and the work is accuracy-sensitive.

  • Q5_K_M (~5 bits, ~31% memory, ~95% quality). The “I want one level above the default” choice. Sits comfortably above the main 4-bit step.

  • Q4_K_M (~4.5 bits, ~28% memory, ~92% quality). The default. On a 7B-to-70B dense model [a model where every parameter gets used on every token, as opposed to mixture-of-experts models that only activate a fraction of their weights per token] running locally, reach for this first. The rest of this chapter leans on it. Q4_K_M’s specific trick, covered in Section 22.3, is that it keeps a few sensitive tensors at 5 or 6 bits while compressing the bulk at 4. That is why it often outscores naive 4-bit formats at the same file size.

  • IQ4_XS (~4 bits, ~25% memory, ~95% quality). An importance-matrix variant that uses activation statistics to decide which weights to preserve at higher fidelity. Often outscores Q4_K_M at a slightly smaller file size. The costs: slightly slower inference on some backends because the bit packing is non-uniform, and a longer quantization run at producer time because the importance matrix has to be computed from calibration data.

  • Q3_K_S (~3.5 bits, ~22% memory, ~88% quality). Noticeable degradation. Legitimate if the choice is “Q3 fits, Q4 does not, and I need this model today.” Not a steady-state default.

  • Q2_K and below (~2 to 3 bits, ~15% memory, quality cliff). Viable only with rotation-based methods or QAT. Naive 2-bit PTQ destroys models; the summary of your forty-page paper would come back looking like an adversarial example. EfficientQAT (Chen et al. 2025) is what makes a 2-bit Llama-70B a real object that fits on a 24 GB card. We come back to this in Chapter 24.
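The FP16 and BF16 entries at the top of the list describe how the same 16 bits get split into fields. A short sketch makes the split visible. It uses Python’s standard struct module, whose `e` format is IEEE half precision. The BF16 helper assumes the simplest conversion, truncating a 32-bit float to its top 16 bits (1 sign bit, 8 exponent bits, 7 detail bits); real converters usually round instead.

```python
import struct


def fp16_fields(value):
    """Split the half-precision encoding of value into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">H", struct.pack(">e", value))
    sign = bits >> 15                 # 1 bit: positive or negative
    exponent = (bits >> 10) & 0x1F    # 5 bits: roughly how big
    mantissa = bits & 0x3FF           # 10 bits: fine detail
    return sign, exponent, mantissa


def bf16_fields(value):
    """Same split for BF16, taken as the top 16 bits of a float32 (truncation)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    bits = bits32 >> 16
    sign = bits >> 15                 # 1 bit
    exponent = (bits >> 7) & 0xFF     # 8 bits: wider range than FP16
    mantissa = bits & 0x7F            # 7 bits: less fine detail than FP16
    return sign, exponent, mantissa


one_fp16 = fp16_fields(1.0)
neg_fp16 = fp16_fields(-2.5)
one_bf16 = bf16_fields(1.0)
```

The extra exponent bits are what give BF16 its wider range; the three mantissa bits it gives up are the fine detail it loses.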

The shape of the curve, one more time in plain English. Halve the bits per weight, roughly halve the memory, roughly double the decode speed, pay an accuracy cost that grows faster than linearly below 4 bits. 16 to 8 is nearly free. 8 to 4 is cheap. 4 to 3 is expensive enough to need the right method. 3 to 2 requires the right method and the right training procedure, and even then the loss is real.
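The memory half of that curve is plain arithmetic, and it is worth doing once by hand. The sketch below uses approximate bits-per-weight figures from the menu above and counts weights only: no KV cache, no runtime buffers. The dictionary and function names are illustrative.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes), ignoring all runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Approximate bits per weight, taken from the menu in this section.
APPROX_BITS = {
    "FP16": 16.0,
    "Q8_0": 8.0,
    "Q6_K": 6.0,
    "Q5_K_M": 5.0,
    "Q4_K_M": 4.5,
    "2-bit": 2.0,
}

# Footprint table for a 14B-parameter model.
table_14b = {name: weight_memory_gb(14e9, bits) for name, bits in APPROX_BITS.items()}

# The 2-bit 70B case from the bottom of the menu.
seventy_b_2bit = weight_memory_gb(70e9, 2.0)
```

A 70B model at 2 bits comes to 17.5 GB of weights, which is why it is the case that lands on a 24 GB card. The same model at Q4_K_M is roughly 39 GB and does not.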

Hold the paper-summary task next to this menu. At Q8, your summary is indistinguishable from the FP16 version, and the file is half the size. At Q4_K_M, the summary is barely distinguishable, and the file is a quarter the size, which is what lets the model fit on the card at all. At Q3, the summary starts showing its seams: sentences that sound like summary sentences but do not quite track the paper’s argument. At naive Q2, the summary is fluent nonsense. The curve is why Q4_K_M became the operator default.

Q4_K_M is the default, and the word “default” is load-bearing. Reach for it first on a 7B-to-70B dense model running locally. If Q5_K_M fits with room for context, use it; the accuracy recovery is noticeable. If Q4_K_M does not fit, drop to IQ4_XS before you drop to Q3. Never run Q3 or Q2 with a naive PTQ method when the work matters. Every other claim in this chapter is a refinement on top of this.

The Synoros research article (Costantino 2026) presents this as a single table and calls it a day. The book’s job is to explain why the table looks the way it does, which is the next section.

22.3 K-quant internals, or why Q4_K_M is not just “4-bit”

Here is a gap in the literature an operator-first book can use. The K-quant family, the format behind most GGUF files on Hugging Face, has no peer-reviewed paper defining it. The authoritative source is the llama.cpp repository itself (llama.cpp contributors 2024) and the community discussion threads that debated and refined the design (llama.cpp contributors 2023b). Every llama.cpp inference run on a K-quant file decodes K-quant blocks many times per token. The formal definition lives in the C++ source and the commit history, not in any conference proceedings.

We treat this as a feature of the operator framing rather than a bug. The deployed system is the reference. With that said, here is what the K-quants are doing inside your paper-summary run.

22.3.1 Block-wise grouping with super-blocks

Start with the simplest version of the problem. You have a big grid of numbers (a weight matrix). You want to write each number in four bits instead of sixteen. Four bits gives you sixteen possible values. You pick a single scale factor, round every number in the matrix to the nearest of those sixteen steps, and save the scale alongside the compressed numbers.

This works until you look at what the numbers actually look like in a trained transformer. Most of the weights are small. A handful are much larger than the rest, often by an order of magnitude. Force everything into the same sixteen-value grid sized for the big ones and the small ones all round to zero. Size the grid for the small ones and the big ones get clipped. This is the outlier problem [the problem that a handful of weights carry magnitudes far outside the bulk of the distribution, making one-size-fits-all scaling fail]. LLM.int8() (Dettmers et al. 2022) showed that isolating the outliers was what made 8-bit quantization work.
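The outlier problem is easy to reproduce on a toy list. The sketch below builds 64 hypothetical weights with magnitudes up to 0.03, plants one outlier of 0.9 among them, and rounds to a signed 4-bit grid twice: once with a single scale for the whole list, once with a separate scale for each block of 32. The numbers are made up to show the effect, not measured from any model.

```python
def quantize_dequantize(weights, bits=4):
    """Round weights to a signed grid sized for the largest magnitude, then decode."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [scale * round(w / scale) for w in weights]


def count_lost(originals, decoded):
    """Nonzero weights that came back as exactly zero."""
    return sum(1 for w, q in zip(originals, decoded) if w != 0 and q == 0)


# Hypothetical bulk: values in {-0.03, ..., 0.03}, plus one outlier 30x larger.
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[5] = 0.9

one_scale = quantize_dequantize(weights)

per_block = []
for start in range(0, len(weights), 32):
    per_block.extend(quantize_dequantize(weights[start:start + 32]))

# Look only at the second block, which does not contain the outlier.
lost_with_one_scale = count_lost(weights[32:], one_scale[32:])
lost_with_block_scales = count_lost(weights[32:], per_block[32:])
nonzero_in_second_block = sum(1 for w in weights[32:] if w != 0)
```

With one shared scale, every nonzero weight in the outlier-free block rounds to zero. With per-block scales, none of them do. The block that holds the outlier still suffers, which is the motivation for the finer-grained tricks in the rest of this section.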

K-quants take the same intuition and apply it through grouped scales. Chunk the weight matrix into small blocks of 32 weights each. Each block gets its own scale factor. Group the blocks themselves into super-blocks of eight (so 256 weights per super-block) that share a second, higher-precision scale. The per-block scale is encoded tightly at low precision. The super-block scale is encoded at higher precision, once, amortized across its blocks. At runtime, each block gets decoded using its own scale and the super-block’s scale together.

Two useful things happen. First, a block that happens to hold a large outlier can stretch its scale without forcing every other block to use the same range. The outlier survives; the majority does not lose precision accommodating it. Second, the nested scale encoding is compact, because the super-block pays the precision price for its high-quality scale only once. The effective precision ends up close to what naive quantization would get with an extra bit or two per weight, for the same file size.
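The “compact” claim can be checked with arithmetic. The sketch below assumes the Q4_K layout as described in the llama.cpp K-quant notes, which is more specific than the text above: 8 blocks of 32 weights, a 6-bit scale and a 6-bit minimum per block, and two 16-bit values per super-block. It compares that to a flat scheme where every block of 32 carries its own pair of 16-bit values. Treat the layout numbers as assumptions to verify against the source.

```python
def bits_per_weight(weights_per_block, blocks_per_super, weight_bits,
                    block_meta_bits, super_meta_bits):
    """Average storage per weight, metadata included."""
    n_weights = weights_per_block * blocks_per_super
    total_bits = (n_weights * weight_bits
                  + blocks_per_super * block_meta_bits
                  + super_meta_bits)
    return total_bits / n_weights


# Assumed Q4_K layout: 6-bit scale + 6-bit min per block, two fp16 values per super-block.
nested = bits_per_weight(32, 8, 4, block_meta_bits=6 + 6, super_meta_bits=16 + 16)

# Flat alternative: each block of 32 stores its own fp16 scale and fp16 min.
flat = bits_per_weight(32, 1, 4, block_meta_bits=16 + 16, super_meta_bits=0)
```

Under these assumptions the nested layout costs 4.5 bits per weight and the flat one costs 5.0. That 4.5 is where the “~4.5 bits” figure for Q4_K_M’s bulk tensors comes from.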

22.3.2 The _S, _M, _L suffixes

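The suffixes select a precision mix across tensors; they do not change the block layout. *_S* is the small variant: nearly every tensor stays at the base K-quant type, which gives the smallest file and slightly lower quality. *_M* is the balanced default, and it is where Q4_K_M lives: a subset of sensitive tensors is stored at a higher-bit K-quant type. *_L* raises the precision of more tensors again and produces a larger file with slightly higher quality. The exact tensor lists live in the llama.cpp source and have shifted between releases. For Q4_K, the gap between _S and _M is measurable on hard evaluations but small in everyday use. For Q2_K and Q3_K, the gap grows, because at low bit widths every tensor left at the base type costs more quality, and the _S variants start to show their limits.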

22.3.3 The specific Q4_K_M trick

Q4_K_M’s quality edge over naive 4-bit comes from a targeted precision uplift on a small set of load-bearing tensors [the rectangular arrays of numbers that make up each layer’s weights]. The attention value projections and the final output projection (the lm_head that produces logits [the pre-softmax scores the model assigns to each possible next token], see Section 3.3) get quantized at 5 or 6 bits per weight. The bulk of the model stays at 4.

Here is why those specific tensors. The output projection’s error affects every token’s logit vector, which means every sampling decision downstream. The value projection’s error affects every attention-weighted mixture, which means every place downstream that uses attention. Errors in these tensors propagate widely; errors elsewhere tend to average out. Spending a handful of extra bits on the wide-propagating ones buys back a large chunk of the quality a flat 4-bit quant would have lost. The file grows by maybe 10%. The perplexity recovery is often several percentage points.
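The file-size side of that trade is a weighted average. The sketch below assumes a hypothetical split in which 20% of the parameters sit in uplifted tensors stored at about 6.56 bits per weight (the figure llama.cpp quotes for Q6_K) and the rest stay at 4.5. The 20% share is made up for illustration; the real share depends on the architecture and the llama.cpp version.

```python
def average_bits(mix):
    """mix: list of (fraction_of_parameters, bits_per_weight); fractions must sum to 1."""
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in mix)


flat_4bit = average_bits([(1.0, 4.5)])

# Hypothetical mix: 80% of parameters at 4.5 bpw, 20% uplifted to 6.5625 bpw.
with_uplift = average_bits([(0.8, 4.5), (0.2, 6.5625)])

growth = with_uplift / flat_4bit - 1.0
```

Under this assumed split the average rises from 4.5 to about 4.9 bits per weight, a file-size increase of roughly 9%, in line with the “maybe 10%” above.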

That is why Q4_K_M outscores plain Q4_0 (which quantizes every tensor uniformly) and often outscores GPTQ at 4-bit on the same model. Q4_K_M is not algorithmically cleverer than GPTQ. It is operationally smarter. It spends precision where the error budget is tight, and saves bits where the model is robust.

IQ4_XS improves on this in a different direction. The “I” stands for importance matrix [a table of importance scores computed by running calibration text through the model and recording how large the activations feeding each weight tend to be]. Weights fed by consistently large activations get preserved at higher effective precision. Weights fed by small activations get compressed more aggressively. The result is quality per bit comparable to Q5_K_M at a file size closer to Q4_K_M. The costs are compute at quantization time (a calibration pass to build the importance matrix) and slightly slower decoding at runtime because the bit packing is non-uniform.

For the paper-summary task, the Q4_K_M trick is why the summary you get is usable even at 4 bits per weight. The attention value projections are exactly what the model uses to decide which passages of the paper are relevant to each bullet it writes. If those tensors had been compressed to 4 bits with everything else, the model’s attention would start pointing at wrong sentences, and the summary would start drifting. The precision uplift on those specific tensors is what preserves the “look at page thirty-two when writing the bullet about page thirty-two” move from Section 1.1.

The K-quants are a community invention, cited here from source rather than paper. No peer-reviewed reference defines them. The llama.cpp repository (llama.cpp contributors 2024), the Discussion #406 thread (llama.cpp contributors 2023b), and bartowski’s methodology-disclosed quantization sweeps (Bartowski 2026) are the authoritative sources. For an operator, this is not a weakness. Those artifacts are closer to the running system than any paper could be. For a textbook, it is a gap we name explicitly rather than paper over.

22.4 GPTQ, AWQ, SpQR, and the post-training quantization family

The K-quants cover the CPU-plus-GPU generic case via llama.cpp and GGUF. The pure-GPU serving path has its own ecosystem, invented separately, and four methods in that ecosystem show up in 2026 production workloads. Each has a sweet spot. None has swallowed the others.

GPTQ (Frantar et al. 2023) is the workhorse. Here is the idea. When you round a weight from 16 bits to 4, you introduce a small error in that one number. Naive rounding minimizes the error in each weight considered alone. But the model does not care about weights alone; it cares about what the layer outputs when you multiply a bunch of weights together against an input. Errors in different weights can cancel each other out, or they can pile up. Which they do depends on how the weights interact in the multiplication. GPTQ uses second-order information [a table of second derivatives that captures how pairs of weights interact to shape the layer’s output, called the Hessian], computed from activations on the calibration samples, to decide which weights to round up and which to round down, and in what order, so that the errors cancel at the layer’s output rather than pile up. On a 70B model, GPTQ runs in a few GPU-hours and produces 3- or 4-bit weights with a few percentage points of perplexity increase over full precision. It was the first method to make sub-4-bit quantization usable at production scale.
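The core claim, that the best rounding for the layer’s output is not always the nearest rounding for each weight, can be shown on a toy without any of GPTQ’s machinery. The sketch below is not GPTQ: it brute-forces every round-up or round-down combination for four hypothetical weights and keeps the one with the smallest output error on three made-up calibration inputs. GPTQ reaches a similar goal at scale using the Hessian rather than enumeration.

```python
import itertools
import math


def output_error(w, w_hat, inputs):
    """Sum over calibration inputs of the squared error in the dot product."""
    return sum(
        sum((a - b) * x for a, b, x in zip(w, w_hat, xs)) ** 2
        for xs in inputs
    )


def round_nearest(w, step):
    """Naive rounding: each weight goes to its own nearest grid point."""
    return [step * round(v / step) for v in w]


def best_rounding(w, step, inputs):
    """Try every up/down choice per weight; keep the lowest output error."""
    choices = [(step * math.floor(v / step), step * math.ceil(v / step)) for v in w]
    return list(min(itertools.product(*choices),
                    key=lambda cand: output_error(w, cand, inputs)))


# Hypothetical weights and calibration inputs; channels 0 and 1 move together.
w = [0.26, 0.24, -0.13, 0.38]
calibration = [
    [1.0, 1.0, 0.2, 0.1],
    [0.9, 1.1, -0.1, 0.3],
    [1.1, 0.9, 0.0, -0.2],
]

naive = round_nearest(w, 0.1)
searched = best_rounding(w, 0.1, calibration)
naive_err = output_error(w, naive, calibration)
searched_err = output_error(w, searched, calibration)
```

The searched rounding can never do worse than the naive one on the calibration inputs, because nearest rounding is one of the combinations it tries. Enumeration costs two to the power of the number of weights, which is why a real method needs something smarter.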

AWQ (Lin et al. 2024), the MLSys 2024 best paper, comes at the problem from a different angle. Here is the angle. Most weights are equally (un)important. A small subset, roughly 1%, dominates the quality budget, specifically the weights in the input channels that see the largest activation magnitudes on calibration data. Treat them specially. AWQ scales those salient channels up before quantization (so they land on a finer grid at the same bit width), and scales the corresponding activations down by the same amount at matmul time. The math cancels: at full precision, the scaling is a no-op. At low bit widths, the salient channels get more effective resolution than they would have otherwise. AWQ runs faster at quantization time than GPTQ and often lands slightly better quality at 4-bit on production models. It is the default 4-bit method for vLLM and TGI serving workloads.
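The scaling trick is easier to believe once you watch it work on four numbers. The sketch below assumes one group of four weights sharing a 4-bit scale, one input channel that sees activations twenty times larger than the others, and a hand-picked scale factor of 2 for that channel. AWQ searches for its scale factors from calibration statistics; nothing here reproduces that search, and all values are invented for illustration.

```python
def quantize_group(weights, bits=4):
    """Round a group of weights to a signed grid sized for its largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [scale * round(w / scale) for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.9, -0.7, 0.19, 0.4]   # one group sharing a scale
acts = [1.0, 1.0, 20.0, 1.0]       # channel 2 sees large activations
salient, s = 2, 2.0                # hand-picked, not searched

exact = dot(weights, acts)
plain = dot(quantize_group(weights), acts)

# Scale the salient weight up and its activation down by the same factor.
scaled_w = list(weights)
scaled_w[salient] *= s
scaled_x = list(acts)
scaled_x[salient] /= s

unquantized_scaled = dot(scaled_w, scaled_x)          # same as exact
awq_like = dot(quantize_group(scaled_w), scaled_x)    # quantize after scaling

plain_error = abs(plain - exact)
awq_like_error = abs(awq_like - exact)
```

The trick works here because scaling the salient weight does not change the group’s largest magnitude, so the grid step stays the same while the salient weight’s rounding error, measured in original units, is divided by the scale factor.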

SpQR (Dettmers et al. 2024) pushes the outlier-isolation idea one step further. Instead of protecting a class of weights with scaling, SpQR stores the outlier weights at full precision in a sparse auxiliary representation. Typically this is the 1 to 3% of weights that are most sensitive to rounding. The remaining 97 to 99% get compressed at 3 or 4 bits. The sparse outlier storage adds a small amount of overhead, the compressed bulk does most of the work, and the result is near-lossless quantization at 33B on a 24 GB consumer card. SpQR held the “how low can PTQ go?” frontier before the rotation methods (see Chapter 24) took it over.

QServe / QoQ (Lin et al. 2025) is the production-serving shape of this family. It is a co-designed algorithm plus kernel system, specifically W4A8KV4: 4-bit weights, 8-bit activations, 4-bit KV cache, with custom CUDA kernels that keep all three in their native precision through the matmul. The algorithm side is a refinement of AWQ; the kernel side is where the throughput comes from. On A100s and L40S cards, QServe reports 1.2 to 3.5 times the throughput of TensorRT-LLM on standard 7B-to-72B models. This is the template for 2025’s high-end serving: algorithm-plus-kernel co-design, not algorithm-on-top-of-generic-kernel.

Let us run the Sagan skeptical check on those QServe numbers before we move on (Sagan 1995). Who is running the benchmark? The QServe authors. Is it reproducible? The paper ships code, the community has partially confirmed the speedup, but the exact 1.2-to-3.5x range is setup-dependent and has not been independently retested at the full range of model sizes. Does a simpler explanation fit? The algorithm-plus-kernel co-design explanation is plausible and mechanically specific, and not “magic speedup appears.” Who benefits from it being true? The research group, the backing cloud vendors, and the open-source serving stacks that adopt it. Read the number as “a real speedup exists in the right ballpark; your specific deployment may land anywhere in the 1.2x to 3.5x range.” That is the right shape for the claim.

The decision matrix across these methods, for your paper-summary workload and for the workloads adjacent to it:

  • If you run locally on llama.cpp, Ollama, or anything GGUF-based, you are running K-quants. CPU-and-GPU support is unmatched, and the Q4_K_M sweet spot is real. Stick with K-quants unless you have a specific reason to leave.
  • If you run on vLLM, TGI, or SGLang with NVIDIA GPUs, AWQ is the mainstream 4-bit choice. GPTQ is the older, widely supported alternative; some models only have GPTQ quants published.
  • If you need near-lossless at 3 or 4 bits and can tolerate a specialty runtime, SpQR still holds its own.
  • If you are building a datacenter serving stack and want every last bit of throughput, QServe’s co-designed W4A8KV4 is the 2026 production frontier.
  • If you need below 4 bits and care about quality, go to Chapter 24.

No single method wins on every axis. Any book that canonicalizes one of them will read wrong within a year or two. The landscape is plural, and expected to stay plural.

22.5 The RAM performance cliff

The second benefit of quantization, after the accuracy-preserving memory savings, is what it makes possible in terms of fit. Whether the model fits on the card, or spills, is the single most important decision in local inference. Almost every operator meets the cliff in the first week, and most of them meet it wrong.

Here is what is strange about local inference. A 16 GB consumer card’s on-card memory moves data at roughly 300 to 900 gigabytes per second, depending on generation. The wire between the card and the rest of the computer moves data at roughly 30 gigabytes per second on a PCIe 4.0 x16 slot. If you want a physical comparison, 900 GB/s is the combined bandwidth of several thousand hard drives reading at once; 30 GB/s is a handful of fast NVMe SSDs. The card reading from its own memory feels like reading from a firehose. The card reading from system memory over the wire feels like sipping through a straw. Same card. Same model. A factor of ten to thirty apart on how fast data gets to where it is needed.

That gap is the RAM performance cliff. Now the mechanics of why it matters so much.

The physics are the same ones from Section 4.3. Generating the next word in your paper summary [autoregressive decode, the token-by-token generation loop the model runs to produce each new word] is limited by how fast the card can read its own weights, because each new token requires a full pass over every layer of the model. If the weights live on the card, they stream at hundreds of gigabytes per second, the card stays busy, and tokens come out fast. If even a handful of layers live on system RAM, those layers must stream across the thin wire on every single token. The card stalls on them every time. The layers that live on the card do not help; they have to wait for the slow ones. FlexGen (Sheng et al. 2023) was the canonical study of offloading and showed that clever scheduling can hide some of the cost for batch workloads where many requests share the offload traffic. For single-user decode, the scheduling tricks do not save you. The pipeline is inherently serial. PCIe is the ceiling.

The practical shape of the cliff, from community measurement on consumer hardware with llama.cpp (llama.cpp contributors 2023a) and what I have seen on the 9060 XT rig:

  • Model fully in VRAM. All layers live on the GPU. Decode runs at the bandwidth-bound ceiling, typically tens of tokens per second on a 16 GB card running a 9B model at Q4_K_M. This is the baseline.
  • A few layers offloaded (say, 11 of 36). Decode rate drops roughly 5x, not because the offloaded layers are individually slow, but because every token now has to round-trip to system RAM for those layers. The GPU sits idle waiting for data.
  • Most layers offloaded (say, 30% resident). Decode drops 15x or more. The GPU is a PCIe-bottlenecked accelerator for what is effectively a CPU inference run.
  • Full spill (0 layers on GPU). The GPU is a spectator. You are running on CPU-and-system-RAM. Typically single-digit tokens per second, unusable for interactive work.

Why the cliff is steep, in one sentence: each token needs every layer, so once any layer requires a PCIe round trip, that round trip is in the hot path for every token. The card streams its resident layers fast. That speed does not help, because it is gated on the slow one.
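The one-sentence version turns into a two-term formula: time per token is the resident bytes divided by on-card bandwidth, plus the offloaded bytes divided by link bandwidth. The sketch below follows this section’s streaming picture and ignores compute time, latency, and the KV cache, so it shows the shape of the cliff and not measured speeds. The 5 GB model size and both bandwidth figures are hypothetical.

```python
def tokens_per_second(model_gb, resident_fraction, vram_gb_per_s, link_gb_per_s):
    """Bandwidth-only decode estimate: every token reads every weight once."""
    resident_seconds = model_gb * resident_fraction / vram_gb_per_s
    offloaded_seconds = model_gb * (1.0 - resident_fraction) / link_gb_per_s
    return 1.0 / (resident_seconds + offloaded_seconds)


# Hypothetical setup: 5 GB of weights, 320 GB/s on-card memory, 30 GB/s link.
MODEL_GB, VRAM_BW, LINK_BW = 5.0, 320.0, 30.0

fully_resident = tokens_per_second(MODEL_GB, 1.0, VRAM_BW, LINK_BW)
eleven_of_36_off = tokens_per_second(MODEL_GB, 25 / 36, VRAM_BW, LINK_BW)
mostly_off = tokens_per_second(MODEL_GB, 0.3, VRAM_BW, LINK_BW)
```

Even this optimistic model loses about three quarters of its speed once 11 of 36 layers cross the link. The community measurements above are steeper still, because real runs also pay for latency and stalls that the formula leaves out.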

Come back to the paper-summary task. You have the forty-page paper loaded into the prompt. You want the five bullet points. If your card holds the full model, the first bullet starts appearing in a second or two and the full summary takes fifteen to thirty seconds. If your card holds most but not all of the model, the first bullet takes ten seconds to start, and the full summary takes a couple of minutes. If your card holds a small minority of the model, you walk away from the keyboard, make coffee, come back, and the summary is still running. The model that would not fit in VRAM is doing the same computation, sending the same tokens to the same output stream. It is doing it one PCIe round trip at a time, and the wire that carries the round trips is tiny.

This is the single most underappreciated fact about local inference, and the reason this section exists as its own chapter-internal topic. The quantization level you pick is not just a quality knob. It is the knob that decides whether your chosen model lands on the comfortable side of the cliff or the ugly side. A 14B model at Q8 is about 15 GB of weights and spills on a 16 GB card once the KV cache and runtime buffers are added; the same 14B at Q4_K_M is about 8 GB and fits with room for context. The Q4_K_M version is unambiguously better, because the Q8 version is not running at Q8 speed. It is running at PCIe speed.

Pick the largest model that fits in VRAM with one to two gigabytes of headroom, not the largest model available. A 9B at Q4_K_M that fits comfortably and generates at 40 tok/s will feel faster and produce better sessions than a 14B at Q4 that spills 2 GB and generates at 3 tok/s. Quality per token does not rescue you from a 10x tok/s cliff on an interactive workload. The arithmetic is in your favor on the smaller model almost every time.
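The rule of thumb can be written as a small selection function. The sketch below walks a menu from highest quality down and returns the first level whose weights fit inside the card after reserving room for context and headroom. The menu uses the approximate bits-per-weight figures from Section 22.2. The context reservation is a number you supply, and the 2 GB used in the examples is a placeholder, not a measurement.

```python
# Approximate bits per weight, ordered from highest quality to lowest.
MENU = [("Q8_0", 8.0), ("Q6_K", 6.0), ("Q5_K_M", 5.0), ("Q4_K_M", 4.5)]


def pick_quant(n_params, vram_gb, context_gb, headroom_gb=1.5, menu=MENU):
    """Return (name, weight_gb) for the best level that fits, or None if none does."""
    budget_gb = vram_gb - context_gb - headroom_gb
    for name, bits in menu:
        weight_gb = n_params * bits / 8 / 1e9
        if weight_gb <= budget_gb:
            return name, weight_gb
    return None


# Hypothetical cases, each reserving 2 GB for context.
choice_9b_16gb = pick_quant(9e9, vram_gb=16, context_gb=2.0)
choice_14b_16gb = pick_quant(14e9, vram_gb=16, context_gb=2.0)
choice_70b_24gb = pick_quant(70e9, vram_gb=24, context_gb=2.0)
```

When the function returns `None`, as it does for the 70B case on a 24 GB card, the answer from this section is to pick a smaller model, not to offload.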

One nuance for Apple Silicon and other unified-memory architectures: the cliff does not exist the same way there. Unified memory means there is no PCIe hop between “GPU-resident” and “CPU-resident” weights. They live in the same physical pool, and the bandwidth is whatever the unified-memory bus can deliver (roughly 400 GB/s on Max-class chips, roughly 800 GB/s on M2 and M3 Ultras). A 70B model that will not fit on any consumer discrete GPU runs usefully on an M2 Ultra with 192 GB unified memory (Rajput et al. 2025). The accounting is different enough that Chapter 3 flagged it and Chapter 23 will return to it in depth. For discrete GPUs with distinct VRAM and system RAM, the cliff is real, steep, and the dominant argument for reaching for a smaller quant before reaching for offloading.

22.6 When to reach for QAT instead

Everything above is post-training quantization (PTQ), where someone takes a finished model and compresses it afterward. For almost every operator running a published model on their own hardware, PTQ is the only option. Quantization-aware training (QAT) requires the original model weights plus training infrastructure plus training data plus GPU time, a combination a downstream operator rarely has. But there are three cases where QAT is worth reaching for.

You are quantizing to 2 or 3 bits and you need the quality back. PTQ below 3 bits has a quality cliff that rotation-based methods (QuaRot (Ashkboos et al. 2024), SpinQuant (Liu, Zhao, et al. 2024), TurboQuant (Zandieh et al. 2026)) have partially bridged, but not closed. QAT still wins at 2 bits. EfficientQAT (Chen et al. 2025) produced a 2-bit Llama-2-70B on a single A100 in 41 hours with under 3 points of accuracy degradation relative to full precision. That is the current state-of-the-art result for 2-bit at 70B scale. It is a real, running model whose weights fit in 24 GB of memory. A PTQ’d 2-bit Llama-70B is not usable.

You are deploying to fixed edge hardware and the quality loss from PTQ is unacceptable at serving time. Edge workloads that need 4-bit or lower quantization often justify the QAT overhead because the hardware is fixed, the model is deployed many times, and per-deployment quality matters more than per-model quantization time.

You are the model producer. If you train the model, you can bake QAT into the training pipeline, which amortizes the cost across every user. Meta’s LLM-QAT (Liu, Oguz, et al. 2024) is a data-free QAT method designed for this case; it produces a quantization-robust model as part of training rather than after.

For the paper-summary workload specifically, QAT matters only if you are trying to run a 70B-parameter model on a 24 GB card at 2 bits, because that is the specific corner where PTQ gives up and QAT rescues you. If you are running a 9B or 14B model on a 16 GB card at Q4_K_M, everything you need is in PTQ. If you are running a 70B on 24 GB and you want the quality of a bigger model on cheaper hardware, EfficientQAT is the name to start from.
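
The arithmetic behind that corner is short enough to check by hand. The sketch below assumes nominal bit widths with no allowance for scales or zero points (which add a fraction of a bit per weight in practice) and a hypothetical 3 GB reserved for framework overhead and KV cache; both function names are illustrative.

```python
def weight_gb(params_billion, bits_per_weight):
    """Nominal weight footprint in GB: parameters times bits, over 8 bits per byte."""
    return params_billion * bits_per_weight / 8.0


def max_bits_per_weight(params_billion, weight_budget_gb):
    """Largest nominal bit width whose weights fit in the given budget."""
    return weight_budget_gb * 8.0 / params_billion


CARD_GB = 24.0
RESERVED_GB = 3.0  # assumption: framework overhead plus KV cache
budget_gb = CARD_GB - RESERVED_GB

print(f"70B at 4 bits: {weight_gb(70, 4):.1f} GB")
print(f"70B at 2 bits: {weight_gb(70, 2):.1f} GB")
print(f"bits that fit: {max_bits_per_weight(70, budget_gb):.2f}")
```

Under these assumptions a 70B model needs 35 GB at 4 bits and 17.5 GB at 2 bits, and the budget allows a little over 2 bits per weight. That is why 2-bit is the only width in play for a 70B on a 24 GB card, and why the method that makes 2-bit usable is the one that matters there.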

The 2025 QAT frontier (ZeroQAT, StableQAT, RCP, DL-QAT) is pushing the 2-bit boundary further. These methods are research-stage as of April 2026; production-grade 2-bit deployment outside the EfficientQAT result is not yet common. Chapter 24 covers the frontier in more detail. The short version: if you are not the model producer and your workload is above 3 bits, QAT is not for you. If you are reaching for 2-bit on large models and you have A100-hours to spend, EfficientQAT is where to start.

22.7 Choosing a quantization level in practice

Now put the pieces together into a decision flow. You are sitting in front of the card in your machine. You want to run the paper-summary task on a model you downloaded from Hugging Face. Which quantization do you pick?

  1. Measure your VRAM budget. Total VRAM, minus framework overhead (1 to 3 GB for llama.cpp or Ollama), minus the memory the context window will eat for the KV cache at your target context length (see Chapter 23). The leftover is your weight budget.

  2. Find the largest model whose Q4_K_M weights fit in that budget with 1 to 2 GB of headroom. Do not reach past the cliff. A 14B at Q4_K_M that squeezes in with 100 MB to spare will crash under context pressure or kick layers to RAM. The 9B at Q4_K_M with comfortable headroom is the better choice. The forty-page paper’s KV cache alone, on a 9B model at 8K context, runs a gigabyte or more; leave room.

  3. If Q4_K_M fits with significant headroom, consider Q5_K_M or Q6_K on the same model. The quality uplift is measurable. The tok/s loss is modest and roughly tracks the extra file size (you are still bandwidth-bound, just at a bigger weight footprint). For accuracy-sensitive work (code, structured extraction, RAG with retrieval verification), higher precision is worth the smaller context room.

  4. If Q4_K_M does not fit comfortably, try IQ4_XS before dropping to Q3. IQ4_XS is a smaller file than Q4_K_M with comparable quality. Operators who stop at the canonical K-quant list tend to overlook it.

  5. If IQ4_XS does not fit either, step down a model size before you step down a bit width. A 9B at Q4_K_M will almost always beat a 14B at Q3_K_S on real work. The K-quant tables understate how bad 3-bit quality gets on hard tasks. Your forty-page summary will come back fluent-sounding and wrong at Q3 in ways it will not at Q4_K_M on the smaller model.

  6. Q3 and below are emergency fits only, unless you are reaching for a rotation-based method or a QAT variant. For those, see Chapter 24.

  7. For GPU-only serving on vLLM or TGI, use AWQ or GPTQ at 4-bit. K-quants live in the llama.cpp path; AWQ is the mainstream choice for NVIDIA serving. If the model you want only has GPTQ quants published, use that. Quality differences are model-dependent and often smaller than the difference in which 4-bit method happens to be available for your model.

  8. For W4A8KV4 production serving on datacenter hardware, QServe is the 2026 leading edge. This is outside most operators’ scope, but worth naming because it is the direction high-end serving is heading.
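
Steps 1 through 5 above can be sketched as a small selection routine. This is one possible encoding of the flow, not a tool from llama.cpp or Ollama: the `Model` class, the function names, the catalog entries, and every file size are hypothetical and stand in for whatever GGUF files you have actually downloaded.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    sizes_gb: dict  # quant name -> GGUF file size in GB


def weight_budget_gb(total_vram_gb, overhead_gb, kv_cache_gb):
    """Step 1: what is left for weights after overhead and KV cache."""
    return total_vram_gb - overhead_gb - kv_cache_gb


def pick_quant(models, budget_gb, headroom_gb=1.5):
    """Steps 2 to 5. `models` must be ordered largest first.

    Returns (model name, quant name), or None if nothing fits at 4 bits.
    """
    usable_gb = budget_gb - headroom_gb
    for model in models:
        q4 = model.sizes_gb.get("Q4_K_M")
        if q4 is not None and q4 <= usable_gb:
            # Step 3: take a higher-precision quant if it still fits.
            for quant in ("Q6_K", "Q5_K_M"):
                size = model.sizes_gb.get(quant)
                if size is not None and size <= usable_gb:
                    return model.name, quant
            return model.name, "Q4_K_M"
        # Step 4: try IQ4_XS before giving up on this model size.
        iq4 = model.sizes_gb.get("IQ4_XS")
        if iq4 is not None and iq4 <= usable_gb:
            return model.name, "IQ4_XS"
        # Step 5: fall through to the next smaller model, not to Q3.
    return None


# Hypothetical catalog; sizes are placeholders, not measured file sizes.
CATALOG = [
    Model("example-14b", {"Q6_K": 12.1, "Q5_K_M": 10.5, "Q4_K_M": 9.0, "IQ4_XS": 8.2}),
    Model("example-9b", {"Q6_K": 7.6, "Q5_K_M": 6.6, "Q4_K_M": 5.8, "IQ4_XS": 5.2}),
]

budget = weight_budget_gb(total_vram_gb=16.0, overhead_gb=2.0, kv_cache_gb=1.5)
print(pick_quant(CATALOG, budget))
```

The routine deliberately never returns a Q3 quant. Per step 6, that is a decision to make by hand, knowing what you are giving up.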

Quantization is a decision matrix, not a ranking. Q4_K_M is the right default on a consumer local llama.cpp rig. AWQ is the right default on an NVIDIA vLLM serving setup. QuaRot or SpinQuant is the right call for end-to-end 4-bit activation and KV quantization. TurboQuant is the right call for data-free online KV quantization on long context. EfficientQAT is the right call for 2-bit model production. Each is correct in its domain. A book that canonicalizes one of them over the others would be wrong within a year.

22.8 First-party measurement: a seven-level sweep on Qwen3-8B

Appendix 3 runs the full sweep across seven GGUF quantization levels of Qwen3-8B on a rented RTX A5000, with throughput numbers from llama-bench and perplexity numbers from llama-perplexity against the WikiText-2 test split. The headline result on that hardware: Q4_K_M lands at ~58 percent of the Q8_0 file size, runs prompt-processing at ~96 percent of Q8_0’s rate, generates tokens ~43 percent faster, and pays roughly 0.07 perplexity points (within the test’s own noise floor) for those gains. The quality cliff sits between Q3 and Q4, not at Q4 and not at Q5. The shape of that cliff is the load-bearing claim; the absolute numbers are tied to one specific hardware/runtime stack and will shift on others.
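
For readers who have not met the metric before, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the test text. The sketch below shows that standard definition on toy values; the function name and the log-probabilities are illustrative and are not taken from the Appendix 3 sweep.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy example: a model that gives every correct token probability 0.25
# is as uncertain as a uniform choice among four options.
toy_logprobs = [math.log(0.25)] * 4
print(perplexity(toy_logprobs))
```

Lower is better, and the comparison that matters for a quantization sweep is the difference between a quantized model's perplexity and the baseline's on the same text, read against the run-to-run uncertainty of the measurement.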

22.9 What to carry forward

Quantization is the single most powerful knob an operator has on local inference economics. Each halving of bit width roughly halves weight memory and, while decode stays bandwidth-bound, roughly doubles the decode ceiling. It decides whether your chosen model lands on the comfortable side of the RAM cliff or the ugly side. And the accuracy cost, with the right method, is small enough that dropping below full precision is almost always worth it.

The operator takeaways, as they map onto the paper-summary task this book has carried through:

  • Q4_K_M is the default for local GGUF workloads on 7B-to-30B dense models. Pick the largest model whose Q4_K_M weights fit your card with headroom, and your summary will come back in thirty seconds rather than ten minutes. K-quants are a community invention, not a peer-reviewed formalism; the llama.cpp source and community discussions are the authoritative references (llama.cpp contributors 2024, 2023b).
  • The quality-retention percentages published in operator guides, including the Synoros research article this chapter absorbs, are aggregated community heuristics. They are approximately right; they are not peer-reviewed point estimates. Treat them as order-of-magnitude guidance, and verify on your own workload when quality matters.
  • Post-training quantization (GPTQ, AWQ, SpQR, K-quants) covers almost every operator’s needs. Quantization-aware training (EfficientQAT and its successors) is what makes 2-bit work and is worth reaching for only at the low-bit frontier or when you are the model producer.
  • The rotation-based methods (QuaRot (Ashkboos et al. 2024), SpinQuant (Liu, Zhao, et al. 2024), TurboQuant (Zandieh et al. 2026)) are the 2024-to-2026 research frontier for sub-4-bit PTQ and KV cache quantization. We cover them in Chapter 24.
  • The RAM performance cliff is the single most important local-inference failure mode to avoid. Fit the model in VRAM with headroom, always, in preference to running a larger model that spills.
Ashkboos, Saleh, Amirkeivan Mohtashami, Maximilian L. Croci, et al. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. arXiv:2404.00456. https://arxiv.org/abs/2404.00456.
Bartowski. 2026. GGUF Quantization Repositories with Methodology Disclosure. Hugging Face model hub. https://huggingface.co/bartowski.
Chen, Mengzhao et al. 2025. “EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). https://arxiv.org/abs/2407.11062.
Costantino, Jack. 2026. Quantization Explained: VRAM, Quality, and the RAM Performance Cliff. Synoros research, synoros.io/research. https://synoros.io/research/quantization-vram-ram-performance-guide.
Dettmers, Tim, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. “LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2208.07339.
Dettmers, Tim, Ruslan Svirschevski, Vage Egiazarian, et al. 2024. “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.” International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2306.03078.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.17323.
Lin, Ji, Jiaming Tang, Haotian Tang, et al. 2024. “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration.” Proceedings of Machine Learning and Systems (MLSys 2024). https://arxiv.org/abs/2306.00978.
Lin, Yujun, Haotian Tang, Shang Yang, et al. 2025. “QServe: W4A8KV4 Quantization and System Co-Design for Efficient LLM Serving.” Proceedings of Machine Learning and Systems (MLSys 2025). https://arxiv.org/abs/2405.04532.
Liu, Zechun, Barlas Oguz, Changsheng Zhao, et al. 2024. “LLM-QAT: Data-Free Quantization Aware Training for Large Language Models.” Findings of the Association for Computational Linguistics (ACL 2024). https://arxiv.org/abs/2305.17888.
Liu, Zechun, Changsheng Zhao, et al. 2024. SpinQuant: LLM Quantization with Learned Rotations. Meta FAIR, arXiv:2405.16406. https://arxiv.org/abs/2405.16406.
llama.cpp contributors. 2023a. Performance of llama.cpp on Apple Silicon M-Series. GitHub Discussions, ggml-org/llama.cpp #4167. https://github.com/ggml-org/llama.cpp/discussions/4167.
llama.cpp contributors. 2023b. Perplexity (Quality of Generation) Scores. GitHub Discussions, ggml-org/llama.cpp #406. https://github.com/ggml-org/llama.cpp/discussions/406.
llama.cpp contributors. 2024. Quantization (k-Quant Design Notes and PR History). ggml-org/llama.cpp repository and wiki. https://github.com/ggml-org/llama.cpp/tree/master/examples/quantize.
Rajput, Subhash et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.
Sagan, Carl. 1995. The Demon-Haunted World: Science as a Candle in the Dark. Random House.
Sheng, Ying, Lianmin Zheng, Binhang Yuan, et al. 2023. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” International Conference on Machine Learning (ICML 2023). https://arxiv.org/abs/2303.06865.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2026. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” International Conference on Learning Representations (ICLR 2026, Accepted). https://arxiv.org/abs/2504.19874.