To squeeze a model into less memory than it was born with, you twist its weights into a shape that fits, then twist them back out at the moment you use them.
That twist is the whole story of this chapter. Chapter 22 laid out the quantization methods most operators run today. GGUF K-quants for local CPU and GPU work. GPTQ and AWQ for server-side GPU inference with vLLM or TGI [two popular pieces of software that run models behind an API]. LLM.int8 and bitsandbytes [libraries for running models at 8 bits a weight on consumer hardware] for the 8-bit-on-a-gaming-card case. All four follow the same playbook: look at the distribution of weights in a trained model, pick a rounding scheme that protects the weights that matter most, write the smaller file, ship it.
That playbook has been hitting a wall since 2023. Below 4 bits a weight, the wall is sharp. Naive rounding destroys accuracy. The tricks that got GPTQ and AWQ safely to 4 bits (calibration datasets [a small collection of example text used to guide the rounding], per-channel scales [different rounding step-sizes for different columns of the weight matrix], and protection for the channels that carry unusually large numbers) all stop working cleanly at 3 bits. Accuracy falls off a cliff. Section 4.4 called the accuracy-versus-bits curve “non-linear” and moved on. This chapter explains why it folds, and what researchers did about it.
The answer, worked out over 2024 and 2025, is rotation. Before you quantize the weights, you multiply them by an orthogonal matrix [a square grid of numbers that, when you multiply by it and then by its mirror image, lands you back where you started]. Doing that shuffles the weights into a new statistical shape. The outliers, the individual weights that were much larger than the rest, get spread evenly across all the coordinates. A lower-bit grid suddenly fits the new shape cleanly. At matmul time [the moment the model multiplies two matrices together to produce the next layer’s output], the rotation gets undone by its inverse, and the arithmetic comes out right.
Three of the methods in this chapter, QuaRot (Ashkboos et al. 2024), SpinQuant (Liu et al. 2024), and TurboQuant (Zandieh et al. 2026), are different answers to one question: what rotation? Alongside them we cover one non-rotation method worth pairing with them, EfficientQAT (Chen et al. 2025), and one production serving system that stitches several of these ideas together, QServe (Lin et al. 2025).
Come back to the forty-page paper we have been passing through every chapter. You want to summarize it on a local card instead of a cloud API. The model has to fit in VRAM [the memory sitting on the graphics card, shared between the weights and the running conversation]. Weights at 4 bits fit today. Below 4 bits, or once you paste in a long enough paper that the KV cache [the model’s running scratch-memory of everything you have typed so far] swells past the weights themselves, you are in the territory this chapter covers. “Which quantization can I run this paper summary through, and when does it break?” is the question the decision matrix in Section 24.7 answers.
No published paper, as of April 2026, summarizes that full decision space. For most operators reading this book, the matrix is where you go when a colleague asks “which quantization should I use?” The sections before the matrix build the reasoning that makes the matrix trustworthy.
24.1 Why naive low-bit quantization fails
The reason 4 bits is easy and 3 bits is hard comes down to statistics. Specifically, the statistics of what transformer weights look like when you lay them out on a number line. Before we can talk about rotations, we have to look at those distributions, because the rotation story only makes sense once you see what it is trying to fix.
Start with the grid. A quantization grid is the set of values you are allowed to round a weight to. A 4-bit grid has 16 levels. A 3-bit grid has 8. A 2-bit grid has 4. When you quantize a matrix, every weight snaps to the nearest grid point. The rounding error is the distance between where the weight wanted to be and where you had to put it. Two things decide how bad that error gets: how wide your grid has to span, and how few levels you have to cover that span.
Transformer weights, and the activation values [the numbers flowing through the model as it processes a sentence] that multiply them, have a specific and unpleasant shape. Most of the numbers cluster near zero, in a narrow band you could cover with your thumb. But a small fraction, often less than one percent, are much bigger than the rest. Much bigger. Dettmers and colleagues identified these in the LLM.int8 paper (Dettmers et al. 2022) as activation outliers, and the observation is the single most important empirical fact about LLM quantization. These big values are not noise. They carry information the model needs. Delete them and the model’s behavior collapses.
Here is what that looks like if you can picture it physically. Imagine a ruler laid across a playground. Most of the children are bunched between your feet and the three-yard mark, shoulder to shoulder. One child is standing sixty yards away, at the back fence. Now you have to mark the ruler with sixteen tick marks total, and you need at least one tick near the kid at the fence, because otherwise you lose track of where that child is entirely.
If the ticks are spaced evenly from zero out to sixty yards, every child in the near bunch gets snapped onto the same mark as their neighbor. You cannot tell them apart anymore. The one kid at the fence pulled the ruler wide open, and everybody near your feet paid for it.
That is exactly what outliers do to a quantization grid. The grid has to be wide enough to represent the largest value, or it clips the outlier to its edge and loses how big that weight was. But the grid is fixed-resolution: you only get 16 or 8 or 4 levels total. A grid wide enough to cover an outlier twenty times farther from zero than the bulk of the weights has grid spacing twenty times what it would have been without the outlier. The 99% of weights near zero snap onto a coarse ladder they should not be on. The outliers hold the grid open. Everything else pays the price.
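The playground picture can be checked with a few lines of arithmetic. The sketch below is illustrative only: the thirty evenly spaced "children", the single value at 60, and the helper names are all made up for the example, and the quantizer is plain round-to-nearest on an evenly spaced 16-level grid.

```python
def quantize_uniform(values, levels):
    """Round each value to the nearest of `levels` evenly spaced grid points
    spanning the min..max of the data (plain round-to-nearest)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(values, quantized):
    return sum(abs(v - q) for v, q in zip(values, quantized)) / len(values)


# Thirty "children" bunched between 0 and 3 yards (evenly spaced for clarity).
bunch = [3 * i / 29 for i in range(30)]
with_outlier = bunch + [60.0]  # one child at the back fence

# Grid fitted to the bunch alone: 16 ticks across 3 yards.
err_bunch_only = mean_abs_error(bunch, quantize_uniform(bunch, 16))

# Grid stretched to reach the fence: 16 ticks across 60 yards.
quantized_all = quantize_uniform(with_outlier, 16)
err_bunch_stretched = mean_abs_error(bunch, quantized_all[:30])
distinct_marks = len(set(quantized_all[:30]))

print(f"error without outlier: {err_bunch_only:.3f} yards")
print(f"error with outlier:    {err_bunch_stretched:.3f} yards")
print(f"tick marks shared by the whole bunch: {distinct_marks}")
```

With the fence included, the ticks sit 4 yards apart, so all thirty near values land on just two marks and their average error grows by more than a factor of ten.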
GPTQ (Frantar et al. 2023) and AWQ (Lin et al. 2024) are both clever answers to this problem. GPTQ uses second-order information [a mathematical measure of which weights the model’s accuracy is most sensitive to] to decide which weights are most fragile under rounding, then picks a rounding order that keeps the total damage small. AWQ notices that a small set of weight channels, the ones that meet unusually large activations, matter far more than the rest. It scales those channels up before quantizing so that rounding costs them proportionally less, and quietly folds the compensating inverse scale into the activations that feed the layer. SpQR (Dettmers et al. 2024) takes the most direct route of all: find the outliers, store them separately at full precision in a little side-file, quantize everything else to 3 or 4 bits. All three work at 4 bits. All three struggle below 4.
Why they struggle below 4 bits is this: the outliers have to be located before they can be protected. Calibration-based methods assume you can find and isolate them ahead of time. At 4 bits, the grid has enough headroom that a few mistakes in that identification are survivable. At 3 bits, the grid has half the levels, and every channel that was sort of an outlier but did not show up strongly in your calibration set starts clipping. The problem stops being “protect the 1% outliers.” It becomes “the distribution itself is hostile to low-resolution grids, and we need to change the distribution.”
Back to the paper-summary task. If you quantize a 9B model down to 4 bits, you can load it on a consumer card and summarize the paper just fine. If you try to push the same model to 3 bits using GPTQ or AWQ, the first few bullets of the summary will read fine, and then somewhere around bullet three the model will produce a sentence with a number in it that is simply wrong. Not because the model forgot; because the weights that encoded that specific fact got snapped onto a grid that no longer had room for them. This is the failure mode the rotation methods in this chapter exist to fix.
Rotations work at all because they preserve inner products. Matrix multiplication inside a transformer is almost entirely inner products: take a row of one matrix, line it up against a column of another, multiply pair by pair, add the results. An orthogonal rotation (a square matrix whose inverse equals its transpose) has the property that if you rotate both vectors by the same rotation before taking their inner product, you get exactly the same number out as if you had not rotated them. So you can rotate the weights before quantizing, rotate the activations by the same rotation on the fly as they stream through the model, run the quantized multiply, and get the right answer. The rotation itself introduces no error at matmul time, and it adds little compute as long as the rotation matrix is cheap to apply.
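The inner-product property is easy to verify by hand. In symbols, for an orthogonal R, (Rw)·(Rx) = wᵀRᵀRx = wᵀx. The sketch below checks that with a plain two-dimensional rotation; the angle, the vectors, and the helper names are arbitrary choices for illustration.

```python
import math


def rotation_2d(theta):
    """A 2x2 rotation matrix; its transpose is its inverse."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


R = rotation_2d(0.7)   # any angle works
w = [3.0, -1.0]        # stand-in for one row of a weight matrix
x = [0.5, 2.0]         # stand-in for an activation vector

plain = dot(w, x)                          # inner product in the original basis
rotated = dot(matvec(R, w), matvec(R, x))  # both sides rotated by the same R

print(plain, rotated)  # equal up to floating-point rounding
```

The same identity holds in any dimension, which is what lets the weights be rotated once offline and the activations be rotated to match as they arrive.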
24.2 The Hadamard rotation and QuaRot
The cheapest rotation that is mathematically useful is the Hadamard transform. A Hadamard matrix is a square grid of +1s and -1s, the same width as your hidden dimension [the number of coordinates the model uses to represent each token], built so that when you multiply it by its own transpose you get the identity scaled by the width d. Divide every entry by √d and it becomes an orthogonal rotation. It has a second property that matters enormously in practice: you can apply a Hadamard matrix to a vector of length d in roughly d log d operations instead of the d² a general rotation would cost. The algorithm is called the fast Walsh-Hadamard transform, and it has been standard in signal processing for decades.
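For readers who want to see the d log d trick, here is a minimal pure-Python sketch of the fast Walsh-Hadamard transform. It assumes the Sylvester (power-of-two) construction and folds in the 1/√d scaling that makes the ±1 matrix orthonormal; the function name and the test vector are illustrative, and a production kernel would run this on the GPU instead.

```python
import math


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform.

    Works on lengths that are a power of two and uses about d*log2(d)
    additions instead of the d*d multiplies of a dense matrix product.
    """
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in out]


x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0]  # one large "outlier" coordinate
y = fwht(x)        # rotated: the large value is shared across all coordinates
x_back = fwht(y)   # the normalized transform is its own inverse

print([round(v, 3) for v in y])
print([round(v, 3) for v in x_back])
```

Applying the transform twice returns the original vector, and the largest coordinate after one pass is well below the original 8.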
QuaRot (Ashkboos et al. 2024) was the 2024 paper that put this together for transformer weights. The recipe is almost annoyingly simple:
- Pick a Hadamard matrix sized to the model’s hidden dimension.
- Multiply every weight matrix by it, on the side that makes the rotations cancel each other out as the data flows through the network. The network still computes exactly the same function it did before. No calibration is needed, because the Hadamard matrix is a fixed mathematical object, not something you derive from data.
- In the rotated basis, the weight and activation distributions have been shuffled by the ±1 mixing. The outliers, which used to live in specific channels, are now spread across every channel.
- Quantize the rotated weights at 4 bits. The distribution is now bell-curve shaped, and 4 bits fits cleanly. At matmul time, apply a small Hadamard transform to the activations to match, run the quantized matmul, and keep going.
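The recipe above can be sketched on a single weight row. This is a toy illustration under stated assumptions, not QuaRot's code: the eight-element vectors are made up, the rotation is a plain normalized Hadamard transform, and `int4_step` is a hypothetical helper that reports the step size of a symmetric 4-bit grid (codes -7 to 7) that just covers the data.

```python
import math


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out, h = list(vec), 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                out[i], out[i + h] = out[i] + out[i + h], out[i] - out[i + h]
        h *= 2
    return [v / math.sqrt(len(out)) for v in out]


def int4_step(vec):
    """Step size of a symmetric 4-bit grid (codes -7..7) that just covers vec."""
    return max(abs(v) for v in vec) / 7


# One weight row with a single outlier channel, and one activation vector.
w = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, 6.0]
x = [1.0, 0.5, -1.0, 2.0, 0.0, -0.5, 1.5, 0.25]

w_rot = fwht(w)  # done once, offline, before quantizing
x_rot = fwht(x)  # done on the fly as activations stream through

exact = sum(a * b for a, b in zip(w, x))
exact_rot = sum(a * b for a, b in zip(w_rot, x_rot))  # same number, new basis

step_before = int4_step(w)      # grid held open by the outlier
step_after = int4_step(w_rot)   # outlier spread across all 8 coordinates

print(f"inner product: {exact:.4f} vs {exact_rot:.4f}")
print(f"4-bit step:    {step_before:.3f} before, {step_after:.3f} after rotation")
```

The inner product is unchanged, while the 4-bit step size drops by more than half, which is the room the small weights needed.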
The phrase “data-free” is doing real work in the QuaRot paper. QuaRot does not need a calibration dataset, because the Hadamard matrix is not learned from data. You pick it once and you are done. That has two consequences an operator cares about. First, the method is deterministic and reproducible: no “we calibrated on wikitext and the numbers came out slightly different on a different run.” Second, the method generalizes across distributions: a QuaRot-quantized model does not silently get worse when you ask it questions outside whatever calibration set the quantizer happened to see, because there was no calibration set.
QuaRot’s headline result is end-to-end 4-bit inference. Weights at 4 bits, activations at 4 bits, KV cache at 4 bits. That is stronger than what GGUF or AWQ do in practice. The GGUF quantization you run locally is weights-only 4-bit: the weights live at 4 bits on disk and in memory, but the activations flowing through the model are computed at 16 bits, so the 4-bit weights have to be dequantized back to 16 bits before each matrix multiply. End-to-end 4-bit means the multiply itself happens at 4-bit precision, which roughly quadruples the arithmetic throughput on hardware with INT4 tensor cores [special circuitry on newer datacenter GPUs that multiplies 4-bit numbers in bulk]. For datacenter serving, that is the real win. For consumer hardware, which typically does not expose INT4 tensor cores yet, the payoff is memory and bandwidth rather than raw arithmetic. Still substantial, but less dramatic.
QuaRot’s accuracy at 4 bits is about tied with AWQ on standard models like Llama 2. The bigger claim is at 3 bits, where QuaRot stays usable on models where AWQ has collapsed, because the rotation has already absorbed the outlier problem AWQ was designed to patch around.
For the paper-summary task, QuaRot is the first method in this chapter that could, in principle, let you run the summary end-to-end at 4-bit precision on a datacenter card with INT4 support. The model, the activations, the cache, all of it at 4 bits, with throughput measured in papers per second rather than one paper per minute.
24.3 Learned rotations: SpinQuant
QuaRot picks a fixed Hadamard matrix and calls it done. That works well on average. On specific models, especially smaller ones with narrower hidden dimensions or unusually spiky activation distributions, a fixed Hadamard is not quite the right rotation. SpinQuant (Liu et al. 2024), from Meta FAIR, pushes this further by learning the rotation to fit the specific model you are quantizing.
The mathematical subtlety is that you cannot just treat the rotation matrix as a free parameter and train it with ordinary gradient descent. After one step of descent, it would no longer be orthogonal. The “rotation is free at matmul time” property would break, and the accuracy guarantee would break with it.
Here is the picture. The space of all square matrices is an open room you can walk in any direction inside. The space of orthogonal matrices is a curved surface inside that room, like the skin of a globe suspended in the middle. Ordinary gradient descent steps in straight lines through the room. Step off the globe in a straight line and you are no longer on an orthogonal matrix. Step back toward where you wanted to go, and you still are not back on the globe.
SpinQuant uses a method called Cayley optimization to stay on the globe. In plain English, it is a variant of gradient descent that takes each step along the surface of orthogonal matrices rather than through the surrounding room. The rotation stays orthogonal throughout training. The trick is named after Arthur Cayley, the 19th-century mathematician who wrote down the mapping that makes this work. The details do not matter to anyone running a model. What matters is that the optimization is tractable, takes a few hours of GPU time on a small calibration set, and hands you back a rotation tailored to this specific model.
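For the curious, the globe picture can be made concrete. The sketch below is an illustration of the Cayley transform itself, not SpinQuant's training loop: the 3×3 size, the made-up gradient `G`, the learning rate, and all helper names are assumptions chosen for the example. It compares one ordinary gradient step with one step taken through the Cayley map Q = (I − A)⁻¹(I + A), where A is skew-symmetric.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(m)
    aug = [list(row) + eye_row for row, eye_row in zip(m, identity(n))]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley(skew):
    """Map a skew-symmetric matrix A to the orthogonal matrix (I - A)^-1 (I + A)."""
    n = len(skew)
    eye = identity(n)
    i_minus = [[eye[i][j] - skew[i][j] for j in range(n)] for i in range(n)]
    i_plus = [[eye[i][j] + skew[i][j] for j in range(n)] for i in range(n)]
    return matmul(inverse(i_minus), i_plus)


def orth_error(m):
    """Largest entry of |M^T M - I|; zero for a perfectly orthogonal matrix."""
    prod, n = matmul(transpose(m), m), len(m)
    return max(abs(prod[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


R = identity(3)                                   # current rotation
G = [[0.0, 1.0, 0.5], [-0.3, 0.2, 0.0], [0.4, -0.1, 0.3]]  # made-up gradient
lr = 0.1

# Ordinary step: a straight line through the room, off the globe.
naive = [[R[i][j] - lr * G[i][j] for j in range(3)] for i in range(3)]

# Cayley step: build a skew-symmetric direction from G, then map it to a rotation.
GRt = matmul(G, transpose(R))
skew = [[-(lr / 2) * (GRt[i][j] - GRt[j][i]) for j in range(3)] for i in range(3)]
stepped = matmul(cayley(skew), R)

print(f"orthogonality error, naive step:  {orth_error(naive):.4f}")
print(f"orthogonality error, Cayley step: {orth_error(stepped):.2e}")
```

The naive update drifts off the surface after a single step, while the Cayley update stays orthogonal to within floating-point rounding.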
SpinQuant’s empirical results are the expected shape. On easy models (large hidden dimension, well-behaved activations), SpinQuant is about tied with QuaRot. On hard models (small hidden dimension, or distributions with stubborn outliers that a random Hadamard cannot fully disperse), SpinQuant pulls ahead. The paper’s strongest demonstrations are on Llama 3 8B at 4 bits and below, where the learned rotation recovers several points of accuracy that the fixed Hadamard leaves on the table.
The operational cost of SpinQuant versus QuaRot is calibration. SpinQuant needs a small calibration dataset (a few hundred sequences of text, a few megabytes) and a few hours of optimization on a single GPU to learn its rotation. QuaRot runs in minutes because its rotation is fixed. For a model producer shipping one quantized artifact to millions of users, SpinQuant’s extra hours are nothing. For an individual operator quantizing a model for personal use, QuaRot’s zero-calibration story is strictly simpler. Both exist in production; neither is strictly dominated.
For the paper-summary task specifically, the choice between the two methods is not yours to make as a solo operator. You will download whichever quantized release the model’s publisher chose to ship. If the publisher cared enough to learn a SpinQuant rotation, you get slightly better summaries on the same hardware. If they shipped a QuaRot variant, you get reproducible summaries that do not vary depending on who quantized the model or what text they quantized it against. Either way, you can run it.
24.4 TurboQuant and the KV cache problem
TurboQuant (Zandieh et al. 2026), from Zandieh, Daliri, Hadian, and Mirrokni at Google Research (accepted to ICLR 2026), is the newest piece of this story. It targets a problem QuaRot and SpinQuant treat as secondary. Both of those methods are built around weight matrices and activations. QuaRot does carry the KV cache down to 4 bits, but neither method is designed around the cache itself, the running stream of key and value vectors that attention accumulates as a sequence grows [the model’s scratch-memory of everything that has been said in the conversation so far, stored in a specific format attention can read].
Section 23.2 laid out why the KV cache matters. For any reasonably long context, the KV cache dominates VRAM. A 9B model with 32K of context can carry 8 to 10 GB of KV cache per request, more than the weights themselves at 4 bits. If you quantize the weights down to 4 bits and do nothing about the cache, you hit a new ceiling fast. The cache becomes the bottleneck, and all your hard-won weight-quantization savings get eaten the moment someone asks for a long context.
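The cache arithmetic is worth doing once by hand. The sketch below uses the standard count of two vectors (one key, one value) per layer, per KV head, per token. The model shape is hypothetical, picked only so the 16-bit figure lands in the range quoted above; substitute the layer count, KV-head count, and head dimension of the model you actually run.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Keys and values: two vectors per layer, per KV head, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8


GIB = 1024 ** 3

# Hypothetical 9B-class shape, chosen for illustration only.
shape = dict(n_layers=36, n_kv_heads=8, head_dim=256)
context = 32 * 1024  # 32K tokens

kv_fp16 = kv_cache_bytes(n_tokens=context, bits_per_value=16, **shape) / GIB
kv_4bit = kv_cache_bytes(n_tokens=context, bits_per_value=4, **shape) / GIB
kv_3bit = kv_cache_bytes(n_tokens=context, bits_per_value=3, **shape) / GIB
weights_4bit = 9e9 * 4 / 8 / GIB  # 9 billion weights at 4 bits each

print(f"weights at 4 bits:   {weights_4bit:.2f} GiB")
print(f"KV cache at 16 bits: {kv_fp16:.2f} GiB")
print(f"KV cache at 4 bits:  {kv_4bit:.2f} GiB")
print(f"KV cache at 3 bits:  {kv_3bit:.2f} GiB")
```

Under these assumed dimensions the unquantized cache is about twice the size of the 4-bit weights, and quantizing it to 3 or 4 bits brings it back under them.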
This is exactly the moment in the paper-summary task when things tip. Short papers fit in the weights budget: the weights quantize cleanly to 4 bits, the cache for a five-page abstract is negligible, and the summary runs. Long papers do not. Paste in a forty-page paper, and the cache for the input text alone pushes you past what 4-bit weights saved you. The weights quantized fine. The cache is now the ceiling. You need to quantize the cache too.
Quantizing the KV cache is harder than quantizing weights, and for a specific reason. The KV cache is online. You do not get to calibrate it once and ship it. Every token the model generates produces a new key vector and a new value vector that must be quantized on the fly, then read back later when subsequent attention steps attend to it. Any method that requires a calibration pass over the full distribution does not apply. Any method that needs to know the outlier channels in advance does not apply. The KV-cache quantizer has to work on each vector as it arrives, with no look-ahead.
TurboQuant’s contribution is a quantization scheme built specifically for that streaming setting. The recipe:
- Rotate each incoming vector by a random orthogonal rotation (cheap, implementable as a composition of Hadamard-like structured transforms). In high enough dimensions, a random rotation pushes the coordinate distribution of the vector toward something bell-curve-shaped, with the outliers dispersed evenly across all coordinates.
- Apply a scalar quantizer, which in plain English means: round each coordinate of the rotated vector independently to the nearest grid point, using a grid tuned to the expected shape. The scalar quantizer is optimal, in a well-defined mathematical sense, for the rotated distribution, because the rotation has broken the outlier structure.
- Store the quantized coordinates. At read time, apply the inverse rotation.
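The three steps above can be sketched for a single incoming key vector. This is a simplified stand-in, not the paper's algorithm: it assumes random sign flips followed by a Hadamard transform as the cheap random rotation, uses an evenly spaced grid scaled to each vector's largest rotated coordinate in place of the paper's tuned grid, and omits the residual correction entirely. All function names are hypothetical.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out, h = list(vec), 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                out[i], out[i + h] = out[i] + out[i + h], out[i] - out[i + h]
        h *= 2
    return [v / math.sqrt(len(out)) for v in out]


def make_signs(dim, seed=0):
    """Random +/-1 flips; with the Hadamard step they form a cheap random rotation."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def encode(vec, signs, bits=3):
    """Rotate, then round each coordinate independently. No look-ahead needed."""
    rotated = fwht([s * v for s, v in zip(signs, vec)])
    scale = max(abs(v) for v in rotated) or 1.0
    top = 2 ** (bits - 1) - 1          # 3 bits -> integer codes in -3..3
    codes = [round(v / scale * top) for v in rotated]
    return codes, scale


def decode(codes, scale, signs, bits=3):
    """Undo the rounding grid, then apply the inverse rotation."""
    top = 2 ** (bits - 1) - 1
    rotated = [c * scale / top for c in codes]
    return [s * v for s, v in zip(signs, fwht(rotated))]


signs = make_signs(8)                  # fixed once, shared by every vector
key = [0.1, -0.2, 0.05, 9.0, -0.1, 0.2, -0.05, 0.15]  # arrives mid-generation

codes, scale = encode(key, signs)      # what gets stored in the cache
restored = decode(codes, scale, signs) # what attention reads back later

l2_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(key, restored)))
print(codes, round(scale, 3), round(l2_error, 3))
```

Each vector is handled on its own as it arrives, which is the property the streaming setting demands; the only state shared across vectors is the fixed sign pattern.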
Two further technical ideas make this work. First, the paper derives the optimal scalar quantizer analytically for the rotated distribution (which turns out to be close to Beta-distributed in high dimensions), so the grid levels are not arbitrary. They are chosen to minimize mean-squared error for that specific distribution shape. Second, for attention, the quantity that matters is not the raw vector but the inner product between a query vector and a key vector. That is what attention computes at every step. A single-stage scalar quantizer has a small bias in its estimate of that inner product. TurboQuant stacks a one-bit residual (a tiny binary correction, called a “QJL residual” after the Johnson-Lindenstrauss lemma that motivates it) on top of the main quantizer to cancel the bias. The result is an inner-product estimate that is unbiased and concentrated within a tight error bound, which is exactly what attention needs to produce correct output.
The headline measurement from the paper: KV cache at 3 bits per coordinate with near-zero accuracy loss on Gemma and Mistral across standard benchmarks. At 2.5 bits, accuracy loss is marginal on most tasks. That is dramatically better than what 3-bit weight quantization on its own achieves. TurboQuant does better on the cache than weight-quantization methods do on weights, because the streaming setting let the authors design around online vector statistics directly rather than fixing up the existing weight distribution after the fact.
For operators, TurboQuant is the first method that makes long-context local inference genuinely tractable. If you can quantize both the weights (at 4 bits with a GGUF K-quant or AWQ) and the KV cache (at 3 bits with TurboQuant), you can hold a 128K-token context on hardware that used to choke at 32K. 128K tokens is roughly 95,000 words, the length of a full novel, or four dense academic papers stitched end to end. For the running summary task, that is the difference between feeding the model one paper at a time and feeding it an entire literature review at once. That is not a marginal improvement. That is a step change in what a 16 GB consumer card can do.
The current practical catch: TurboQuant is not yet in mainline llama.cpp. The community tracking thread is llama.cpp Discussion #20969 (llama.cpp contributors 2025), which as of April 2026 documents active experimentation but no merged PR. For NVIDIA users, TurboQuant-style KV quantization is appearing in research branches of vLLM and the Google Research code release. For AMD users on Vulkan, the method is blocked on the upstream llama.cpp work. We flag this in the decision matrix as a “frontier method, ship dependent.”
One more note before leaving this section. The paper’s authors are from Google Research, and they are publishing the method in peer-reviewed venues. We should still run Sagan’s checklist over the central claim, 3-bit KV with near-zero accuracy loss, before treating it as settled. Is the result reproducible? The paper ships code, and independent benchmarks in the llama.cpp discussion thread corroborate the direction. Does a simpler explanation fit? The simpler explanation, that 3-bit KV should simply not work for the reasons outlined in Section 24.1, is exactly what the paper’s mathematical argument addresses; the reason this is publishable is that the outcome was surprising. Who benefits from the claim being true? Google Research benefits from publications; a claim that overstates results would not have survived ICLR review. All three checks pass, modestly. Not none. Not fully. But enough to warrant including the method in the matrix as frontier rather than either hype or fact.
24.5 The 2-bit frontier: EfficientQAT
The three methods we have seen so far are all post-training quantization. You take a trained model, compress it, serve it. No gradients flow back through the quantized weights during the original training. That is enough to get to 3 bits with care. Getting to 2 bits, cleanly, needs a different tool entirely.
Two bits per weight has four possible values. Regardless of how cleverly you rotate and grid, four values per weight is not enough resolution to capture what the original floating-point weight was doing in most cases. Post-training methods at 2 bits degrade noticeably, often five or ten points of accuracy on hard benchmarks. That is more than an operator can absorb for most production uses. For the paper-summary task, the model at 2 bits post-training will still produce summary-shaped output, but the specific numbers, citations, and technical terms that need to land precisely start coming out subtly wrong in ways the reader cannot easily catch. That failure mode is exactly what we were trying to avoid in Section 1.3, and 2-bit post-training drops us right into it.
Quantization-aware training (QAT) is the alternative. Instead of quantizing after training finishes, you simulate the quantization during a fine-tuning pass. At each training step, the weights round to the low-bit grid, the loss is computed on the rounded weights, and gradients flow back to update the (full-precision) weights in a direction that will round to a better low-bit configuration. After training, you throw away the full-precision shadow weights and keep the 2-bit ones. Because the model has been trained with the quantization in the loop, it has learned to compensate for the rounding error in ways no post-training method can match.
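The training loop described above fits in a few lines for a single weight. The sketch assumes the straight-through estimator, one common way to let gradients pass through a rounding step by treating it as the identity on the backward pass; the source does not say which estimator a given QAT method uses. The 2-bit grid, target, starting weight, and learning rate are made up for illustration.

```python
GRID = (-1.5, -0.5, 0.5, 1.5)  # a 2-bit grid: four allowed values


def fake_quant(w):
    """Snap a full-precision weight to the nearest grid point."""
    return min(GRID, key=lambda g: abs(g - w))


def loss(q, target):
    return (q - target) ** 2


def train(w, target, lr=0.1, steps=50):
    """Update the full-precision shadow weight using the loss on the rounded one."""
    for _ in range(steps):
        q = fake_quant(w)          # forward pass sees the rounded weight
        grad = 2 * (q - target)    # gradient of the loss with respect to q
        w -= lr * grad             # straight-through: treat dq/dw as 1
    return w


target = 0.9   # the value the layer "wants" this weight to have
w0 = -1.2      # starting shadow weight, far from the target

w_final = train(w0, target)
q_start, q_final = fake_quant(w0), fake_quant(w_final)

print(f"start: rounds to {q_start}, loss {loss(q_start, target):.2f}")
print(f"end:   rounds to {q_final}, loss {loss(q_final, target):.2f}")
# After training, only q_final is kept; the shadow weight w_final is discarded.
```

The shadow weight drifts until its rounded value sits on one of the two grid points that bracket the target. In this toy it then hovers near the boundary between them, a reminder that real QAT recipes need care to train stably.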
The problem QAT has always had is cost. Running a full fine-tuning pass on a 70-billion-parameter model is expensive, and if you do it only to produce a 2-bit version, you are spending datacenter time to save consumer memory. For most operators, the arithmetic does not work.
EfficientQAT (Chen et al. 2025), from Chen and colleagues (accepted to ACL 2025), is the paper that made the arithmetic work. The insight is block-wise QAT. Instead of backpropagating gradients through the entire 70B model at once, the authors break the model into blocks of layers, then fine-tune each block’s quantized weights to match the full-precision block’s output. Each quantized block is taught to mimic its original. The blocks stitch back together at the end. The training objective is local; the gradient computations stay small enough to run on one GPU.
The headline measurement from the paper: 2-bit quantization of Llama-2-70B, fine-tuned on a single A100 in 41 hours, with less than 3 points of accuracy lost versus the full-precision baseline. Forty-one hours is not nothing, but on a single rented A100 it is within budget for a serious home operator. The resulting model is roughly 18 GB at 2 bits, which fits on a 24 GB consumer card (RTX 3090, 4090, 7900 XTX) with room left over for the KV cache.
What EfficientQAT makes practical is 70B-on-a-single-consumer-GPU as a real deployment pattern. A 70-billion-parameter model holds roughly nine learned numbers for every person alive. At 2 bits each, those numbers pack into 18 GB, about the size of a high-definition movie on your laptop. Before EfficientQAT, a 70B model at 4 bits needed 40+ GB, which meant two 24 GB cards or a datacenter card. At 2 bits with EfficientQAT, one 24 GB card is enough. The accuracy loss of about 3 points matters for some workloads and is negligible for others. For general assistant use, conversation, code assistance, and yes, paper summaries that are rough-cut rather than citation-precise, it is often not distinguishable in practice from the full-precision model.
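The memory figures follow from simple arithmetic. The sketch below computes raw weight storage and, optionally, the overhead of per-group scales. The group size of 128 and the 16-bit scale are illustrative assumptions, not values taken from the paper, and real files also carry embeddings, metadata, and runtime buffers that push the totals higher.

```python
def weight_gb(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (10^9 bytes).

    If group_size is given, add one scale of `scale_bits` bits per group
    of weights (an assumed, simplified overhead model).
    """
    bits = bits_per_weight
    if group_size:
        bits += scale_bits / group_size
    return n_params * bits / 8 / 1e9


raw_2bit = weight_gb(70e9, 2)                      # weights alone
raw_4bit = weight_gb(70e9, 4)
grouped_2bit = weight_gb(70e9, 2, group_size=128)  # with assumed per-group scales

print(f"70B at 2 bits, raw:           {raw_2bit:.1f} GB")
print(f"70B at 2 bits, grouped (128): {grouped_2bit:.1f} GB")
print(f"70B at 4 bits, raw:           {raw_4bit:.1f} GB")
```

The raw 2-bit figure is 17.5 GB and the raw 4-bit figure is 35 GB; scales and runtime overhead account for the gap to the rounder numbers quoted in the text.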
The 2025 and 2026 literature has pushed further with variants (ZeroQAT, RCP, StableQAT, and others) that reduce calibration data requirements or stabilize the training. As of April 2026, these are research-stage and we do not have clean operator-facing recommendations for them yet. EfficientQAT is the anchor that has a clean reference implementation, has been reproduced independently, and has been picked up by downstream quantization toolchains. Treat it as the production reference for 2-bit QAT, and treat the follow-ons as “the frontier to watch, not deploy.”
24.6 QServe and serving-stack co-design
The methods so far in this chapter are quantization algorithms. Recipes for turning high-precision weights into low-precision weights with controlled accuracy loss. Running those quantized weights fast on real hardware is a separate problem. You can have a cleanly quantized 4-bit model and get terrible throughput because the serving stack [the software layer that takes incoming requests, batches them, and hands them to the GPU] is not using the low-bit arithmetic efficiently, or because the kernel [the specific piece of GPU code that does the matrix multiply] fights the GPU’s memory hierarchy.
QServe (Lin et al. 2025), from MIT and NVIDIA at MLSys 2025, is the paper that addresses this end-to-end. Its contribution is a co-designed algorithm and system: the quantization format (called W4A8KV4, meaning weights at 4 bits, activations at 8 bits, KV cache at 4 bits) and the serving kernel are designed together, so each accommodates the other. The algorithm is called QoQ (Quattuor-Octo-Quattuor, “four-eight-four” in Latin).
The specific co-design choices matter. W4A8 (weights 4-bit, activations 8-bit) is a deliberate asymmetry. INT8 tensor cores are widely available, and keeping activations at 8 bits relieves the pressure on activation outliers that pure W4A4 would create. KV4 (KV cache at 4 bits) leans on SmoothAttention, a scaling step that shifts outliers out of the keys before they are quantized, while a rotation similar to QuaRot’s handles the activations feeding each block. The kernels are hand-tuned to keep the activation and weight streams aligned in GPU register files, so the quantized multiplication happens with minimal overhead.
The measured throughput gain is 1.2 to 3.5× over TensorRT-LLM on A100 and L40S, depending on the model and batch size. For datacenter serving, that is enormous. Every multiplier on throughput is a direct multiplier on revenue per GPU. QServe has been picked up in production at several large inference providers through 2025 and 2026.
For the paper-summary task, QServe is what is quietly running behind the cloud API you call when you do not run the summary locally. You do not interact with it. You cannot download it and run it on your laptop. But when an inference provider claims their endpoint is fast and cheap, and the numbers on their page say 3x speedup over a TensorRT-LLM baseline, QServe or something structurally like it is what is making those numbers real.
For consumer operators, QServe is not directly runnable. It targets NVIDIA datacenter GPUs with specific tensor-core support, and the kernel code is CUDA [NVIDIA’s programming language for GPU code]. The reason to know about it anyway is that QServe defines the bar for what a well-engineered quantization-plus-serving stack can deliver. If you are evaluating a managed inference provider, ask whether they use QServe or an equivalent co-designed stack. If they do, the numbers on their page are plausible. If they do not and they claim 3x speedups over TensorRT-LLM, something else is going on, and Sagan’s checklist from Section 1.4 applies.
24.7 Decision matrix: when to reach for which method
This is the chapter’s load-bearing contribution. The preceding sections described what each method does. This section tells you when to reach for each one. No single published paper summarizes this matrix, so we have assembled it from the primary sources and first-party experience.
The matrix is organized by workload class, because the right quantization method depends on where and how you are running the model, not on which paper was most recently published. The running paper-summary task falls into the first two rows if you are running locally, and into the GPU-serving and datacenter rows (the fourth and seventh) if you are hitting a hosted API. The rest of the matrix covers deployment patterns you will reach for once “one paper at a time” is not enough.
| Workload | Precision target | Recommended method | Why |
|---|---|---|---|
| Local consumer inference, llama.cpp or Ollama | 4-bit weights | GGUF Q4_K_M (Frantar et al. 2023; llama.cpp contributors 2024) | Mainline support, well-understood, one-line download. Still the default for 99% of local users. |
| Local consumer inference, longer context than weights alone allow | 4-bit weights + KV quant | Q4_K_M weights + Q8_0 or Q4_0 KV cache (today) or TurboQuant KV (when upstream lands) (Zandieh et al. 2026; llama.cpp contributors 2025) | KV quantization is what makes long context on consumer cards possible. TurboQuant is the research answer; mainline llama.cpp KV quant is the production answer today. |
| Local GPU inference, Apple Silicon | 4-bit weights | GGUF Q4_K_M via llama.cpp, or MLX-native 4-bit (Rajput et al. 2025) | MLX is faster on M-series; GGUF is more portable. Either works at 4 bits. |
| GPU serving, vLLM or TGI, 4-bit | 4-bit weights, 16-bit activations | AWQ (Lin et al. 2024) or GPTQ (Frantar et al. 2023) | Native integration in both engines. AWQ typically edges GPTQ on accuracy; GPTQ is more widely available for older models. |
| GPU serving, end-to-end 4-bit matmul | W4A4KV4 | QuaRot (Ashkboos et al. 2024) | Data-free, no calibration, runs cleanly at 4-bit on tensor cores that support INT4. |
| GPU serving, hard models (narrow hidden dim, spiky activations) | W4A4 or below | SpinQuant (Liu et al. 2024) | Learned rotation recovers accuracy that fixed Hadamard leaves on hard distributions. Worth the calibration cost for producer-side shipping. |
| Datacenter serving, maximize throughput per GPU | W4A8KV4 co-designed kernels | QServe / QoQ (Lin et al. 2025) | 1.2 to 3.5x throughput vs TensorRT-LLM. Picks the practical sweet spot across algorithm and system. |
| Consumer deployment of 70B-class models on a 24 GB card | 2-bit weights | EfficientQAT (Chen et al. 2025) | The only method today that delivers 2-bit 70B on a single consumer GPU with tolerable accuracy loss. Needs 40+ hours of A100 time to produce; use a community pre-quantized release if available. |
| Model producer shipping multiple quant tiers | Mix of above | Pipeline: train full-precision, then GPTQ/AWQ for 4-bit, QuaRot/SpinQuant for end-to-end 4-bit, EfficientQAT for 2-bit | Single training run produces a range of quantized releases. What Meta, Qwen, DeepSeek do in practice. |
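The 24 GB row comes down to simple arithmetic. As a rough sketch, assume weight storage is just parameter count times bits per weight. Real quantized files also carry scales and zero-points, so effective bits per weight run somewhat higher. The helper name `weight_gb` is invented for this example.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: ignores scales, zero-points, and any layers kept at
    higher precision, so real files are somewhat larger.
    """
    return params_billion * bits_per_weight / 8


CARD_GB = 24  # the consumer card from the table row

for bits in (16, 4, 3, 2):
    size = weight_gb(70, bits)
    verdict = "fits" if size < CARD_GB else "does not fit"
    print(f"70B at {bits}-bit: {size:.1f} GB, {verdict} in {CARD_GB} GB (before KV cache)")
```

Under this simplification, a 70B model needs 35 GB at 4 bits and 17.5 GB at 2 bits, which is why the 24 GB row has to reach for a 2-bit method.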
A few cross-cutting heuristics that do not fit in the table:
- If you are not sure, start with Q4_K_M. It is the default for a reason. The accuracy loss is small, the tooling is mature, and the gains from switching to something more exotic are usually under 5% on throughput or memory. Not enough to justify the operational complexity for a solo operator.
- If your bottleneck is the KV cache rather than the weights, which it will be any time your context runs past 16K on a 7 to 14B model, spend your optimization effort on KV quantization rather than weight quantization. You get more memory back for less accuracy cost.
- If you need end-to-end 4-bit arithmetic (meaning you are running on hardware with INT4 tensor cores and you want to use them), you need a rotation-based method. Weight-only 4-bit quantization will not deliver the tensor-core speedup, because the kernel has to dequantize the weights back to 16-bit before the matmul, so the arithmetic itself still runs at 16 bits.
- If you are doing research or model compression work rather than serving production, the rotation methods (QuaRot, SpinQuant, TurboQuant) and EfficientQAT are the current frontier to engage with. The 2025 QAT follow-ons (ZeroQAT, RCP, StableQAT) are further out. Read the papers; do not build production on them yet.
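The KV-cache heuristic above is easy to check for your own model. The sketch below assumes the cache stores one key and one value vector per layer, per KV head, per token. The function name `kv_cache_gib` and the 7B-class shape are hypothetical and chosen for illustration, not taken from any specific model card.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Approximate KV cache size in GiB for one sequence.

    The leading 2 counts keys and values. bytes_per_value is 2.0 for fp16,
    1.0 for an idealized 8-bit cache, 0.5 for an idealized 4-bit cache
    (per-block scales are ignored here, which is an assumption).
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Hypothetical 7B-class shape with full multi-head attention (assumed, not measured)
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for label, bpv in (("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
    size = kv_cache_gib(context_tokens=16_384, bytes_per_value=bpv, **shape)
    print(f"{label} KV cache at 16K tokens: {size:.1f} GiB")
```

With these assumed numbers the fp16 cache at 16K tokens comes to 8 GiB, more than twice the roughly 3.5 GB that 7B parameters occupy at 4 bits. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache, so plug in your own model's configuration before deciding where to spend effort.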
24.8 What the frontier looks like next
The rotation family is not finished. Three lines of active work as of April 2026, each at a different research stage:
PolarQuant and related vector-quantization methods. The same Google Research group behind TurboQuant has follow-on work (PolarQuant, accepted to AISTATS 2026, and earlier Quantized Johnson-Lindenstrauss work) that extends the online vector-quantization idea to matrix-valued streams and to higher-precision regimes. The technical thread is the same: design the quantizer around the geometry of the rotated distribution and around what attention needs to compute. The practical payoff, if and when these reach mainline serving stacks, will be more aggressive KV-cache compression (2-bit or below) without accuracy loss. For the paper-summary task, that would stretch the cache budget far enough that a 16 GB card could hold an entire short book in context at once.
Mixture-of-experts specific methods. MoE models (DeepSeek-V3, Llama 4, Qwen 3.5 MoE) have a quantization problem dense models do not. Different experts see different data distributions during training, so a single calibration set does not characterize all of them cleanly. Expert-specific calibration, dynamic per-expert bit allocation, and routed quantization are the active research directions. No clean operator story yet. The field is still debating what the right abstraction is.
Extreme low-bit with activation awareness. Most of the methods in this chapter quantize weights aggressively and activations less aggressively. Pushing activations down to 4 bits or lower, which is what end-to-end low-bit serving would require, runs back into the activation outlier problem that was the origin of this whole chapter. Several 2025 papers are working on this with mixed results: QServe’s W4A8 settles for 8-bit activations, while others are exploring W4A4 with more elaborate rotation schemes. Watch the space; do not deploy on it.
There is a deeper thing to admit here. We do not fully understand why a random orthogonal rotation is usually enough. The empirical observation is that in high dimensions, rotating a vector by a random orthogonal matrix smears its outliers evenly. The theoretical story traces to the Johnson-Lindenstrauss lemma and the geometry of high-dimensional bell curves. But the leap from “a random rotation tames outliers in theory” to “this specific rotation of this specific weight matrix recovers this many perplexity points on this specific model” is still an empirical science, not a predictive theory. The field is in the middle of working out why some rotations are better than others on specific models, and every new rotation method is, partly, a guess that turns out to work.
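The empirical observation is easy to reproduce at small scale. The sketch below is a toy illustration, not any paper's method. It builds a synthetic vector with three planted outliers, draws a random orthogonal matrix from the QR decomposition of a Gaussian matrix, and compares the peak-to-RMS ratio before and after rotation. The vector, the outlier values, and the helper `peak_to_rms` are all made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Synthetic vector: unit-variance entries plus a few planted outliers (assumed values)
x = rng.normal(0.0, 1.0, size=d)
x[[3, 200, 777]] = [60.0, -45.0, 80.0]

# QR of a Gaussian matrix gives an orthogonal factor q; good enough as a
# "random rotation" for illustration purposes
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
y = q @ x


def peak_to_rms(v):
    """Largest absolute entry divided by the root-mean-square entry."""
    return np.max(np.abs(v)) / np.sqrt(np.mean(v**2))


print("peak/RMS before rotation:", peak_to_rms(x))
print("peak/RMS after rotation: ", peak_to_rms(y))
print("norm preserved:", np.isclose(np.linalg.norm(x), np.linalg.norm(y)))
```

The rotation preserves the vector's length, so no information is lost. The same energy is simply spread across all coordinates, which is what lets a coarse quantization grid cover the range without wasting levels on a handful of extreme values.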
For a solo operator or small team in mid-2026, the practical takeaway is that the quantization ground under your feet is still moving. The matrix above is the current state. It will be revised in eighteen months, because one of the frontier methods above will graduate into mainline llama.cpp or vLLM and change what the default should be. Part of the reason this chapter is structured around the rotation intuition rather than the specific methods is that the intuition is likely to outlast any particular paper. When the next TurboQuant arrives, you will recognize it, because it will be answering the same three questions (what rotation, what to quantize, what system support) in a new way.
24.9 Three things to carry forward
- The rotation idea is where low-bit quantization lives. Below 4 bits, nearly every serious post-training method is a rotation followed by a quantizer followed by an inverse rotation; EfficientQAT, which trains the model with quantization in the loop instead, is the exception in this chapter. Whether the rotation is fixed (QuaRot), learned (SpinQuant), or random-per-vector (TurboQuant) is a choice of recipe, not a choice of paradigm.
- KV cache quantization is the next lever for long-context local inference. Once weights are at 4 bits, the cache is the bottleneck. TurboQuant is the research answer. Mainline llama.cpp Q8_0 KV is the practical answer today. Expect this to keep moving.
- The decision matrix in Section 24.7 is the operator-facing artifact. When someone asks “which quantization,” that table is your answer. It is the chapter’s load-bearing contribution because no existing paper summarizes it, and because picking the wrong method wastes real hardware capacity for months before anyone notices.
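The first point, rotation then quantizer then inverse rotation, fits in a few lines of NumPy. This is a minimal sketch under loud assumptions. A random orthogonal matrix stands in for the rotation, where QuaRot uses a Hadamard matrix and SpinQuant a learned one. The quantizer is a deliberately crude per-tensor round-to-nearest onto 4-bit levels, and the weight matrix is synthetic with a few planted outliers. The names `quantize_int4` and `rel_err` are invented for the example.

```python
import numpy as np


def quantize_int4(w):
    """Symmetric per-tensor round-to-nearest onto levels -7..7, returned dequantized."""
    scale = np.max(np.abs(w)) / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale


def rel_err(approx, ref):
    """Relative Frobenius-norm error of approx against ref."""
    return np.linalg.norm(approx - ref) / np.linalg.norm(ref)


rng = np.random.default_rng(0)
d = 512

# Synthetic weights: unit-variance entries plus ten planted outliers (assumed values)
w = rng.normal(size=(d, d))
rows, cols = rng.integers(0, d, 10), rng.integers(0, d, 10)
w[rows, cols] = 50.0 * rng.choice([-1.0, 1.0], size=10)
x = rng.normal(size=(8, d))

# Random orthogonal matrix standing in for the rotation
q, _ = np.linalg.qr(rng.normal(size=(d, d)))

ref = x @ w
# Without quantization the rotation cancels: (x q)(q^T w) equals x w
print("rotation cancels:", np.allclose((x @ q) @ (q.T @ w), ref))

plain = x @ quantize_int4(w)                 # quantize the weights directly
rotated = (x @ q) @ quantize_int4(q.T @ w)   # rotate, quantize, undo at matmul time

print("relative error, plain:  ", rel_err(plain, ref))
print("relative error, rotated:", rel_err(rotated, ref))
```

In the plain case the outliers set the quantization step, so the ordinary weights all round to zero. After rotation the step is much finer and the output error drops. The size of the gap here depends on the synthetic setup and says nothing about perplexity on a real model.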