Summary
Quantization compresses LLM weights from 16-bit floats down to 4-bit integers, cutting VRAM usage by 70% or more with minimal quality loss. This guide covers how quantization works, what each level costs you in quality, the math for predicting VRAM usage, and why a model that spills from GPU memory into system RAM can run 5-30x slower. All figures are drawn from the published third-party benchmarks listed under Sources.
What Is Quantization?
Every large language model is a collection of numerical weights. At full precision (FP16), each weight occupies 16 bits of memory. A 14-billion parameter model at FP16 needs roughly 28 GB just for the weights, plus additional memory for the KV cache during inference.
Quantization reduces the number of bits per weight. Instead of storing each value as a 16-bit float, you can approximate it as an 8-bit, 4-bit, or even 3-bit integer with a shared scale factor per block of weights. The model uses less memory, runs faster (less data to move), and loses a small amount of precision.
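The shared-scale scheme described above can be sketched in a few lines. This is a minimal illustration with a toy 8-weight block, not the actual GGUF layout (real blocks are typically 32 weights, and k-quants use 256-weight super-blocks with extra scale/min metadata):

```python
def quantize_block(weights, bits=4):
    # One shared float scale per block; each weight becomes a small
    # signed integer in [-qmax, qmax].
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    # Recover approximate floats: integer times the shared scale.
    return [v * scale for v in q]

block = [0.31, -1.12, 0.05, 2.04, -0.66, 0.92, -0.18, 1.47]  # toy block
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Storing eight 4-bit integers plus one 16-bit scale costs 48 bits instead of 128 at FP16, and the worst-case rounding error per weight is half the scale step.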
GGUF is the standard quantization format for local inference via llama.cpp and Ollama. It supports CPU, GPU, and mixed CPU+GPU execution on any hardware. Other formats (AWQ, GPTQ, EXL2) exist but are more specialized.
Key Insight
Going from FP16 to Q4_K_M reduces memory by about 72%. A 14B model drops from ~28 GB to ~8 GB. For most users, this is the difference between "needs a datacenter GPU" and "runs on a gaming card."
Quantization Levels Compared
Not all quantization levels are equal. Higher bit counts preserve more quality but use more memory. The table below summarizes the common GGUF quantization levels with approximate quality retention based on perplexity benchmarks from llama.cpp testing and third-party evaluations.
| Level | Bits/Weight | VRAM vs FP16 | Quality Retention | Best For |
|---|---|---|---|---|
| FP16 | 16 | 100% | 100% | Reference baseline, fine-tuning |
| Q8_0 | 8 | ~50% | ~99% | Maximum local quality |
| Q6_K | 6 | ~38% | ~97% | High quality, more models fit |
| Q5_K_M | 5 | ~31% | ~95% | Recommended for constrained hardware |
| Q4_K_M | 4 | ~28% | ~92% | Mainstream sweet spot |
| IQ4_XS | ~4 (importance) | ~25% | ~95% | Newer method, better quality per bit |
| Q3_K_S | 3 | ~22% | ~88% | Emergency fit, noticeable degradation |
Q4_K_M is the default recommendation for most users. It cuts memory to about 28% of FP16 while retaining roughly 92% of quality as measured by perplexity. (The ratio is ~28% rather than an exact 25% because block scale metadata pushes the effective cost slightly above 4 bits per weight.) The "K" variants use k-quant grouping, which preserves outlier weights better than naive uniform quantization. The "M" suffix indicates medium granularity, a balance between the more aggressive "S" and the larger "L" variants.
IQ4_XS is a newer importance-weighted approach that achieves better quality per bit by allocating more precision to weights that matter most. The tradeoff is slightly slower inference on some backends due to the non-uniform bit packing.
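To see why importance weighting helps, here is a toy mixed-precision experiment. It is not the actual IQ4_XS algorithm (which uses non-uniform codebooks driven by an importance matrix); it just shows the underlying idea that spending extra bits on the weights that matter most lowers the average error:

```python
import random

def quant_error(w, bits):
    # Mean absolute error of uniform symmetric quantization at `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax
    return sum(abs(x - round(x / scale) * scale) for x in w) / len(w)

def mixed_precision_error(w, importance, hi_frac=0.25):
    # Hypothetical scheme: the top `hi_frac` weights by importance get
    # 8 bits, the rest get 4 (average ~5 bits/weight at hi_frac=0.25).
    ranked = sorted(range(len(w)), key=lambda i: importance[i], reverse=True)
    n_hi = max(1, int(len(w) * hi_frac))
    hi = [w[i] for i in ranked[:n_hi]]
    lo = [w[i] for i in ranked[n_hi:]]
    return (quant_error(hi, 8) * len(hi) + quant_error(lo, 4) * len(lo)) / len(w)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(512)]
imp = [abs(x) for x in w]   # toy proxy: large-magnitude weights matter more
err_mixed = mixed_precision_error(w, imp)
err_plain = quant_error(w, 4)
```

With the outliers handled at higher precision, the remaining 4-bit group also gets a tighter scale, so the mixed scheme beats plain 4-bit on both fronts.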
The VRAM Math
You can estimate VRAM usage with a simple formula:
VRAM ≈ (Parameters × Bits per Weight / 8) + Context Overhead

Context overhead depends on the context window size and model architecture, but typically adds 1-3 GB at standard context lengths (4K-8K tokens). Larger contexts (32K+) can add significantly more. For a 14B model:
| Quantization | Weight Size | + Context (~4K) | Total VRAM |
|---|---|---|---|
| FP16 | 28 GB | +2 GB | ~30 GB |
| Q8_0 | 14 GB | +2 GB | ~16 GB |
| Q4_K_M | 7 GB | +2 GB | ~9 GB |
| Q3_K_S | 5.25 GB | +2 GB | ~7.25 GB |
A practical rule of thumb: pick a GGUF quant whose file size is 1-2 GB smaller than your total VRAM. This leaves room for the KV cache and avoids triggering the performance cliff described in the next section.
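The formula is easy to script. The helper below uses nominal bit widths and treats GB loosely, so real K-quant files will land slightly higher (block scales add overhead), but it reproduces the table above:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     context_overhead_gb: float = 2.0) -> float:
    # VRAM ≈ params × bits / 8 + context overhead, with params in
    # billions and everything in GB. Nominal bit widths only.
    return params_b * bits_per_weight / 8 + context_overhead_gb

# Reproduce the 14B rows from the table above
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q3_K_S", 3)]:
    print(f"{name:7s} ~{estimate_vram_gb(14, bits):.2f} GB")
```

Plugging in 14B at 16 bits gives the ~30 GB FP16 row; at 4 bits the same model needs ~9 GB, which is the "datacenter card vs gaming card" gap from the summary.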
The RAM Performance Cliff
When a model is too large to fit entirely in GPU memory, llama.cpp and Ollama can "offload" some layers to system RAM. The model still runs, but performance drops dramatically.
| Configuration | Layers in VRAM | Tokens/sec | Slowdown |
|---|---|---|---|
| 8B model, full VRAM | 36/36 | ~40 t/s | 1x (baseline) |
| 8B model, partial offload | 25/36 | ~8 t/s | 5x slower |
| 70B model, heavy offload | ~30% | ~3 t/s | 15x slower |
| 70B model, full RAM spill | 0 | 1-2 t/s | 20-40x slower |
The reason is straightforward: during token generation, the model must read every layer sequentially. Layers in VRAM are accessed at GPU memory bandwidth (hundreds of GB/s). Layers in system RAM must cross the PCIe bus (~32 GB/s for PCIe 4.0 x16, often less in practice). Each offloaded layer adds a PCIe round-trip to every single token generated.
The degradation is not gradual. Even offloading a few layers causes a significant drop because the PCIe bottleneck gates the entire pipeline. The GPU sits idle waiting for data from RAM on every token.
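A rough bandwidth-bound model makes the cliff concrete. The bandwidth figures below are illustrative assumptions (a ~900 GB/s GPU and ~25 GB/s effective PCIe), and the model ignores compute time entirely, so the absolute numbers are upper bounds; the ratio is the point:

```python
def tokens_per_sec(model_gb: float, vram_frac: float,
                   gpu_bw: float = 900.0, pcie_bw: float = 25.0) -> float:
    # Each generated token streams every weight once. Resident weights
    # read at GPU memory bandwidth; offloaded weights cross PCIe on
    # every token. Bandwidths are assumed values in GB/s, not benchmarks.
    t_gpu = model_gb * vram_frac / gpu_bw
    t_pcie = model_gb * (1.0 - vram_frac) / pcie_bw
    return 1.0 / (t_gpu + t_pcie)

full = tokens_per_sec(7.0, 1.0)      # ~8B model at Q4, fully resident
partial = tokens_per_sec(7.0, 0.7)   # 30% of the weights in system RAM
slowdown = full / partial
```

Even with only 30% of a 7 GB model offloaded, the modeled slowdown is already around an order of magnitude, consistent with the measured drops in the table above.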
Critical Takeaway
A smaller model that fits entirely in VRAM will almost always outperform a larger model that spills into RAM. Choose the largest model that fits, not the largest model available. A 9B at Q4 running fully in VRAM at 23 t/s will feel dramatically faster than a 14B at Q4 that crawls at 3 t/s because it spilled 2 GB into RAM.
MoE Models Change the Math
Mixture-of-Experts (MoE) models like GPT-OSS:20b and qwen3-coder have a different relationship between parameter count and VRAM. GPT-OSS:20b has 21 billion total parameters but only activates 3.6 billion per token. qwen3-coder has 30 billion total but activates only 3.3 billion.
The catch: all parameters must be loaded into memory, even the inactive experts. VRAM usage scales with total parameter count, not active parameters. But compute load scales with active parameters, so MoE models generate tokens faster relative to their total size.
| Model | Total Params | Active Params | VRAM (native) | Relative Speed |
|---|---|---|---|---|
| GPT-OSS:20b | 21B | 3.6B | ~16 GB (MXFP4) | Fast (small active set) |
| qwen3-coder | 30B | 3.3B | ~17 GB (Q4) | Fast (small active set) |
| Devstral (dense) | 24B | 24B | ~14 GB (Q4) | Moderate (all params active) |
| qwen3:14b (dense) | 14B | 14B | ~9 GB (Q4) | Moderate (all params active) |
MoE models offer high quality relative to their VRAM cost: the router activates different experts for different tokens, so the full parameter count contributes to quality even though only a few experts fire on any single token. They represent some of the best value for users with 16 GB cards.
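The total-vs-active asymmetry fits in one helper. This is a back-of-envelope sketch using the usual decode approximation of ~2 FLOPs per active weight per token:

```python
def moe_speed_advantage(total_params_b: float, active_params_b: float) -> float:
    # Decode FLOPs scale with *active* parameters (~2 FLOPs per active
    # weight per token), while VRAM scales with *total* parameters.
    # Returns how many times fewer FLOPs/token the MoE spends than a
    # dense model of the same total size.
    return total_params_b / active_params_b

# GPT-OSS:20b: 21B total, 3.6B active -> decodes like a ~3.6B dense model
adv = moe_speed_advantage(21.0, 3.6)
```

So GPT-OSS:20b pays the memory bill of a ~21B model but generates tokens with the compute budget of a model roughly a sixth that size.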
Practical Recommendations
- Check your VRAM. Run ollama ps or check your GPU specs. This is the hard constraint everything else depends on.
- Pick the largest model that fits at Q4_K_M with ~2 GB of headroom. The headroom covers the KV cache and prevents VRAM pressure from other applications. See our VRAM model chart for specific models.
- If choosing between sizes, prefer Q4 of the larger model over Q8 of the smaller. A 14B at Q4 (92% quality, ~9 GB) will outperform a 9B at Q8 (99% quality, ~12 GB) on most tasks because the extra 5 billion parameters matter more than the marginal quality from higher precision.
- Never split across VRAM and RAM unless you have no choice. The performance cliff is severe. 3 t/s is not usable for interactive work.
- Consider MoE models. GPT-OSS:20b and qwen3-coder pack more capability per GB of VRAM than dense models of similar quality.
Format Comparison
GGUF is the most common format for local inference, but other options exist with different tradeoffs.
| Format | Backend | CPU+GPU | Best For |
|---|---|---|---|
| GGUF | llama.cpp / Ollama | Yes | Universal, any hardware |
| AWQ | vLLM, TGI | GPU only | NVIDIA serving, fast inference |
| GPTQ | vLLM, AutoGPTQ | GPU only | NVIDIA, older but widely supported |
| EXL2 | ExLlamaV2 | GPU only | Fastest NVIDIA inference |
| MXFP4 | Native (GPT-OSS) | Varies | Hardware-optimized, model-specific |
For AMD GPUs (including RDNA4 with Vulkan), GGUF via Ollama is the only well-supported option. AWQ, GPTQ, and EXL2 are NVIDIA-focused. MXFP4 is a native quantization format used by models like GPT-OSS that are distributed pre-quantized by their creators.
Sources
- llama.cpp quantization benchmarks and perplexity measurements (ggml-org/llama.cpp GitHub discussions)
- "GGUF Quantization: Quality vs Speed on Consumer GPUs" (dasroot.net, February 2026)
- "GGUF Quantization Explained" (willitrunai.com)
- "The Complete Guide to LLM Quantization" (localllm.in)
- "Running Open Source LLMs Locally: Hardware Guide 2026" (apatero.com)
- bartowski GGUF quantization repositories (Hugging Face)