Summary
Quantization compresses LLM weights from 16-bit floats down to 4-bit integers, cutting VRAM usage by 70% or more with minimal quality loss. This guide covers how quantization works, what each level costs you in quality, the math for predicting VRAM usage, and why a model that spills from GPU memory into system RAM can run 5-30x slower. All figures are drawn from the published third-party benchmarks listed under Sources.
What Is Quantization?
Every large language model is a collection of numerical weights. At full precision (FP16), each weight occupies 16 bits of memory. A 14-billion parameter model at FP16 needs roughly 28 GB just for the weights, plus additional memory for the KV cache during inference.
Quantization reduces the number of bits per weight. Instead of storing each value as a 16-bit float, you can approximate it as an 8-bit, 4-bit, or even 3-bit integer with a shared scale factor per block of weights. The model uses less memory, runs faster (less data to move), and loses a small amount of precision.
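The shared-scale scheme described above can be sketched in a few lines. This is a minimal illustration with a toy 8-weight block, not the actual GGUF layout (real blocks are typically 32 weights, and k-quants use 256-weight super-blocks with extra scale/min metadata):

```python
def quantize_block(weights, bits=4):
    # One shared float scale per block; each weight becomes a small
    # signed integer in [-qmax, qmax].
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    # Recover approximate floats: integer times the shared scale.
    return [v * scale for v in q]

block = [0.31, -1.12, 0.05, 2.04, -0.66, 0.92, -0.18, 1.47]  # toy block
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Storing eight 4-bit integers plus one 16-bit scale costs 48 bits instead of 128 at FP16, and the worst-case rounding error per weight is half the scale step.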
GGUF is the standard quantization format for local inference via llama.cpp and Ollama. It supports CPU, GPU, and mixed CPU+GPU execution on any hardware. Other formats (AWQ, GPTQ, EXL2) exist but are more specialized.
Key Insight
Going from FP16 to Q4_K_M reduces memory by about 72%. A 14B model drops from ~28 GB to ~8 GB. For most users, this is the difference between "needs a datacenter GPU" and "runs on a gaming card."
Quantization Levels Compared
Not all quantization levels are equal. Higher bit counts preserve more quality but use more memory. The table below summarizes the common GGUF quantization levels with approximate quality retention based on perplexity benchmarks from llama.cpp testing and third-party evaluations.
| Level | Bits/Weight | VRAM vs FP16 | Quality Retention | Best For |
|---|---|---|---|---|
| FP16 | 16 | 100% | 100% | Reference baseline, fine-tuning |
| Q8_0 | 8 | ~50% | ~99% | Maximum local quality |
| Q6_K | 6 | ~38% | ~97% | High quality, more models fit |
| Q5_K_M | 5 | ~31% | ~95% | Recommended for constrained hardware |
| Q4_K_M | 4 | ~28% | ~92% | Mainstream sweet spot |
| IQ4_XS | ~4 (importance) | ~25% | ~95% | Newer method, better quality per bit |
| Q3_K_S | 3 | ~22% | ~88% | Emergency fit, noticeable degradation |
Q4_K_M is the default recommendation for most users. It cuts memory to about 28% of FP16 while retaining roughly 92% of quality as measured by perplexity. (The ratio is ~28% rather than an exact 25% because block scale metadata pushes the effective cost slightly above 4 bits per weight.) The "K" variants use k-quant grouping, which preserves outlier weights better than naive uniform quantization. The "M" suffix indicates medium granularity, a balance between the more aggressive "S" and the larger "L" variants.
IQ4_XS is a newer importance-weighted approach that achieves better quality per bit by allocating more precision to weights that matter most. The tradeoff is slightly slower inference on some backends due to the non-uniform bit packing.
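To see why importance weighting helps, here is a toy mixed-precision experiment. It is not the actual IQ4_XS algorithm (which uses non-uniform codebooks driven by an importance matrix); it just shows the underlying idea that spending extra bits on the weights that matter most lowers the average error:

```python
import random

def quant_error(w, bits):
    # Mean absolute error of uniform symmetric quantization at `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax
    return sum(abs(x - round(x / scale) * scale) for x in w) / len(w)

def mixed_precision_error(w, importance, hi_frac=0.25):
    # Hypothetical scheme: the top `hi_frac` weights by importance get
    # 8 bits, the rest get 4 (average ~5 bits/weight at hi_frac=0.25).
    ranked = sorted(range(len(w)), key=lambda i: importance[i], reverse=True)
    n_hi = max(1, int(len(w) * hi_frac))
    hi = [w[i] for i in ranked[:n_hi]]
    lo = [w[i] for i in ranked[n_hi:]]
    return (quant_error(hi, 8) * len(hi) + quant_error(lo, 4) * len(lo)) / len(w)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(512)]
imp = [abs(x) for x in w]   # toy proxy: large-magnitude weights matter more
err_mixed = mixed_precision_error(w, imp)
err_plain = quant_error(w, 4)
```

With the outliers handled at higher precision, the remaining 4-bit group also gets a tighter scale, so the mixed scheme beats plain 4-bit on both fronts.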
The VRAM Math
You can estimate VRAM usage with a simple formula:
VRAM ≈ (Parameters × Bits per Weight / 8) + Context Overhead

Context overhead depends on the context window size and model architecture, but typically adds 1-3 GB at standard context lengths (4K-8K tokens). Larger contexts (32K+) can add significantly more. For a 14B model:
| Quantization | Weight Size | + Context (~4K) | Total VRAM |
|---|---|---|---|
| FP16 | 28 GB | +2 GB | ~30 GB |
| Q8_0 | 14 GB | +2 GB | ~16 GB |
| Q4_K_M | 7 GB | +2 GB | ~9 GB |
| Q3_K_S | 5.25 GB | +2 GB | ~7.25 GB |
A practical rule of thumb: pick a GGUF quant whose file size is 1-2 GB smaller than your total VRAM. This leaves room for the KV cache and avoids triggering the performance cliff described in the next section.
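The formula is easy to script. The helper below uses nominal bit widths and treats GB loosely, so real K-quant files will land slightly higher (block scales add overhead), but it reproduces the table above:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     context_overhead_gb: float = 2.0) -> float:
    # VRAM ≈ params × bits / 8 + context overhead, with params in
    # billions and everything in GB. Nominal bit widths only.
    return params_b * bits_per_weight / 8 + context_overhead_gb

# Reproduce the 14B rows from the table above
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q3_K_S", 3)]:
    print(f"{name:7s} ~{estimate_vram_gb(14, bits):.2f} GB")
```

Plugging in 14B at 16 bits gives the ~30 GB FP16 row; at 4 bits the same model needs ~9 GB, which is the "datacenter card vs gaming card" gap from the summary.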
The RAM Performance Cliff
When a model is too large to fit entirely in GPU memory, llama.cpp and Ollama can "offload" some layers to system RAM. The model still runs, but performance drops dramatically.
| Configuration | Layers in VRAM | Tokens/sec | Slowdown |
|---|---|---|---|
| 8B model, full VRAM | 36/36 | ~40 t/s | 1x (baseline) |
| 8B model, partial offload | 25/36 | ~8 t/s | 5x slower |
| 70B model, heavy offload | ~30% | ~3 t/s | 15x slower |
| 70B model, full RAM spill | 0 | 1-2 t/s | 20-40x slower |
The reason is straightforward: during token generation, the model must read every layer sequentially. Layers in VRAM are accessed at GPU memory bandwidth (hundreds of GB/s). Layers in system RAM must cross the PCIe bus (~32 GB/s for PCIe 4.0 x16, often less in practice). Each offloaded layer adds a PCIe round-trip to every single token generated.
The degradation is not gradual. Even offloading a few layers causes a significant drop because the PCIe bottleneck gates the entire pipeline. The GPU sits idle waiting for data from RAM on every token.
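A rough bandwidth-bound model makes the cliff concrete. The bandwidth figures below are illustrative assumptions (a ~900 GB/s GPU and ~25 GB/s effective PCIe), and the model ignores compute time entirely, so the absolute numbers are upper bounds; the ratio is the point:

```python
def tokens_per_sec(model_gb: float, vram_frac: float,
                   gpu_bw: float = 900.0, pcie_bw: float = 25.0) -> float:
    # Each generated token streams every weight once. Resident weights
    # read at GPU memory bandwidth; offloaded weights cross PCIe on
    # every token. Bandwidths are assumed values in GB/s, not benchmarks.
    t_gpu = model_gb * vram_frac / gpu_bw
    t_pcie = model_gb * (1.0 - vram_frac) / pcie_bw
    return 1.0 / (t_gpu + t_pcie)

full = tokens_per_sec(7.0, 1.0)      # ~8B model at Q4, fully resident
partial = tokens_per_sec(7.0, 0.7)   # 30% of the weights in system RAM
slowdown = full / partial
```

Even with only 30% of a 7 GB model offloaded, the modeled slowdown is already around an order of magnitude, consistent with the measured drops in the table above.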
Critical Takeaway
A smaller model that fits entirely in VRAM will almost always outperform a larger model that spills into RAM. Choose the largest model that fits, not the largest model available. A 9B at Q4 running fully in VRAM at 23 t/s will feel dramatically faster than a 14B at Q4 that crawls at 3 t/s because it spilled 2 GB into RAM.
MoE Models Change the Math
Mixture-of-Experts (MoE) models like GPT-OSS:20b and qwen3-coder have a different relationship between parameter count and VRAM. GPT-OSS:20b has 21 billion total parameters but only activates 3.6 billion per token. qwen3-coder has 30 billion total but activates only 3.3 billion.
The catch: all parameters must be loaded into memory, even the inactive experts. VRAM usage scales with total parameter count, not active parameters. But compute load scales with active parameters, so MoE models generate tokens faster relative to their total size.
| Model | Total Params | Active Params | VRAM (native) | Relative Speed |
|---|---|---|---|---|
| GPT-OSS:20b | 21B | 3.6B | ~16 GB (MXFP4) | Fast (small active set) |
| qwen3-coder | 30B | 3.3B | ~17 GB (Q4) | Fast (small active set) |
| Devstral (dense) | 24B | 24B | ~14 GB (Q4) | Moderate (all params active) |
| qwen3:14b (dense) | 14B | 14B | ~9 GB (Q4) | Moderate (all params active) |
MoE models offer high quality relative to their VRAM cost: the router activates different experts for different tokens, so the full parameter count contributes to quality even though only a few experts fire on any single token. They represent some of the best value for users with 16 GB cards.
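The total-vs-active asymmetry fits in one helper. This is a back-of-envelope sketch using the usual decode approximation of ~2 FLOPs per active weight per token:

```python
def moe_speed_advantage(total_params_b: float, active_params_b: float) -> float:
    # Decode FLOPs scale with *active* parameters (~2 FLOPs per active
    # weight per token), while VRAM scales with *total* parameters.
    # Returns how many times fewer FLOPs/token the MoE spends than a
    # dense model of the same total size.
    return total_params_b / active_params_b

# GPT-OSS:20b: 21B total, 3.6B active -> decodes like a ~3.6B dense model
adv = moe_speed_advantage(21.0, 3.6)
```

So GPT-OSS:20b pays the memory bill of a ~21B model but generates tokens with the compute budget of a model roughly a sixth that size.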
Practical Recommendations
- Check your VRAM. Run ollama ps or check your GPU specs. This is the hard constraint everything else depends on.
- Pick the largest model that fits at Q4_K_M with ~2 GB of headroom. The headroom covers the KV cache and prevents VRAM pressure from other applications. See our VRAM model chart for specific models.
- If choosing between sizes, prefer Q4 of the larger model over Q8 of the smaller. A 14B at Q4 (92% quality, ~9 GB) will outperform a 9B at Q8 (99% quality, ~12 GB) on most tasks because the extra 5 billion parameters matter more than the marginal quality from higher precision.
- Never split across VRAM and RAM unless you have no choice. The performance cliff is severe. 3 t/s is not usable for interactive work.
- Consider MoE models. GPT-OSS:20b and qwen3-coder pack more capability per GB of VRAM than dense models of similar quality.
Format Comparison
GGUF is the most common format for local inference, but other options exist with different tradeoffs.
| Format | Backend | CPU+GPU | Best For |
|---|---|---|---|
| GGUF | llama.cpp / Ollama | Yes | Universal, any hardware |
| AWQ | vLLM, TGI | GPU only | NVIDIA serving, fast inference |
| GPTQ | vLLM, AutoGPTQ | GPU only | NVIDIA, older but widely supported |
| EXL2 | ExLlamaV2 | GPU only | Fastest NVIDIA inference |
| MXFP4 | Native (GPT-OSS) | Varies | Hardware-optimized, model-specific |
For AMD GPUs (including RDNA4 with Vulkan), GGUF via Ollama is the only well-supported option. AWQ, GPTQ, and EXL2 are NVIDIA-focused. MXFP4 is a native quantization format used by models like GPT-OSS that are distributed pre-quantized by their creators.
Sources
- llama.cpp quantization benchmarks and perplexity measurements (ggml-org/llama.cpp GitHub discussions)
- "GGUF Quantization: Quality vs Speed on Consumer GPUs" (dasroot.net, February 2026)
- "GGUF Quantization Explained" (willitrunai.com)
- "The Complete Guide to LLM Quantization" (localllm.in)
- "Running Open Source LLMs Locally: Hardware Guide 2026" (apatero.com)
- bartowski GGUF quantization repositories (Hugging Face)