
Quantization Explained: VRAM, Quality, and the RAM Performance Cliff

Kevin · 8 min read

Summary

Quantization compresses LLM weights from 16-bit floats down to 4-bit integers, cutting VRAM usage by 70%+ with minimal quality loss. This guide covers how quantization works, what each level costs you in quality, the math for predicting VRAM usage, and why running a model that spills from GPU memory into system RAM can be 5-30x slower. All data referenced from published third-party benchmarks.

What Is Quantization?

Every large language model is a collection of numerical weights. At full precision (FP16), each weight occupies 16 bits of memory. A 14-billion parameter model at FP16 needs roughly 28 GB just for the weights, plus additional memory for the KV cache during inference.

Quantization reduces the number of bits per weight. Instead of storing each value as a 16-bit float, you can approximate it as an 8-bit, 4-bit, or even 3-bit integer with a shared scale factor per block of weights. The model uses less memory, runs faster (less data to move), and loses a small amount of precision.
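The shared-scale idea can be sketched in a few lines. This is a minimal symmetric 4-bit scheme, not the actual GGUF algorithm (real k-quants add super-blocks, per-group minimums, and outlier handling):

```python
import numpy as np

def quantize_block_q4(block: np.ndarray):
    """Quantize one block of weights to 4-bit ints plus one shared scale.

    Simplified sketch: one float scale per block, symmetric range -7..7.
    """
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        return 0.0, np.zeros_like(block, dtype=np.int8)
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return scale, q

def dequantize_block(scale: float, q: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from the scale and 4-bit ints."""
    return scale * q.astype(np.float32)

# 32 weights -> 32 x 4 bits + one 16-bit scale: ~4.5 bits/weight effective
weights = np.random.randn(32).astype(np.float32)
scale, q = quantize_block_q4(weights)
restored = dequantize_block(scale, q)
max_err = float(np.abs(weights - restored).max())  # bounded by scale / 2
```

The rounding error per weight is at most half the scale, which is why bigger blocks (or weights with large outliers) quantize worse: one extreme value inflates the scale for the whole block.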

GGUF is the standard quantization format for local inference via llama.cpp and Ollama. It supports CPU, GPU, and mixed CPU+GPU execution on any hardware. Other formats (AWQ, GPTQ, EXL2) exist but are more specialized.

Key Insight

Going from FP16 to Q4_K_M reduces memory by about 72%. A 14B model drops from ~28 GB to ~8 GB. For most users, this is the difference between "needs a datacenter GPU" and "runs on a gaming card."

Quantization Levels Compared

Not all quantization levels are equal. Higher bit counts preserve more quality but use more memory. The table below summarizes the common GGUF quantization levels with approximate quality retention based on perplexity benchmarks from llama.cpp testing and third-party evaluations.

Table 1. GGUF quantization levels (approximate, model-dependent)
| Level | Bits/Weight | VRAM vs FP16 | Quality Retention | Best For |
|---|---|---|---|---|
| FP16 | 16 | 100% | 100% | Reference baseline, fine-tuning |
| Q8_0 | 8 | ~50% | ~99% | Maximum local quality |
| Q6_K | 6 | ~38% | ~97% | High quality, more models fit |
| Q5_K_M | 5 | ~31% | ~95% | Recommended for constrained hardware |
| Q4_K_M | 4 | ~28% | ~92% | Mainstream sweet spot |
| IQ4_XS | ~4 (importance) | ~25% | ~95% | Newer method, better quality per bit |
| Q3_K_S | 3 | ~22% | ~88% | Emergency fit, noticeable degradation |

Q4_K_M is the default recommendation for most users. It cuts memory to about 28% of FP16 while retaining roughly 92% of quality as measured by perplexity. The "K" variants use k-quant grouping, which preserves outlier weights better than naive uniform quantization. The "M" suffix indicates medium granularity, a balance between the more aggressive "S" and the larger "L" variants.

IQ4_XS is a newer importance-weighted approach that achieves better quality per bit by allocating more precision to weights that matter most. The tradeoff is slightly slower inference on some backends due to the non-uniform bit packing.

The VRAM Math

You can estimate VRAM usage with a simple formula:

VRAM ≈ (Parameters × Bits per Weight / 8) + Context Overhead

Context overhead depends on the context window size and model architecture, but typically adds 1-3 GB for standard context lengths (4K-8K tokens). Larger contexts (32K+) can add significantly more.
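The formula translates directly into a rough estimator. Context overhead is passed in as a flat figure here, since it varies by architecture and context length; results are ballpark values, not guarantees:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     context_overhead_gb: float = 2.0) -> float:
    """VRAM estimate: weight bytes + KV-cache/context overhead.

    Uses decimal GB (1e9 bytes) and ignores per-backend buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + context_overhead_gb

# Reproduce the 14B figures for common quantization levels:
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q3_K_S", 3)]:
    print(f"{name}: ~{estimate_vram_gb(14, bits):.2f} GB")
```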

Table 2. VRAM estimates for a 14B parameter model
| Quantization | Weight Size | + Context (~4K) | Total VRAM |
|---|---|---|---|
| FP16 | 28 GB | +2 GB | ~30 GB |
| Q8_0 | 14 GB | +2 GB | ~16 GB |
| Q4_K_M | 7 GB | +2 GB | ~9 GB |
| Q3_K_S | 5.25 GB | +2 GB | ~7.25 GB |

A practical rule of thumb: pick a GGUF quant whose file size is 1-2 GB smaller than your total VRAM. This leaves room for the KV cache and avoids triggering the performance cliff described in the next section.

The RAM Performance Cliff

When a model is too large to fit entirely in GPU memory, llama.cpp and Ollama can "offload" some layers to system RAM. The model still runs, but performance drops dramatically.

Table 3. Performance impact of GPU/RAM split (third-party benchmarks)
| Configuration | Layers in VRAM | Tokens/sec | Slowdown |
|---|---|---|---|
| 8B model, full VRAM | 36/36 | ~40 t/s | 1x (baseline) |
| 8B model, partial offload | 25/36 | ~8 t/s | 5x slower |
| 70B model, heavy offload | ~30% | ~3 t/s | 15x slower |
| 70B model, full RAM spill | 0 | 1-2 t/s | 20-40x slower |

The reason is straightforward: during token generation, the model must read every layer sequentially. Layers in VRAM are accessed at GPU memory bandwidth (hundreds of GB/s). Layers in system RAM must cross the PCIe bus (~32 GB/s for PCIe 4.0 x16, often less in practice). Each offloaded layer adds a PCIe round-trip to every single token generated.

The degradation is not gradual. Even offloading a few layers causes a significant drop because the PCIe bottleneck gates the entire pipeline. The GPU sits idle waiting for data from RAM on every token.
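A back-of-the-envelope model makes the cliff visible. It assumes token latency is purely bandwidth-bound (every layer's weights read once per token) and uses illustrative bandwidth figures, so only the ratio is meaningful, not the absolute tokens/sec:

```python
def tokens_per_sec(layer_gb: float, n_layers: int, layers_in_vram: int,
                   gpu_bw_gbs: float = 500.0, pcie_bw_gbs: float = 25.0) -> float:
    """Bandwidth-bound latency model: VRAM layers read at GPU bandwidth,
    offloaded layers cross the PCIe bus on every generated token."""
    gpu_time = layers_in_vram * layer_gb / gpu_bw_gbs
    pcie_time = (n_layers - layers_in_vram) * layer_gb / pcie_bw_gbs
    return 1.0 / (gpu_time + pcie_time)

# 8B model at Q4: ~4.5 GB of weights spread over 36 layers
layer_gb = 4.5 / 36
full = tokens_per_sec(layer_gb, 36, 36)     # everything in VRAM
partial = tokens_per_sec(layer_gb, 36, 25)  # 11 layers offloaded to RAM
print(f"offloading 11/36 layers -> {full / partial:.1f}x slower")
```

Even with 70% of layers still on the GPU, the PCIe term dominates total latency, which matches the non-linear drop seen in the benchmarks above.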

Critical Takeaway

A smaller model that fits entirely in VRAM will almost always outperform a larger model that spills into RAM. Choose the largest model that fits, not the largest model available. A 9B at Q4 doing 23 t/s will feel dramatically faster than a 14B at Q4 doing 3 t/s because it spilled 2 GB into RAM.

MoE Models Change the Math

Mixture-of-Experts (MoE) models like GPT-OSS:20b and qwen3-coder have a different relationship between parameter count and VRAM. GPT-OSS:20b has 21 billion total parameters but only activates 3.6 billion per token. qwen3-coder has 30 billion total but activates only 3.3 billion.

The catch: all parameters must be loaded into memory, even the inactive experts. VRAM usage scales with total parameter count, not active parameters. But compute load scales with active parameters, so MoE models generate tokens faster relative to their total size.
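The split can be stated as a one-line ratio: memory tracks total parameters, per-token compute tracks active parameters. A small sketch using the figures above:

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Per-token compute of an MoE model relative to a dense model of the
    same total size. Memory, by contrast, scales with total_params_b."""
    return active_params_b / total_params_b

# GPT-OSS:20b: 21B total, 3.6B active per token
print(f"GPT-OSS:20b: {active_fraction(21, 3.6):.0%} of dense compute")
# qwen3-coder: 30B total, 3.3B active per token
print(f"qwen3-coder: {active_fraction(30, 3.3):.0%} of dense compute")
```

So a 30B MoE pays the memory bill of a 30B model but does roughly the per-token arithmetic of a 3B one, which is why these models feel fast for their size.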

Table 4. MoE vs Dense: VRAM and speed comparison
| Model | Total Params | Active Params | VRAM (native) | Relative Speed |
|---|---|---|---|---|
| GPT-OSS:20b | 21B | 3.6B | ~16 GB (MXFP4) | Fast (small active set) |
| qwen3-coder | 30B | 3.3B | ~8.5 GB (Q4) | Fast (small active set) |
| Devstral | 24B (MoE) | ~4B | ~14 GB (Q4) | Moderate |
| qwen3:14b (dense) | 14B | 14B | ~9 GB (Q4) | Moderate (all params active) |

MoE models offer high quality relative to their VRAM cost because the inactive experts still contribute to the model's overall knowledge. They represent some of the best value for users with 16 GB cards.

Practical Recommendations

  1. Check your VRAM. Run ollama ps or check your GPU specs. This is the hard constraint everything else depends on.
  2. Pick the largest model that fits at Q4_K_M with ~2 GB headroom. The headroom covers the KV cache and prevents VRAM pressure from other applications. See our VRAM model chart for specific models.
  3. If choosing between sizes, prefer Q4 of the larger model over Q8 of the smaller. A 14B at Q4 (92% quality, ~9 GB) will outperform a 9B at Q8 (99% quality, ~12 GB) on most tasks because the extra 5 billion parameters matter more than the marginal quality from higher precision.
  4. Never split across VRAM and RAM unless you have no choice. The performance cliff is severe. 3 t/s is not usable for interactive work.
  5. Consider MoE models. GPT-OSS:20b, qwen3-coder, and Devstral pack more capability per GB of VRAM than dense models of similar quality.

Format Comparison

GGUF is the most common format for local inference, but other options exist with different tradeoffs.

Table 5. Quantization format comparison
| Format | Backend | CPU+GPU | Best For |
|---|---|---|---|
| GGUF | llama.cpp / Ollama | Yes | Universal, any hardware |
| AWQ | vLLM, TGI | GPU only | NVIDIA serving, fast inference |
| GPTQ | vLLM, AutoGPTQ | GPU only | NVIDIA, older but widely supported |
| EXL2 | ExLlamaV2 | GPU only | Fastest NVIDIA inference |
| MXFP4 | Native (GPT-OSS) | Varies | Hardware-optimized, model-specific |

For AMD GPUs (including RDNA4 with Vulkan), GGUF via Ollama is the only well-supported option. AWQ, GPTQ, and EXL2 are NVIDIA-focused. MXFP4 is a native quantization format used by models like GPT-OSS that are distributed pre-quantized by their creators.

Sources

  • llama.cpp quantization benchmarks and perplexity measurements (ggml-org/llama.cpp GitHub discussions)
  • "GGUF Quantization: Quality vs Speed on Consumer GPUs" (dasroot.net, February 2026)
  • "GGUF Quantization Explained" (willitrunai.com)
  • "The Complete Guide to LLM Quantization" (localllm.in)
  • "Running Open Source LLMs Locally: Hardware Guide 2026" (apatero.com)
  • bartowski GGUF quantization repositories (Hugging Face)