First-party measurements that appear in chapters as summary claims. The raw data lives here so you can audit the setup, the hardware, and the configuration behind every first-person benchmark in the book. All measurements were taken on May 5, 2026.
Hardware
The benchmarks below were run on a rented GPU rather than a personally owned card. The reasons are documented in 21 Local vs. Cloud and are not load-bearing for the numbers themselves: a comparable-class consumer card produces comparable-class numbers on this workload.
- GPU: NVIDIA RTX A5000, 24 GB VRAM, Ampere architecture, compute capability 8.6
- Driver: 570.211.01, CUDA 12.4 toolkit, CUDA 12.8 runtime
- Host CPU: AMD Ryzen 9 9900X, 12 cores / 24 threads
- Host RAM: 128 GB DDR5
- Storage: NVMe SSD on the host
- Provider: RunPod community cloud, US-East region
- Provider cost: approximately $0.34 per hour for the GPU instance
- Software: llama.cpp build commit bbeb89d (built from source with GGML_CUDA=ON), Ubuntu 22.04 LTS
The A5000 is roughly comparable to a consumer RTX 4080 in memory bandwidth (~768 GB/s) and slightly behind it in raw compute throughput. The numbers below should reproduce within a few percent on a 4080. They will trail an RTX 4090 by 30 to 40 percent on memory-bandwidth-bound work (decode), and by 50 percent or more on compute-bound work (prefill). They will trail an H100 by an order of magnitude. The point of these numbers is to show the shape of the trade-off across quantization levels, not to argue an absolute throughput claim.
Quantization sweep: Qwen3-8B
The model is Alibaba’s Qwen3-8B, a transformer with 8.19 billion parameters (Qwen Team 2025). Quantizations are sourced from bartowski/Qwen_Qwen3-8B-GGUF on Hugging Face, downloaded fresh for each sweep run.
The sweep covers seven levels from Q8_0 (8 bits per weight, the highest-quality GGUF format) down to Q2_K (2-bit, on the lower edge of usability per 22 Quantization in Depth). For each level, three measurements were taken:
- llama-bench recorded prompt-processing throughput (pp512, 512 input tokens through the model in one batch, dominantly compute-bound) and token-generation throughput (tg128, 128 output tokens generated one at a time, dominantly memory-bandwidth-bound). All layers were offloaded to the GPU (-ngl 99). Three repetitions per measurement, mean reported.
- llama-perplexity evaluated each quant against the wikitext-2-raw test split (~290K tokens, 145 chunks of context length 2048). One pass per quant, since the binary itself reports a within-run standard deviation.
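The two tool invocations described above can be sketched as argument lists. This is a minimal sketch, not the exact commands used for the book: the model and corpus paths are hypothetical, and passing -ngl 99 to llama-perplexity is an assumption (the text only states it for llama-bench).

```python
from pathlib import Path

# The seven quantization levels covered by the sweep.
QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_0", "Q3_K_M", "Q2_K"]


def bench_cmd(model: Path) -> list[str]:
    """argv for one llama-bench run: pp512 and tg128, 3 repetitions."""
    return [
        "llama-bench",
        "-m", str(model),
        "-p", "512",    # prompt-processing test length (pp512)
        "-n", "128",    # token-generation test length (tg128)
        "-ngl", "99",   # offload all layers to the GPU
        "-r", "3",      # repetitions per measurement
        "-o", "csv",    # machine-readable output
    ]


def perplexity_cmd(model: Path, corpus: Path) -> list[str]:
    """argv for one llama-perplexity pass at context length 2048."""
    return [
        "llama-perplexity",
        "-m", str(model),
        "-f", str(corpus),   # raw text file to evaluate against
        "-c", "2048",        # chunk / context length
        "-ngl", "99",        # assumption: same offload setting as the bench
    ]


if __name__ == "__main__":
    # Hypothetical file layout, for illustration only.
    print(" ".join(bench_cmd(Path("models/example-Q4_K_M.gguf"))))
```

Each list can be handed to subprocess.run unchanged; keeping the flags in one place makes it harder for the seven runs to drift apart.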
| Quant | File size (MB) | pp512 (t/s) | tg128 (t/s) | PPL (wikitext-2) |
|---|---|---|---|---|
| Q8_0 | 8,307 | 3,773 | 73.0 | 8.69 ± 0.067 |
| Q6_K | 6,415 | 3,257 | 77.3 | 8.68 ± 0.067 |
| Q5_K_M | 5,581 | 3,515 | 94.5 | 8.70 ± 0.067 |
| Q4_K_M | 4,795 | 3,613 | 104.2 | 8.76 ± 0.067 |
| Q4_0 | 4,566 | 3,839 | 115.6 | 8.83 ± 0.067 |
| Q3_K_M | 3,934 | 3,374 | 85.6 | 9.07 ± 0.070 |
| Q2_K | 3,130 | 2,931 | 96.0 | 9.58 ± 0.072 |
The shape, not the absolute numbers, is the load-bearing claim:
- Token generation tracks file size, except where it doesn’t. From Q8_0 to Q4_0, smaller files generate faster (73 to 116 t/s), exactly what memory-bandwidth-bound work predicts: fewer bytes per weight means fewer bytes pulled across the bus per token. The exception is the K-quants below Q4_K_M. Q3_K_M and Q2_K have more file-size overhead per useful weight (the K-block format carries metadata) and decode through more involved per-block code paths, so instead of continuing to speed up as the file shrinks they fall back to 86 and 96 t/s, below Q4_K_M’s 104. Q4_0 wins outright: the smallest file among the formats at 4 bits and above, the simplest dequant kernel, and no K-block metadata cost.
- Prompt processing barely moves. pp512 numbers cluster between 2,900 and 3,800 t/s with no monotone relationship to file size. Prefill is compute-bound on the A5000, so bigger weight files don’t slow you down: you’re already bottlenecked on FMA throughput, not bandwidth. The ~25 percent spread is mostly which kernel the runtime selects per quant format, not the quant format’s “speed.”
- The quality cliff is at Q3, not earlier. From Q8_0 to Q4_0, perplexity moves 8.69 → 8.83, a delta of 0.14, or about two times the test’s own noise floor. Q6_K is fractionally lower than Q8_0 (8.68 vs 8.69), a difference well within the error bars: at the high end of the quant scale, the differences are below what wikitext-2 can resolve. Q3_K_M jumps to 9.07 (+0.38 vs Q8_0, more than 5x noise) and Q2_K to 9.58 (+0.89, ~13x noise). The cliff is steep, it sits clearly between Q4 and Q3, and it is consistent with the perplexity literature.
- The Q4_K_M sweet spot is real, and now measured rather than asserted. It costs ~58 percent of the Q8_0 file size, runs prompt-processing at ~96 percent of Q8_0’s rate, generates tokens ~43 percent faster, and pays 0.07 perplexity points (about one standard deviation of the test) for those gains. That is the strongest argument the dataset can make: by the test’s own resolution, you cannot statistically distinguish Q4_K_M from Q8_0 on this corpus, but you can clearly distinguish them on speed and disk.
- Q4_0 is faster but slightly worse. It’s the throughput champion in this sweep, but adds 0.14 perplexity points over Q8_0 (~2x noise) versus Q4_K_M’s 0.07. The legacy Q4_0 format is also more sensitive to outlier weights than Q4_K_M’s mixed-precision blocks, so quality variance across other models is wider than this single Qwen3 number suggests. If you’re pinned to one model and have measured it, Q4_0 is a real option. If you’re rotating models, Q4_K_M is the safer default.
- Q2_K is faster than Q3_K_M, but neither catches Q4_K_M. Q3_K_M is the clearest “wrong direction” speed result in the sweep: at 85.6 t/s it generates more slowly than Q5_K_M (94.5 t/s) despite a file about 30 percent smaller, and Q2_K only recovers to 96.0 t/s. Q2_K uses fewer per-block scales and a simpler dequant path; Q3_K_M’s more elaborate format costs more per token to decode than its smaller file size saves on bandwidth. Don’t read this as “Q2_K is good”: the +0.89 perplexity tax says otherwise. Read it as “below Q4, the speed/quality frontier stops being a simple curve, and the perplexity cliff is steep.”
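The bandwidth argument in the first bullet can be turned into a rough ceiling. The sketch below assumes that generating one token reads every byte of the weight file exactly once, that the table's MB are 10^6 bytes, and that the A5000 sustains its quoted 768 GB/s. All three are simplifications, so read the result as an upper bound rather than a prediction.

```python
A5000_BANDWIDTH_GB_S = 768.0  # memory bandwidth quoted in the Hardware section

# (file size in MB, measured tg128 in t/s), copied from the sweep table
SWEEP = {
    "Q8_0": (8307, 73.0),
    "Q6_K": (6415, 77.3),
    "Q5_K_M": (5581, 94.5),
    "Q4_K_M": (4795, 104.2),
    "Q4_0": (4566, 115.6),
    "Q3_K_M": (3934, 85.6),
    "Q2_K": (3130, 96.0),
}


def decode_ceiling(file_mb, bandwidth_gb_s=A5000_BANDWIDTH_GB_S):
    """Upper bound on tokens/s if each token streams the whole file once."""
    return bandwidth_gb_s * 1000.0 / file_mb


for quant, (mb, measured) in SWEEP.items():
    ceiling = decode_ceiling(mb)
    print(f"{quant:7s} ceiling {ceiling:6.1f} t/s  measured {measured:6.1f} t/s  "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the formats from Q8_0 to Q4_0 land at roughly 65 to 80 percent of the ceiling, while Q3_K_M and Q2_K fall below 45 percent, which is the same break in the trend that the bullets describe.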
The tg128 standard deviations across three repetitions are all below 0.5 t/s; reproducibility on a quiet box is good. The pp512 stddevs are noisier in absolute terms (30-60 t/s on a ~3,500 t/s mean) but still under 2 percent, not enough to change any of the qualitative claims above. The perplexity error bars (±0.067 to ±0.072) are reported by llama-perplexity itself across the chunks of the wikitext-2 test set; a “real” multi-seed perplexity standard deviation would be slightly larger, so treat any single quant-to-quant delta smaller than ~0.10 as inside the noise.
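The ratios quoted in the bullets, and the ~0.10 noise threshold suggested above, can be recomputed directly from the table. A minimal sketch; the dictionary layout and function name are illustrative choices, not part of the benchmark tooling.

```python
# quant: (file MB, pp512 t/s, tg128 t/s, perplexity, reported +/-)
ROWS = {
    "Q8_0": (8307, 3773, 73.0, 8.69, 0.067),
    "Q6_K": (6415, 3257, 77.3, 8.68, 0.067),
    "Q5_K_M": (5581, 3515, 94.5, 8.70, 0.067),
    "Q4_K_M": (4795, 3613, 104.2, 8.76, 0.067),
    "Q4_0": (4566, 3839, 115.6, 8.83, 0.067),
    "Q3_K_M": (3934, 3374, 85.6, 9.07, 0.070),
    "Q2_K": (3130, 2931, 96.0, 9.58, 0.072),
}

NOISE_THRESHOLD = 0.10  # deltas below this are treated as inside the noise


def compare(quant, baseline="Q8_0"):
    """Size, speed, and perplexity of `quant` relative to `baseline`."""
    mb, pp, tg, ppl, _ = ROWS[quant]
    b_mb, b_pp, b_tg, b_ppl, b_err = ROWS[baseline]
    delta = ppl - b_ppl
    return {
        "size_ratio": mb / b_mb,           # fraction of baseline file size
        "pp_ratio": pp / b_pp,             # fraction of baseline prefill rate
        "tg_speedup": tg / b_tg - 1.0,     # fractional gain in generation rate
        "ppl_delta": delta,                # perplexity points added
        "noise_multiples": delta / b_err,  # delta in units of the baseline +/-
        "inside_noise": abs(delta) < NOISE_THRESHOLD,
    }


for name in ROWS:
    r = compare(name)
    print(f"{name:7s} size {r['size_ratio']:.0%}  tg {r['tg_speedup']:+.0%}  "
          f"dPPL {r['ppl_delta']:+.2f}  inside noise: {r['inside_noise']}")
```

With the threshold at 0.10, Q6_K, Q5_K_M, and Q4_K_M are indistinguishable from Q8_0 on this corpus, and Q4_0, Q3_K_M, and Q2_K are not.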
Context-length sweep: where prefill stops being free
A separate llama-bench run on Q4_K_M measured prompt-processing throughput at five context lengths from 512 to 32,768 tokens. Two repetitions per measurement. Token-generation was measured at the default short-context setting (tg128).
| Test | Context length | Throughput (t/s) |
|---|---|---|
| pp512 | 512 | 3,508 ± 70 |
| pp2048 | 2,048 | 3,071 ± 11 |
| pp8192 | 8,192 | 2,144 ± 5 |
| pp16384 | 16,384 | 1,511 ± 5 |
| pp32768 | 32,768 | 872 ± 0.4 |
| tg128 | n/a | 96.2 ± 0.5 |
Prefill throughput falls slowly at first and then faster as context grows: 12 percent from 512 to 2,048 tokens, 30 percent from 2,048 to 8,192, and 59 percent from 8,192 to 32,768. Going from 512 to 32,768 tokens, prompt-processing throughput drops by 4x while the prompt gets 64x longer. Per-token throughput would stay flat if all the work were linear in context; it falls because attention is quadratic. The quadratic term doesn’t dominate yet at these lengths on an 8B model, but you can see it pulling the curve down: the pp8192 → pp16384 transition costs you 30 percent throughput, and pp16384 → pp32768 costs another 42 percent.
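The step-to-step losses can be recomputed from the table. A minimal sketch using only the mean throughputs listed above; the function name is illustrative.

```python
# (context length in tokens, mean prefill throughput in t/s) from the table
PREFILL = [(512, 3508), (2048, 3071), (8192, 2144), (16384, 1511), (32768, 872)]


def step_drops(rows):
    """Fractional throughput loss between consecutive context lengths."""
    drops = []
    for (n0, t0), (n1, t1) in zip(rows, rows[1:]):
        drops.append((n0, n1, 1.0 - t1 / t0))
    return drops


for n0, n1, loss in step_drops(PREFILL):
    print(f"{n0:>6} -> {n1:>6} tokens: throughput falls {loss:.0%}")

# End-to-end slowdown in tokens per second across the whole sweep.
overall = PREFILL[0][1] / PREFILL[-1][1]
print(f"512 -> 32768 tokens: throughput is {overall:.1f}x lower")
```

Note that the first two steps quadruple the context while the last two only double it, so the per-step percentages are not directly comparable without accounting for the step size.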
The practical translation: a 32K-token prompt does not take 64 times as long as a 512-token prompt to prefill. It takes about 256 times as long, because the prompt is 64x longer and throughput is 4x lower (32,768 / (872 t/s) ≈ 38 seconds, versus 512 / (3,508 t/s) ≈ 0.15 seconds). That is a meaningful penalty if you are building latency-sensitive interfaces, but it is still far more forgiving than the worst-case attention math would predict for a model this size: purely quadratic scaling would make it 64² = 4,096 times as long. The KV cache budget (Chapter 23) starts to matter long before the throughput cliff does: at 32,768 tokens, the FP16 cache for Qwen3-8B (36 layers, 8 KV heads of dimension 128, so about 144 KiB per token) is roughly 4.5 GB on top of the ~5 GB the Q4_K_M weights occupy, leaving about 14 GB of the 24 GB A5000 unused but unavailable for stretching context further.
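Both pieces of arithmetic in this paragraph, prefill wall-clock time and KV cache size, are easy to check. In the sketch below the throughputs come from the table; the architecture figures (36 layers, 8 KV heads, head dimension 128) are an assumption taken from the published Qwen3-8B configuration rather than from the benchmark logs, and the cache is assumed to hold FP16 keys and values with no quantization.

```python
def prefill_seconds(n_tokens, tokens_per_s):
    """Wall-clock time to process a prompt at a given mean throughput."""
    return n_tokens / tokens_per_s


short_s = prefill_seconds(512, 3508)    # pp512 row
long_s = prefill_seconds(32768, 872)    # pp32768 row
print(f"512 tokens:    {short_s:.2f} s")
print(f"32,768 tokens: {long_s:.1f} s ({long_s / short_s:.0f}x longer)")


def kv_cache_bytes(n_tokens, n_layers=36, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """KV cache size: keys and values for every layer, head, and token.

    Defaults are the assumed Qwen3-8B shape with FP16 (2-byte) elements.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * n_tokens


cache = kv_cache_bytes(32768)
print(f"KV cache at 32,768 tokens: {cache / 2**30:.1f} GiB")
```

The time ratio is the product of the two effects, 64x more tokens and roughly 4x lower throughput. The cache figure scales linearly with context, so halving the context or quantizing the cache to 8 bits each halve it.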
Temperature sweep: how much does output actually change?
A single quantization level (Q4_K_M) was run against two fixed prompts at five temperature settings each. Seed pinned to 42 across all runs. Output capped at 256 tokens for the constrained prompt and 200 for the open-ended one. The point is to observe how output character changes with temperature on the same input, not to score correctness.
Constrained prompt (high-confidence task)
Prompt: “Summarize the following abstract in exactly three bullet points, one sentence each.” (followed by the GPT-3 paper abstract as input).
| Temp | First bullet (verbatim opening) |
|---|---|
| 0.0 | “GPT-3 is a large autoregressive language model with 175 billion parameters, significantly larger than previous non-sparse models.” |
| 0.3 | identical to T=0.0 |
| 0.5 | identical to T=0.0 |
| 0.7 | “…with 175 billion parameters, ten times larger than previous non-sparse models.” |
| 1.0 | identical to T=0.7 |
Bullets two and three were word-for-word identical across all five temperatures. The only variance was a single phrase in bullet one: “significantly larger” at T ≤ 0.5, “ten times larger” at T ≥ 0.7.
This is the part of the temperature picture most users miss. On a constrained task where the right answer is high-probability at every step, the next-token distribution is so peaked that even when temperature doubles from 0.5 to 1.0, halving every logit gap before the softmax, the same token still wins at almost every step. T=0.7, the de facto field default, is close to a no-op here: it changed one phrase. You only see variance when the model is genuinely uncertain. Summarizing a well-written abstract into three bullets, given a fixed seed, is not that situation.
The corollary: if you are running a structured-extraction task and switching temperature does not change your outputs, that is not a bug. It is the model telling you the task is below the noise floor of its own sampling.
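The mechanism is visible in the temperature-scaled softmax itself. The sketch below uses two hypothetical logit vectors, one peaked and one nearly flat; they are invented for illustration and are not taken from the model. T=0.0 is omitted because it corresponds to greedy argmax rather than a division by zero.

```python
import math


def softmax_t(logits, temperature):
    """Softmax over logits divided by temperature (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits, for illustration only.
PEAKED = [12.0, 4.0, 3.0, 2.0]   # one token far ahead: a "constrained" step
FLAT = [2.0, 1.8, 1.6, 1.5]      # several plausible tokens: an "open" step

for t in (0.3, 0.5, 0.7, 1.0, 1.3):
    p_peaked = softmax_t(PEAKED, t)[0]
    p_flat = softmax_t(FLAT, t)[0]
    print(f"T={t:.1f}  top-token prob: peaked {p_peaked:.4f}  flat {p_flat:.4f}")
```

For the peaked vector the top token keeps more than 99 percent of the mass across the whole range, so sampling almost never picks anything else. For the flat vector the top token's share moves substantially with temperature, which is where different outputs come from.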
Open-ended prompt (low-confidence task)
Prompt: “Write the opening paragraph of a short story about a librarian who discovers their library has been quietly rearranging itself overnight.”
| Temp | Librarian’s name | Distinguishing detail |
|---|---|---|
| 0.0 | Eleanor | “old Carnegie Library”, “chill ran down her spine” |
| 0.3 | Eleanor | unnamed library, ends “with a purpose she hadn’t yet understood” |
| 0.7 | Elara | invokes Dewey Decimal, ends with the spine whispering her own name |
| 1.0 | Clara | library “sleepwalking through the night, reshaping its own heart” |
| 1.3 | Elara | “deliberate rearrangement, as if the library itself had been whispering secrets in the night” |
Now temperature actually does something. Names rotate (Eleanor → Elara → Clara → Elara), the setting specifies and unspecifies itself, and the tonal register drifts: at T=0.0 the prose is competent and conventional; at T=0.7 the model takes a small supernatural turn (the spine whispers); at T=1.0 the metaphor stretches further (the library is “sleepwalking, reshaping its own heart”); at T=1.3 the prose stays coherent but is noticeably more committed to a single creepy mood.
Two observations:
- Even at T=1.3 the output is not gibberish. The “high temperature breaks the model” intuition is overstated for current-generation chat-tuned models on prompts they are well-suited to. The chat template applies its own structural constraints: paragraph shape, sentence length distribution, and the implicit “be a useful assistant” prior all survive aggressive sampling. The thing that varies is which coherent answer the model commits to, not whether it stays coherent.
- The dynamic range is real but narrow. All five outputs are recognizably the same scene: a female librarian, fingers on spines, a quiet wrongness. None of the temperature settings produces, say, a male librarian, a comedy register, or a setting outside a brick-and-mortar library. The model’s prior over “librarian discovers the library is rearranging itself” is strong; temperature reshuffles within that prior, it does not break out of it. If you want genuinely different responses, vary the prompt, not the temperature.
The throughput numbers during these runs (1,547 to 1,602 t/s for prompt processing, 108.7 to 109.6 t/s for generation) differ from the bench-sweep numbers: prompt processing is well below the sweep’s pp512 figure because the chat template materially expands the prompt, while generation is slightly above the sweep’s tg128 figure of 104.2 t/s. Both are stable across temperature settings. Temperature changes which token gets sampled; it does not change how fast the model can sample.
Cost comparison: paper-summary task
The recurring example in the book is summarizing a 40-page paper into 5 bullet points. Token shape: roughly 10,000 input tokens, 500 output tokens. The same task run across providers and locally costs the following, per task:
| Path | Approximate cost per task | Source of the number |
|---|---|---|
| Local on owned consumer card | ~$0.00006 | 7.6 s wall-clock at ~200 W draw, $0.15/kWh residential |
| Local on rented A5000 | ~$0.0007 | 7.6 s × $0.34/hr instance time |
| Gemini 2.5 Flash | ~$0.004 | $0.30/M input + $1.20/M output, May 2026 list price |
| Claude Haiku 4.5 | ~$0.010 | $0.80/M input + $4.00/M output, May 2026 list price |
| GPT-5.4 Mini | ~$0.011 | $0.30/M input + $1.20/M output, May 2026 list price |
| Claude Sonnet 4.6 | ~$0.038 | $3/M input + $15/M output, May 2026 list price |
| GPT-5.4 (full) | ~$0.038 | $3/M input + $15/M output, May 2026 list price |
| Gemini 3.1 Pro | ~$0.038 | $3/M input + $15/M output, May 2026 list price |
| Claude Opus 4.7 | ~$0.188 | $15/M input + $75/M output, May 2026 list price |
The local-on-owned-card number assumes the card is already paid for, which is the right way to read marginal cost on infrastructure you already own. The amortized cost (which you should also calculate when the card is the reason you are doing the task) is the card’s purchase price divided by the task volume over the card’s expected useful life. A $450 RTX 5060 Ti running 1,000 paper summaries has an amortized cost of $0.45 per summary; running 1,000,000 summaries has an amortized cost of $0.00045 per summary. Volume is the lever.
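The per-task figures in the table and the amortization example reduce to four small formulas. A minimal sketch, with function names chosen for illustration; the inputs are the token counts, prices, wall-clock time, and power draw stated in this section.

```python
INPUT_TOKENS = 10_000   # paper-summary task, input side
OUTPUT_TOKENS = 500     # paper-summary task, output side


def api_cost(in_per_m, out_per_m, n_in=INPUT_TOKENS, n_out=OUTPUT_TOKENS):
    """Dollars per task at per-million-token list prices."""
    return n_in / 1e6 * in_per_m + n_out / 1e6 * out_per_m


def rental_cost(seconds, dollars_per_hour):
    """Dollars of instance time consumed by one task."""
    return seconds / 3600 * dollars_per_hour


def electricity_cost(seconds, watts, dollars_per_kwh):
    """Dollars of electricity consumed by one task on an owned card."""
    return seconds / 3600 * watts / 1000 * dollars_per_kwh


def amortized_cost(card_price, n_tasks):
    """Purchase price spread evenly over the tasks the card will run."""
    return card_price / n_tasks


print(f"owned card:  ${electricity_cost(7.6, 200, 0.15):.5f}")
print(f"rented card: ${rental_cost(7.6, 0.34):.4f}")
print(f"$3/$15 tier: ${api_cost(3, 15):.4f}")
print(f"$15/$75 tier: ${api_cost(15, 75):.4f}")
print(f"$450 card over 1,000 tasks: ${amortized_cost(450, 1_000):.2f}")
```

The marginal and amortized numbers answer different questions, so it is worth computing both: add amortized_cost to electricity_cost when the card was bought for this workload.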
The published list prices change frequently. Anthropic, OpenAI, and Google all adjusted pricing at least once during the eighteen months ending in May 2026, generally downward for non-frontier tiers and stable to upward for frontier tiers. Use these numbers as May-2026 snapshots, not as forward guarantees. The Artificial Analysis pricing tracker (Artificial Analysis 2026) is the closest thing to a live source the field has.
Reproducibility notes
- The model file: bartowski/Qwen_Qwen3-8B-GGUF, exact quantization filenames in the table above.
- The llama.cpp commit (bbeb89d) is fixed in the build log alongside the raw llama-bench CSV output. Different commits can shift these numbers by 10 to 30 percent; the same commit on the same hardware should reproduce within run-to-run noise, which llama-bench reports as standard deviation in the stddev_ts column.
- The provider, GPU model, driver version, and host CPU all matter. Substituting any of them produces a different number. The trade-off shape across quantization levels is the portable claim; the absolute numbers are not.
- All instance time spent on these benchmarks: approximately one hour. Total spend: under $1.00.
Raw outputs
Raw outputs are saved as logs appended to the book’s build artifacts. The llama-bench CSV (with full column set including build commit, model parameters, batch sizes, and standard deviations) is the canonical source. The summary tables above are derived from those rows.
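Deriving a summary row from the CSV might look like the sketch below. Only the stddev_ts column is named in this appendix; the other column names (build_commit, model_filename, n_prompt, n_gen, avg_ts) are assumptions about the llama-bench CSV layout, and the sample rows are synthetic placeholders, not measurements.

```python
import csv
import io

# Synthetic sample with assumed column names; values are placeholders.
SAMPLE_CSV = """build_commit,model_filename,n_prompt,n_gen,avg_ts,stddev_ts
bbeb89d,example-Q4_K_M.gguf,512,0,1000.0,10.0
bbeb89d,example-Q4_K_M.gguf,0,128,50.0,0.5
"""


def summarize(csv_text):
    """Turn llama-bench-style CSV rows into labelled mean/stddev records."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        n_prompt, n_gen = int(rec["n_prompt"]), int(rec["n_gen"])
        # A row is a prompt-processing test if it has prompt tokens,
        # otherwise a token-generation test.
        test = f"pp{n_prompt}" if n_prompt else f"tg{n_gen}"
        mean, sd = float(rec["avg_ts"]), float(rec["stddev_ts"])
        rows.append({
            "commit": rec["build_commit"],
            "model": rec["model_filename"],
            "test": test,
            "mean_ts": mean,
            "stddev_ts": sd,
            "rel_stddev": sd / mean,  # run-to-run noise as a fraction
        })
    return rows


for row in summarize(SAMPLE_CSV):
    print(f"{row['test']:6s} {row['mean_ts']:8.1f} ± {row['stddev_ts']:.1f} t/s "
          f"({row['rel_stddev']:.1%})")
```

Keeping the commit hash in every derived row makes it obvious when numbers from two different builds have been mixed into one table.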