The model does not pick the next word. A separate, tiny piece of code called the sampler does, and every “temperature” or “top-p” knob you have ever touched lives there.
Chapter 2 set up the cleanest seam in the whole system. The model reads your prompt, runs its forward pass, and hands out a ranked list of how likely each possible next word is. Then it stops. Another piece of code, the sampler, picks one of those words and writes it down. That split matters more than it sounds, because everything the model did to produce the ranking (the tokenization, the embedding, the stack of transformer blocks, the final projection to logits [the raw unnormalized scores the model assigns to every possible next token]) was deterministic. Fix the input, fix the weights, and the ranking comes out the same both times. The only parts of your API call that can roll the dice are the inputs and the sampler. This chapter is about the sampler.
Four of the knobs you touch most often, temperature, top-p, top-k, frequency penalty, all live in this one layer. So do the most commonly mis-stated claims in public prompt guides. “Temperature 0 is deterministic.” “Higher temperature means more creative.” “Top-p 0.95 is always the right default.” Each of those claims is partly true and partly wrong, in ways that matter once you start measuring.
Keep the running example in mind. You paste a forty-page paper into the chat box and ask for five bullets. The model reads the paper, runs its forward pass, and produces its ranking of what the next word of the summary should be. The sampler picks one word from that ranking. Then the whole loop runs again. Rerun the same prompt and you get a different summary, not because the model thought something different, but because the sampler rolled different dice. Every choice in this chapter, temperature, truncation, penalties, shapes how those dice are weighted.
The sampler is not where the intelligence lives. It is where you choose how to draw from the intelligence the model produced.
9.1 Why the same prompt gives different outputs
Here is the observation. Send the same prompt to the same model twice and you often get different text. Most people’s first theory is that the model “thought of something different this time.” This is not what happened. The spine-line from Section 1.1 still holds: the model reads what came before and produces a ranked guess at the next word. With the inputs and weights fixed, it produces the exact same ranking both times. What changed is the sampler’s draw from that ranking.
Go back to the paper-summary example for a moment. You paste the paper in, ask for five bullets, and the model runs through the whole paper. When it gets to the first word of bullet one, it produces a list that looks something like this (the tokens are toy stand-ins, chosen for readability):
mat: 0.40
floor: 0.20
chair: 0.10
sofa: 0.05
windowsill: 0.03
... thousands more, trailing off
That list has around 128,000 entries on a modern model, one per possible next token. For comparison, a dictionary of every word in English has on the order of 200,000 entries, so the model is choosing from a vocabulary more or less the size of the Oxford English Dictionary at every single step. The raw scores the network emits are called logits. A small operation called softmax [a function that squashes a list of real numbers into positive values that sum to 1, turning them into probabilities] turns those scores into the probabilities you see in the table above. The sampler then picks one token from the table.
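The logits-to-probabilities step is small enough to sketch in a few lines. The version below is a minimal illustration in plain Python, not any particular library's implementation; the five logit values are made up for the example and are not real model output.

```python
import math

def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    # Subtracting the max first avoids overflow in exp() and leaves the result unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a five-token toy vocabulary (illustrative values only).
tokens = ["mat", "floor", "chair", "sofa", "windowsill"]
logits = [2.0, 1.3, 0.6, -0.1, -0.6]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token}: {p:.2f}")
```

A real vocabulary has on the order of 128,000 entries rather than five, but the operation is the same: exponentiate, then divide by the total.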
The simplest policy is greedy sampling: always pick the most probable token. That is deterministic. Fix the distribution, and greedy’s output is fixed too. It is roughly what you get when you set temperature = 0 on most APIs. Greedy is the right default for tasks with a single correct answer, a classification label, a function name, a SQL keyword, but it is famously catastrophic for open-ended generation.
Holtzman et al.’s The Curious Case of Neural Text Degeneration (Holtzman et al. 2020) showed why. On open-ended prompts that run longer than a paragraph, greedy decoding tends to produce text that loops on itself, repeating the same phrases and collapsing into degenerate cycles. The model’s ranked list always has a long tail of plausible continuations; greedy throws all of them away at every step. Natural-sounding text requires occasionally reaching a little way into the tail and pulling out a lower-probability token.
The fix is stochastic sampling: roll weighted dice over the ranked list (or some truncation of it) and take whichever token comes up. The tradeoff is immediate. More randomness means more diversity, more surprise, and eventually, more incoherence. Less randomness means more coherence, more predictability, and eventually, more repetition. Every knob in this chapter steers that tradeoff.
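The two policies described above differ by a single line. The sketch below is illustrative only: it uses the head of the toy ranking from earlier in this section and lumps the rest of the vocabulary into one placeholder entry, which is an assumption made so the weights sum to 1.

```python
import random

def greedy_pick(tokens, probs):
    """Always return the most probable token."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return tokens[best]

def sample_pick(tokens, probs, rng):
    """Draw one token with probability proportional to its weight."""
    return rng.choices(tokens, weights=probs, k=1)[0]

# Head of the toy ranking; "<rest>" is a placeholder for the long tail.
tokens = ["mat", "floor", "chair", "sofa", "windowsill", "<rest>"]
probs = [0.40, 0.20, 0.10, 0.05, 0.03, 0.22]

rng = random.Random(0)  # seeded only so the sketch is repeatable
greedy_draws = {greedy_pick(tokens, probs) for _ in range(1000)}
sampled_draws = [sample_pick(tokens, probs, rng) for _ in range(1000)]

print("greedy picks:", greedy_draws)
print("distinct sampled tokens:", len(set(sampled_draws)))
```

Greedy returns the same token on every call. The weighted draw returns the top token most often but not always, and that is where rerun variety comes from.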
9.1.1 The “temperature 0 is deterministic” myth
Here is a puzzle. Greedy sampling is deterministic. Many APIs expose a temperature=0 setting that is supposed to trigger greedy. So identical prompts at temperature 0 should always produce identical outputs. They do not. The gap between “should” and “do” is worth naming because it trips up people writing evals.
At the math level, greedy is a deterministic function of the logits. At the serving level, the logits themselves are not byte-identical run-to-run. Three things push them around.
- Floating-point non-associativity. The model’s work is mostly matrix multiplication [a huge grid of numbers multiplied against another huge grid, the main arithmetic a transformer does]. GPUs parallelize that work across thousands of cores, and each core adds up its piece in whatever order the scheduler fires them. In plain arithmetic, (a + b) + c equals a + (b + c). In 16-bit floating-point, the two orderings agree to many digits but not to the very last bit. Multiply that tiny rounding difference across billions of operations and the logits at the end of the forward pass can disagree in the low bits between two runs of the same prompt.
- Kernel selection. A kernel is the specific piece of GPU code that carries out one of those matrix multiplications. Libraries like cuBLAS [NVIDIA’s standard GPU math library] keep a menu of different kernels tuned for different matrix shapes and pick one based on the batch size, tensor alignment, and what else is running on the GPU. Different kernels agree on the answer to many significant digits but disagree in the last bit. NVIDIA documents this in their cuBLAS reproducibility guide (NVIDIA 2026).
- Mixed-precision rounding and batch effects. Modern serving stacks run the hot matrix multiplications in FP8 or BF16 [low-precision number formats that use 8 or 16 bits instead of 32, trading a little accuracy for a lot of speed] and keep a few sensitive operations in higher precision. Where the precision boundaries sit can depend on how large the batch is and who else is sharing the GPU. Your request running alone produces slightly different logits from the same request running alongside seven other users.
Here is what that actually looks like for your paper summary. Run the summary twice at temperature 0, a few minutes apart. The first run goes out when the GPU is quiet; the second goes out when six other users are hitting the same machine. Inside the network, one layer’s output differs by a tiny fraction in the low bits. That tiny difference propagates forward through the next fifty layers, growing a little at each step, until at some token deep in bullet three the model is weighing two continuations that were nearly tied. In run one, the low-bit difference tips the tie toward “the authors”; in run two, toward “the paper.” From that token on, the two runs diverge. Both summaries remain perfectly reasonable. They are not the same string.
That is the practical upshot. Even at temperature 0, identical API calls can produce different text, especially on long generations where a single low-bit difference in one layer’s output gets amplified until it flips a tie between two nearly equally likely tokens. The quality difference between the two runs is usually small. The string difference is real, and it is why “I set temperature to 0 and my eval is still flaky” is a normal production problem, not a bug in your code.
This is the first place public prompt guides consistently mislead. The fix is simple: treat “temperature 0” as “as deterministic as we can get, given the floating-point physics underneath,” not as a guarantee.
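The non-associativity point is easy to reproduce on a CPU. The sketch below uses ordinary 64-bit Python floats rather than GPU kernels, so it is an analogy for the mechanism and not a reproduction of any serving stack. The two logit dictionaries are hypothetical: they model two continuations that are exactly tied in real arithmetic and whose sums happen to be accumulated in opposite orders on two runs.

```python
def accumulate(parts):
    """Add numbers strictly left to right, like one fixed reduction order."""
    total = 0.0
    for x in parts:
        total += x
    return total

contributions = [0.1, 0.2, 0.3]
forward = accumulate(contributions)             # ((0.1 + 0.2) + 0.3)
backward = accumulate(reversed(contributions))  # ((0.3 + 0.2) + 0.1)
print("same result?", forward == backward, "gap:", forward - backward)

def greedy(logits):
    """Return the key with the highest logit."""
    return max(logits, key=logits.get)

# Hypothetical near-tie: which order each sum was reduced in differs per run.
run_one = {"the authors": forward, "the paper": backward}
run_two = {"the authors": backward, "the paper": forward}
print(greedy(run_one), "|", greedy(run_two))
```

The gap between the two sums sits in the last bit, yet it is enough to change which token greedy selects. Every later token is then conditioned on a different prefix.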
9.2 Temperature
Now the knob itself. Temperature is the dial that controls how sharply or flatly the sampler sees the ranked list the model handed it.
Picture the ranking as a landscape of hills. Tall peaks where the model is confident, low rolling ground across the rest of the 128,000-token vocabulary. Low temperature squeezes the landscape tall and narrow, pinning the draw onto the highest peak. High temperature squashes the landscape flat, letting the dice wander into the valleys. Mechanically, temperature is one line of code inserted between the logits and the softmax: probs = softmax(logits / T).
That division by T is the whole mechanism. Work through what it does to the list.
- T = 1. The logits pass through unchanged. This is the “natural” ranking the model produced.
- T < 1. Dividing by a number less than 1 stretches the logits: large values get larger, small values get smaller in relative terms. The softmax output concentrates on the top few tokens. In the limit T → 0, the ranking collapses onto the single highest-probability token, which is greedy sampling.
- T > 1. Dividing by a number greater than 1 compresses the logits, pulling large and small values toward each other. The softmax output flattens. Low-probability tail tokens get noticeably more mass. In the limit T → ∞, the ranking approaches uniform, and the sampler picks tokens at random from the entire vocabulary.
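The three cases above can be checked numerically. This is a minimal sketch with made-up logits; the function name is illustrative and does not correspond to any library call.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax."""
    if temperature <= 0:
        # T = 0 is handled as greedy argmax in practice, not by dividing by zero.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Reading the rows top to bottom, the top token loses mass and the tail gains it as T rises. The order of the tokens never changes, which is the sense in which temperature adds no information.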
That is the full mathematical story. Everything else is folk doctrine attached to it.
9.2.1 What the folk doctrine says
The common recipe, repeated in countless blog posts and API tutorials, goes something like:
Lower temperature for factual / analytical tasks. Higher temperature for creative tasks.
0.0 for code and math, 0.7 for chat, 1.0 for creative writing.
There is a grain of truth here. At low temperature, the sampler collapses onto the model’s most confident prediction, which on a well-specified prompt tends to be the answer the model “means.” At moderate temperature, you introduce variation from the ranking the model already produced, which preserves coherence while allowing variety. At high temperature, you flatten the ranking and start drawing from tokens the model itself considered unlikely, and output quality degrades.
The grain-of-truth reading is:
- Temperature trades off determinism for diversity at fixed quality.
- Beyond some model-specific ceiling (often around 1.2 to 1.5 on well-aligned 2026 models), quality collapses because you are now sampling from tokens the model’s own confidence said were implausible.
- Temperature does not add information. The model emits the same ranking. You are choosing how far into its tail to reach.
9.2.2 What the folk doctrine overstates
The overstatement is the implication that low temperature makes models more accurate. Renze & Guven’s 2024 study (Renze and Guven 2024) ran a careful sweep across nine models and five prompting techniques on multiple-choice problem-solving tasks (MCQA [multiple-choice question answering, where the model picks from a fixed list of candidate answers]), varying temperature from 0.0 to 1.0. They found no statistically significant effect on accuracy across that range. Same problems, same models, same prompts, different temperatures, same scores within noise.
This is a real result and it deserves to hit your intuition. The folk model says temperature 0 should beat temperature 0.7 on a math problem because the model should pick its “best” answer. Renze and Guven show that for single-sample MCQA, this effect is smaller than the run-to-run noise. If there is a temperature effect on reasoning, it is too small to measure on a realistic eval budget.
Two caveats keep this from being a total takedown of temperature folk doctrine. First, the Renze and Guven study is on multiple-choice problems with a fixed answer set. On generative tasks (proofs, code, long-form analysis), the story is less clean; higher temperature can derail long outputs in ways the MCQA setup cannot see. Second, the story changes at the multi-sample level. If you generate twenty completions at temperature 0.7 and take the majority vote, you get a result (self-consistency; Wang et al. 2023) that a single temperature-0 sample cannot reach. Zhang et al. (Zhang et al. 2025) formalize the task-adaptive temperature question. The right temperature depends on whether you are doing single-sample inference or drawing many samples and aggregating, and on the verifier you have at the end.
For the paper-summary task, this matters in a specific way. If you are producing one summary and reading it once, the difference between temperature 0 and temperature 0.7 is mostly a difference in phrasing, not in whether the summary gets the paper right. If you are producing twenty summaries and having a second pass pick the best one, moderate temperature is a feature, because the variety between the twenty is what lets the second pass find one that covers the paper better than any single low-temperature attempt would.
9.2.3 Useful operating ranges
Given the above, here is the operating range that tracks the 2026 literature and my own measurements on open-weight models:
- T = 0.0 (greedy / near-greedy). Extraction, classification, tool calls, code with a single correct answer, evaluation runs you want to rerun cleanly. You accept rerun variance from serving non-determinism but not from the sampler itself.
- T ≈ 0.3–0.7. Conversational assistants, customer support, factual Q&A with some tolerance for paraphrase. The default for most chat interfaces sits near 0.7, and it is a reasonable default.
- T ≈ 0.7–1.0. Creative writing, brainstorming, ideation, anywhere a single “right” answer does not exist. Above 1.0, you enter the region where output quality becomes noticeably model-dependent.
- T > 1.0. Handle with care. Some models (Qwen 2.5 and later, Llama 4) stay coherent up to around 1.3. Others, especially smaller open-weight models, start to slip earlier. Past ~1.5, almost nothing stays coherent.
One important interaction: temperature does not act alone. How it feels in practice depends on what truncation method it is paired with. A temperature of 1.2 with aggressive top-p truncation behaves differently from 1.2 with loose truncation. Nguyen et al. (Nguyen et al. 2025) show this cleanly. The “temperature is too hot above 1.0” folklore is partly an artifact of top-p being the wrong companion at high temperature. The next section is where that argument lives.
For a worked first-party temperature sweep on Qwen3-8B at Q4_K_M, see Appendix 3. Two prompts are run at five temperatures each: a constrained summarization prompt where the output barely moves across the full range, and an open-ended creative prompt where the variance the textbook description predicts actually shows up.
9.3 Top-k, top-p, and min-p
Temperature reshapes the ranking. Truncation throws tokens off the end of the ranking before the sampler draws. In the pipeline described here, both happen, in that order, on every single decode step; some serving stacks apply them in a different order, which is worth checking before you compare settings across tools. The three mainstream truncation methods are top-k, top-p (also called nucleus sampling), and min-p. They differ in what they pay attention to, and the difference matters more at high temperature than at low temperature.
Back to the paper summary. The model has just produced its ranked list for the next word of bullet two. Temperature has reshaped the list. Now truncation asks: which tokens are even allowed to be drawn, and which get struck off entirely? The three methods answer that question three different ways.
9.3.1 Top-k
Top-k is the simplest. Keep the k highest-probability tokens, throw the rest away, and sample from those, renormalized. Pick k = 40 and the sampler draws from whichever 40 tokens the model ranked highest that step. Out of 128,000 possible next tokens, 127,960 get struck from the ballot.
Top-k is the easiest truncation to reason about, and its weakness is that k is a fixed cutoff blind to the shape of the ranking. At one position in the summary the model might be highly confident: 90% of the mass sits on the top three tokens, a long thin tail trails off behind them. k = 40 at that position lets in 37 tokens the model considered implausible. At a different position the model is uncertain: mass is spread evenly across dozens of equally good options, none of them dominant. k = 40 at that position might cut off tokens that still carry real probability. One cutoff, two different failure modes.
Top-k still has uses (debugging, enforcing hard upper bounds on diversity, pairing with low temperature for constrained outputs), but it is rarely the right default on its own.
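The fixed-cutoff behavior is easy to see on two small distributions. The sketch below is illustrative: both distributions are invented, and k = 3 stands in for k = 40 so the lists stay short.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be at least 1")
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered

def discarded_mass(probs, k):
    """Probability mass that top-k removes before renormalizing."""
    return 1.0 - sum(sorted(probs, reverse=True)[:k])

sharp = [0.70, 0.15, 0.05, 0.04, 0.03, 0.02, 0.01]  # model is confident
flat = [0.125] * 8                                   # model is uncertain

for name, dist in (("sharp", sharp), ("flat", flat)):
    print(name, "discarded:", round(discarded_mass(dist, 3), 3))
```

The same k removes a thin tail from the sharp distribution and most of the mass from the flat one, which is the blindness to shape that the text describes.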
9.3.2 Top-p (nucleus sampling)
Top-p, introduced by Holtzman et al. in 2020 (Holtzman et al. 2020), adapts to the shape of the ranking. Instead of taking a fixed number of tokens, it takes the smallest set of tokens whose probabilities add up to at least p, then samples from that set renormalized.
Walk through how this behaves on the paper summary. At a word where the model is confident, say choosing the first noun of bullet one and “paper” already has 80% of the mass with “authors” carrying another 15%, top-p with p = 0.95 passes through just those two tokens. Everything else drops off the ballot. At a less committed word, say choosing an adjective mid-sentence where the top token has 20% of the mass and twenty other tokens each carry 3 to 5%, top-p with the same p = 0.95 might keep fifteen or twenty of them. The cutoff adapts to how spread the ranking is. That adaptive behavior is what made top-p the dominant default for open-ended generation from 2020 onward.
Top-p at p = 0.9 or 0.95 with temperature at 0.7 to 1.0 has been the standard “creative generation” setting for five years. It is the default on most chat APIs. It is what the SDK gives you when you do not set anything. For 2020-era models, this was defensible. For 2026 instruction-tuned frontier models, it is starting to creak.
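The adaptive cutoff from the walkthrough above can be sketched as follows. This is a minimal illustration, not the code of any serving library; the two distributions mirror the confident and less committed cases in the text, with the remaining mass filled in by assumption.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        # Small tolerance so rounding error cannot push a boundary case past p.
        if cumulative >= p - 1e-12:
            break
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered

def kept_count(filtered):
    """Number of tokens still eligible to be drawn."""
    return sum(1 for x in filtered if x > 0)

confident = [0.80, 0.15, 0.03, 0.02]   # "paper" 80%, "authors" 15%, assumed tail
spread = [0.20] + [0.04] * 20          # one 20% token, twenty tokens at 4% each

print("confident keeps", kept_count(top_p_filter(confident, 0.95)))
print("spread keeps", kept_count(top_p_filter(spread, 0.95)))
```

With the same p, the nucleus is two tokens wide in one position and twenty wide in the other.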
9.3.3 Min-p
Min-p (Nguyen et al. 2025) is the most recent entry, and in my reading of the 2025 literature it is the right default going forward. It truncates by a different criterion: keep the tokens whose probability is at least p × max_probability, where max_probability is the probability of the top token at that step. With p = 0.05, the sampler keeps every token carrying at least 5% of the top token’s probability.
Notice what this does to different rankings.
- Sharp ranking (model confident, top token has 80% of the mass). Min-p with p = 0.05 keeps tokens with at least 4% probability. Only a few tokens qualify. The truncation is tight.
- Flat ranking (model uncertain, top token has 15% of the mass). Min-p with p = 0.05 keeps tokens with at least 0.75% probability. Many tokens qualify. The truncation is loose.
Min-p scales to the model’s confidence. That is the property top-p misses: a fixed cumulative-probability cutoff is not a stable measure of “the reasonable set” because the shape of “reasonable” changes with context. The Nguyen paper (ICLR 2025 Oral) makes the case empirically, showing min-p matches or beats top-p on quality metrics and, crucially, stays coherent at temperatures where top-p falls apart. This means you can genuinely use temperatures above 1.0 for creative work without the sampler degenerating, because min-p is pruning the truly implausible tokens that higher temperature would otherwise let through.
Min-p is shipping in Hugging Face Transformers, vLLM, and llama.cpp as of 2025. It is not yet the default on OpenAI, Anthropic, or Google APIs as of April 2026, but it is exposed on most open-weight serving stacks.
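The min-p rule described in this section fits in a few lines. The sketch below is a plain-Python illustration under assumed distributions, not the implementation shipped by any of the libraries named above; only the top-token masses (80% and 15%) come from the text.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top probability), renormalized."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def kept_count(filtered):
    """Number of tokens still eligible to be drawn."""
    return sum(1 for x in filtered if x > 0)

# Assumed tails; the threshold is 0.05 * 0.80 = 4% for sharp, 0.05 * 0.15 = 0.75% for flat.
sharp = [0.80, 0.10, 0.05, 0.03, 0.02]
flat = [0.15] + [0.05] * 16 + [0.005] * 10

print("sharp keeps", kept_count(min_p_filter(sharp, 0.05)), "of", len(sharp))
print("flat keeps", kept_count(min_p_filter(flat, 0.05)), "of", len(flat))
```

Because the threshold is tied to the top token, a flatter distribution lowers the bar automatically. In the flat case the ten 0.5% tokens are still pruned, which is the tail that a higher temperature would otherwise promote.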
9.3.4 Historical footnote: Mirostat
Mirostat (Basu et al. 2021) is worth naming briefly for context. It is a 2021 decoding method that targets a specific perplexity [a number that measures how “surprised” the model is by each next token on average; low perplexity means the text is predictable, high means it is varied] rather than a truncation parameter, using feedback control to keep surprise constant across a generation. It has a devoted following in creative-writing circles, particularly for long-form fiction. Most modern serving stacks no longer include it, and min-p has largely absorbed the use case. Worth knowing as an influence on the space, not something I would reach for in a new project.
9.4 Frequency and presence penalties
Two more knobs that show up on OpenAI-derived APIs: frequency penalty and presence penalty. Both modify the logits before sampling to reduce repetition. They target the same failure mode, the model generating the same phrase over and over, which is mostly an artifact of 2020-era greedy-ish decoding and a training distribution that rewards predictable completions.
9.4.1 Mechanics
The definitions, from the OpenAI API reference (OpenAI 2026):
- Frequency penalty. For each token that has already appeared in the output, subtract frequency_penalty × count from its logit, where count is how many times the token has appeared. Scales with usage: a token that appeared once gets a small nudge; a token that appeared ten times gets a big nudge.
- Presence penalty. For each token that has appeared at all in the output (count ≥ 1), subtract presence_penalty from its logit. Binary: appeared or did not, no scaling with frequency.
Both operate on the logits before the softmax, before any temperature division, before any truncation. The effect is to make repeated tokens less likely without eliminating them.
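The two definitions above translate directly into a logit adjustment. This is a minimal sketch of the arithmetic as described here, not vendor code; the function name, the three-token vocabulary, and the penalty values are all illustrative assumptions.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count  # grows with each repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once seen
    return adjusted

# Toy vocabulary of three token ids; token 0 appeared three times, token 1 once.
logits = [1.0, 1.0, 1.0]
history = [0, 0, 0, 1]

adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2)
print([round(x, 2) for x in adjusted])
```

Token 2 never appeared, so its logit is untouched. Token 0 pays the frequency penalty three times plus the presence penalty once.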
9.4.2 When they help
Penalties were built for models and sampling regimes where degenerate repetition was a real and common failure mode. That era is mostly behind us, but not entirely.
- Long-form generation on smaller models. A 7B-or-below model asked to write a thousand-word story can get stuck repeating a phrase every few sentences. A small presence penalty (0.3 to 0.6) reduces this.
- Structured lists and enumerations. When you want the model to produce, say, ten distinct suggestions, a presence penalty helps keep items from echoing each other. A frequency penalty can push harder.
- Specific known-bad tokens. If you see the model reliably overusing a particular word or phrase (brand-specific jargon, say), a penalty on that word is a blunt but workable fix.
9.4.3 When they hurt
Here is the failure mode that keeps showing up in production code, and it is worth walking through end to end because it looks innocent before it breaks.
Imagine your paper-summary pipeline produces JSON instead of bullets. You want the model to emit an object whose keys are field names such as "authors" and "methodology", each wrapped in double quotes.
You turn on frequency_penalty = 0.5 because you read somewhere that penalties reduce repetition. The model now gets penalized every time it emits ", every time it emits {, every time it emits a field name like "authors" or "methodology" that repeats within the same response. By the third or fourth field, the logit for " has been pushed down far enough that the model quietly starts emitting something that is not quite a double quote. Your downstream JSON parser throws. The summary looks fine in a raw-text log. The pipeline fails intermittently, on something like every tenth request.
That is the most common way I have seen penalties quietly break production code. A frequency penalty on a schema [a fixed template the model is supposed to fill in, usually JSON with specific field names] where certain tokens appear many times will actively suppress the tokens you need. The other two failure modes follow a similar shape:
- Penalize natural repetition. Some tokens are supposed to repeat. Common words, structural tokens in code, section headings. Penalizing them produces awkwardly-phrased output that reads “off” without any single obvious problem.
- Fight with the model’s own alignment. Modern preference-aligned models are trained to avoid degenerate repetition already. Stacking a penalty on top tries to fix a problem that has already been fixed, and it usually introduces new problems instead.
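A rough calculation shows how quickly the JSON failure builds up. The sketch below assumes the double quote is its own token and that each "key": "value" pair emits four of them; real tokenizers often merge the quote with neighboring characters, so treat the numbers as an illustration of the trend rather than a measurement.

```python
import math

def quote_logit_drop(fields_done, frequency_penalty, quotes_per_field=4):
    """Total amount subtracted from the quote token's logit so far."""
    return frequency_penalty * quotes_per_field * fields_done

for fields_done in range(1, 5):
    drop = quote_logit_drop(fields_done, frequency_penalty=0.5)
    # Subtracting `drop` from one logit multiplies that token's odds by exp(-drop).
    odds_factor = math.exp(-drop)
    print(f"after {fields_done} fields: logit -{drop:.1f}, odds x{odds_factor:.4f}")
```

Under these assumptions the quote token's odds fall by a factor of several hundred by the third field. Whether that is enough to flip the pick depends on how dominant the quote was to begin with, which fits the intermittent failures described above.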
The Anthropic sampling docs (Anthropic 2026) are more restrained about penalties than OpenAI’s are, which I read as appropriate. They are legacy knobs, and their best-practice defaults on current models are zero.
9.5 When deterministic, when creative
The practical upshot of the whole chapter is a regime-selection question. What sampling configuration do you want for this workload? Treat the following as the starting point and tune from there. The paper-summary task threads through each one, because the regime you pick changes what the summary looks like.
9.5.1 Deterministic regime
Settings: temperature=0, greedy decoding (or tight top-k = 1). No other truncation needed. Penalties at 0.
Workloads: structured outputs, tool calls, classification, code generation with a single correct answer, SQL generation, function-name prediction, extraction tasks, evaluation runs. Anywhere there is a canonical right answer and you want the model to commit.
What to expect: reproducibility at the math level, rerun variance at the serving level (floating-point non-determinism, Section 9.1). Plan your evals around the serving variance, not around the assumption that temperature 0 = bit-identical outputs.
The paper in this regime: the summary comes back the same across reruns, modulo the serving-layer low-bit drift. If you want a reproducible extraction you can diff against last week’s version, this is the regime.
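One way to plan around serving variance is to measure it before trusting a single run. The sketch below assumes a hypothetical generate(prompt) callable that wraps whatever client you use at temperature 0; the stub here only imitates a backend that occasionally returns a different string.

```python
import itertools
from collections import Counter

def rerun_agreement(generate, prompt, runs=10):
    """Call generate repeatedly and report how often the outputs agree."""
    outputs = [generate(prompt) for _ in range(runs)]
    counts = Counter(outputs)
    modal_output, hits = counts.most_common(1)[0]
    return {
        "distinct": len(counts),     # number of different strings seen
        "agreement": hits / runs,    # share of runs matching the most common string
        "modal_output": modal_output,
    }

# Stand-in for a real temperature-0 endpoint (hypothetical outputs).
_fake_outputs = itertools.cycle(["the authors argue", "the authors argue", "the paper argues"])

def fake_generate(prompt):
    return next(_fake_outputs)

report = rerun_agreement(fake_generate, "Summarize the paper in five bullets.", runs=9)
print(report["distinct"], round(report["agreement"], 2))
```

An agreement rate below 1.0 at temperature 0 is the serving-layer drift from Section 9.1. Scoring the eval on the parsed answer rather than on the exact string is usually the more robust design.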
9.5.2 Conversational regime
Settings: temperature=0.5 to 0.7, top-p = 0.9 or min-p = 0.05 (prefer min-p where available), penalties at 0.
Workloads: chat assistants, customer support, Q&A over a knowledge base, general-purpose agents doing ordinary work. The vast majority of production LLM calls.
What to expect: responses that feel natural, modest rerun variance, low risk of degenerate output. This is the boring-default regime, and most operators should stay in it unless they have a specific reason to move.
The paper in this regime: the bullets feel natural and vary a little on rerun. Fine if you are reading the summary once and moving on.
9.5.3 Creative regime
Settings: temperature=0.8 to 1.2, min-p = 0.05 to 0.1 (top-p = 0.95 is workable but fragile at the upper end of this temperature range), penalties at 0.
Workloads: creative writing, brainstorming, ideation, marketing copy variations, anywhere a single “best” answer does not exist and diversity is genuinely useful. The higher end of this range only works well with min-p; if your platform does not expose min-p, cap temperature at around 1.0.
What to expect: substantial output diversity run-to-run, occasionally surprising continuations, some risk of the model going off-prompt on longer generations. The right tradeoff when you want variety and have a human in the loop to filter.
The paper in this regime: richer paraphrases, occasionally surprising framings, the kind of output you would want if you were drafting a blog post from the paper rather than diffing a field against last week.
9.5.4 Multi-sample regime
Settings: temperature=0.7, min-p = 0.05, generate k completions (typically k = 5 to 20), aggregate with a verifier or majority vote.
Workloads: reasoning tasks where correctness can be verified (math with numerical answers, code that can be tested, extraction with schema validation). The canonical technique is self-consistency (Wang et al. 2023): generate multiple chain-of-thought samples, take the majority vote on the final answer.
What to expect: higher accuracy than any single-sample configuration at matched temperature, at k times the inference cost. This is how the research literature turns “sampling diversity” from a bug into a feature. The sampler’s randomness becomes a source of exploration, and the verifier at the end collapses the exploration back to a single answer. Zhang et al. (Zhang et al. 2025) formalize the task-adaptive choice of k and temperature.
The paper in this regime: you generate ten summaries, then ask a second pass (or yourself) to pick the one that covers the paper best. Same paper, same model, ten chances, pick the winner.
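The aggregation step in self-consistency is a plurality vote over final answers. The simulation below is a sketch under a strong assumption: each sample is right 60% of the time and wrong answers are scattered independently. The noisy_solver function is a hypothetical stand-in for one sampled chain-of-thought completion, not a model call.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

def noisy_solver(rng, correct="42", accuracy=0.6):
    """Hypothetical stand-in for one sampled answer at moderate temperature."""
    if rng.random() < accuracy:
        return correct
    return rng.choice(["41", "43", "24"])

rng = random.Random(0)  # seeded only so the sketch is repeatable
trials, k = 200, 15
single_hits = 0
vote_hits = 0
for _ in range(trials):
    samples = [noisy_solver(rng) for _ in range(k)]
    single_hits += samples[0] == "42"
    winner, _share = majority_vote(samples)
    vote_hits += winner == "42"

print("single-sample accuracy:", single_hits / trials)
print("majority-vote accuracy:", vote_hits / trials)
```

The vote helps only because the errors disagree with each other. If the model is wrong in the same way on most samples, the vote confirms the wrong answer, which is why a verifier at the end matters.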
9.5.5 Tool-call regime
Settings: temperature=0, no truncation (irrelevant at temperature 0), penalties at 0.
Workloads: every tool call, every function call, every time the model is emitting a structured request for the application to execute. The vendor recommendation is unanimous across Anthropic, OpenAI, and Google: tool calls at temperature 0. This is correct, both for the usual “single right answer” reason and because tool-call schemas are exactly the case where frequency penalties would destroy structural tokens (see the penalty discussion in Section 9.4).
What to expect: reliable schema compliance, deterministic tool invocation, reruns that reproduce cleanly. Couple this with the reasoning-fields-first schema pattern from Chapter 8 to preserve reasoning quality inside the structured output.
The paper in this regime: if the summary is itself a tool call (the model is emitting the bullets into a structured JSON handed off to the next step in a pipeline), you are here, not in conversational.
The through-line of all five regimes: the model always produces the same ranking. You choose how to draw from it. Determinism for extraction, exploration for generation, sampled-and-verified for reasoning, and a deliberately boring middle for chat. Once that frame is in hand, the rest of the sampler stops being folklore and starts being engineering.
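The five regimes can be kept in one place as a small preset table. This is a sketch, not a vendor-specific configuration: the key names are hypothetical, and where the sections above give a range, the single value here is one pick from inside that range.

```python
# Starting points only; tune per model and per workload.
SAMPLING_PRESETS = {
    "deterministic":  {"temperature": 0.0, "top_k": 1},
    "conversational": {"temperature": 0.7, "min_p": 0.05},
    "creative":       {"temperature": 1.0, "min_p": 0.05},
    "multi_sample":   {"temperature": 0.7, "min_p": 0.05, "num_samples": 10},
    "tool_call":      {"temperature": 0.0},
}

def settings_for(regime):
    """Return a copy of the preset with both penalties pinned to zero."""
    if regime not in SAMPLING_PRESETS:
        raise KeyError(f"unknown regime: {regime}")
    settings = dict(SAMPLING_PRESETS[regime])
    settings.setdefault("frequency_penalty", 0.0)
    settings.setdefault("presence_penalty", 0.0)
    return settings

print(settings_for("conversational"))
```

Keeping the presets in code makes the regime an explicit choice at each call site instead of an SDK default nobody looked at.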
9.6 Two correctives worth keeping
The model produces a ranking; the sampler picks a token; temperature and truncation are the two knobs that steer the pick; penalties are legacy knobs that usually should stay at zero; the right settings depend on what you are trying to accomplish.
Two specific correctives are worth remembering, because they contradict the public prompt-guide folklore most operators absorb first:
- Temperature 0 is not bit-deterministic in production serving. It is math-deterministic, subject to the floating-point physics of GPU kernels. Design your evals around that.
- Lower temperature does not make a single sample more accurate on reasoning. It makes it more reproducible. If you want higher accuracy, pair a moderate temperature with multi-sample inference and a verifier. Self-consistency is cheap compared to any other accuracy lever at this part of the stack.