An LLM reads a list of words, scores every possible next word, picks one, and does it all again. That is the whole mechanism.
Here is the working model in one technical sentence:
An LLM is a function from a sequence of input tokens to a probability distribution over output tokens, parameterized by billions of weights learned by predicting the next token in a corpus of human text.
Every noun in that sentence is load-bearing. Function. Sequence of tokens. Probability distribution. Billions of weights. Next-token prediction. Corpus of human text. Unpack any one of them sloppily and the rest of the book stops working. You end up debugging at the wrong layer, arguing that the model “knows” something it cannot know, or reaching for a tool that does not apply.
3.1 A function from tokens to tokens
Type “The cat sat on the” into an LLM and it finishes the sentence. Maybe “mat.” Maybe “floor.” Maybe “sofa.” Something that has followed those five words in the training data often enough to rank high when the model scores every possibility for what comes next.
That is the entire behavior. Everything else in this book is mechanism around it. To see what the model does, you have to see what it sees.
3.1.1 Tokens: the alphabet the model reads
The model does not see letters. It does not quite see words either. It sees tokens [chunks of text the model has learned to treat as single units during training].
“The cat sat on the” is probably five tokens. One per common word, with the leading space tucked into each one:
["The", " cat", " sat", " on", " the"]
Common words map to single tokens. Rare words get split into pieces. “Antidisestablishmentarianism” might become four or five tokens. “Foo” in a code block might be a single token because foo is a famous placeholder; “Fxxx” in the same context might be four tokens because that string was too rare in the tokenizer’s training text to earn an entry of its own.
The fun consequence: the model genuinely does not know how letters work. When ChatGPT confidently says “strawberry” has two R’s, it is not making an arithmetic mistake. It never saw the letters. It saw the tokens straw and berry, or possibly strawberry as one chunk, and it is guessing at a letter-level question by remembering what it once read about that word. The R’s are not in its input.
Tokens are the model’s alphabet. Every time this book says “the model processes text,” what happens under the hood is that text turns into a list of integer token IDs, and those integers are what the model works on. Modern tokenizers carry vocabularies between 30,000 and 200,000 entries wide, roughly the size of a mid-sized unabridged English dictionary, plus all the sub-word fragments needed to handle the rest. The vocabulary is built once, before training, most commonly by byte-pair encoding (BPE) (Sennrich et al. 2016). BPE starts from raw characters (or raw bytes) and greedily merges whichever pair of adjacent symbols appears together most often, adds the merged pair to the vocabulary as a new symbol, and repeats until the vocabulary hits its target size. Karpathy’s hands-on walkthrough (Karpathy 2024) builds BPE end to end in an afternoon.
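The merge loop is small enough to sketch. What follows is a toy, character-level illustration of the procedure described above, not any production tokenizer. The function name `train_bpe`, the six-word corpus, and the tie-breaking rule (the first pair seen wins) are assumptions made for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, starting from characters)."""
    # Each word is a tuple of symbols; the Counter stores how often the word occurs.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Greedy step: pick the most frequent pair (ties go to the first pair seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one new symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest", "newer", "wider"], num_merges=3)
print(merges)
```

On this corpus the first two merges build "low" out of its letters, which is the pattern the text describes: frequent strings end up as single symbols, rare ones stay in pieces.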
3.1.2 The output: a probability over every token, at every step
When the model finishes “The cat sat on the”, it does not commit to mat. It produces a score for every token in its vocabulary. Those scores, called logits [the raw, un-normalized scores the model assigns to every possible next token], are then passed through a softmax [an operation that turns a list of arbitrary numbers into a list of probabilities that sum to 1]:
mat: 0.40
floor: 0.20
chair: 0.10
sofa: 0.05
windowsill: 0.03
... and thousands more, trailing off toward zero
The output is a distribution over the whole vocabulary, not a single choice. The model “thinks” mat is most likely, is quite open to floor, and is not shocked by windowsill.
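A minimal sketch of the logits-to-probabilities step, using only the standard library. The five-token vocabulary and the logit values are invented for illustration and are not meant to reproduce the numbers above; a real model emits one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a five-token toy vocabulary.
vocab = [" mat", " floor", " chair", " sofa", " windowsill"]
logits = [2.0, 1.3, 0.6, -0.1, -0.6]
probs = softmax(logits)

for token, p in sorted(zip(vocab, probs), key=lambda tp: -tp[1]):
    print(f"{token!r}: {p:.2f}")
```

Every token keeps a non-zero probability. Softmax never rules anything out; it only ranks.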
What you see in a chat interface is one specific token drawn from that distribution by a separate program called the sampler (which gets Chapter 9 to itself). The sampler might pick the most likely token every time. It might roll weighted dice. The key point for this chapter is that the distribution is the model’s actual output, and the sampler is a choice layered on top. Two different samplers can produce wildly different text from the same model.
This is the single most-elided point in public explanations of LLMs. The distribution is the model. Everything above it (greedy decoding, nucleus sampling, temperature, top-k, beam search) is a policy for turning the distribution into text. Policies are cheap to change and can be swapped per request. The distribution is expensive (a full forward pass) and is fixed by the weights. When someone says “the model is creative at temperature 1.2,” they mean the sampler is drawing from the tails of a distribution the model already produced. Holtzman et al.’s paper on why greedy decoding produces degenerate text (Holtzman et al. 2020) is the canonical reference.
Apply this to the forty-page paper. When you paste it in and ask for five bullets, the model does not “decide” its bullets in one shot. At every position, it produces a distribution over its whole vocabulary and the sampler picks one. The fluency of the summary is a property of the distribution; the specific wording is the sampler’s draw. Two runs of the same request can hand you two differently-worded summaries that are both valid samples from nearly the same distribution.
3.1.3 The function
A mathematical function is a rule from inputs to outputs. The LLM is such a rule: a sequence of tokens goes in, a probability distribution over the next token comes out. Fix the input tokens and fix the weights, and the function produces the same output distribution every time. No hidden state. No memory of prior requests. No mood. Monday or Friday: same inputs, same weights, same output.
The variability you see in practice comes from two places, both either upstream or downstream of the function itself:
- Inputs change. A new retrieved document got added to the context. The system prompt got updated. The user phrased a follow-up slightly differently. The conversation history is five turns longer.
- The sampler changes. Temperature went up. A different random seed was used. The sampling policy was switched.
That is the whole story for why a prompt produces different answers on different runs. The function underneath is rigidly deterministic. What changes is what you feed it, or how you pick from what it outputs.
This determinism matters more than it sounds. It makes evals work. It turns “I changed one word in my prompt and now my output is totally different” into a fact about the inputs, not about the model being moody. It makes quantization measurable. And it makes the rest of this book’s approach possible: if you understand what goes into the function and what comes out, you have a complete picture of where behavior lives. There is no unobservable layer in between.
The remaining question is where the function came from. The weights did not fall out of the sky. They were learned by a specific training procedure on a specific kind of data, and that procedure shapes what the function does.
3.2 Where the weights came from
The second phrase from the one-sentence model: parameterized by billions of weights learned by predicting the next token in a corpus of human text.
A quick vocabulary detour:
- A vector is a point or direction in a space with several dimensions. You write it down as a list of numbers giving its coordinates, but the thing itself is geometric.
[0.5, 1.2] is a two-dimensional vector: picture a point on a map, 0.5 units east and 1.2 units north of the origin. [0.1, -0.3, 0.5, 2.7] is a four-dimensional vector. You cannot picture four dimensions, but the math works the same way; it is a specific spot in a four-dimensional space. The reason vectors matter for LLMs is that you can add, subtract, and scale them, and the arithmetic corresponds to real movement around the space. When Chapter 2 noted that king - man + woman ≈ queen, it was doing arithmetic on vectors, and the arithmetic meant something geometric: start at “king,” walk backward along the “maleness” direction, then walk forward along the “femaleness” direction, and you arrive near “queen.” The list of numbers is the description; the space is the thing.
- A matrix is a grid of numbers, rows and columns. If you have used a spreadsheet, you have used a matrix. Its job is to move vectors around: multiply a vector by a matrix and the vector lands somewhere else in the space, stretched or rotated or squeezed, but still a point.
- A neural network is a stack of those matrix moves glued together with simple non-linear kinks. Each layer takes the vectors the previous layer produced and pushes them into new positions. Feed an input vector in, watch it tumble through the stack, and out the other side comes an output vector. The network’s weights are the numbers inside all those matrices. Training is nudging those numbers until the output lands where you want.
- A transformer is the specific kind of neural network all modern LLMs are built on, named after the 2017 paper that introduced it (Vaswani et al. 2017). Its distinguishing move is attention, which we unpack in Section 3.3.
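The vocabulary above fits in a few lines of plain Python. This is a sketch with made-up two-dimensional "embeddings" (the axis meanings, the numbers, and the helper names are all assumptions); real embeddings have thousands of dimensions and no hand-labelled axes.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: each output entry is a dot product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """A simple non-linear kink: negative entries become zero."""
    return [max(0.0, x) for x in vector]

# Vector arithmetic on invented 2-D points (axis 0 ~ royalty, axis 1 ~ gender).
king, man, woman, queen = [0.9, 0.8], [0.1, 0.8], [0.1, -0.8], [0.9, -0.8]
result = add(sub(king, man), woman)          # king - man + woman lands on queen here

# A matrix moves a vector: this one rotates it 90 degrees counter-clockwise.
rotate = [[0.0, -1.0],
          [1.0,  0.0]]
moved = matvec(rotate, [0.5, 1.2])           # the point [0.5, 1.2] goes to [-1.2, 0.5]

# One network layer: a matrix move followed by the non-linearity.
hidden = relu(moved)                         # the negative coordinate is clipped to 0.0
```

A neural network is this last step repeated many times, with learned numbers in place of the hand-written rotation.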
Weights are numbers. A modern frontier model has between a billion and a trillion of them, arranged in the large matrices that make up the transformer. “Parameterized” means the model’s behavior is entirely determined by what those numbers happen to be. Change one weight and you change (slightly) what the model says next.
The weights come from training: adjusting them until the model is good at predicting text. Training happens once, before anyone uses the model, and produces a fixed set of weights that ship with the model. When you hear “Claude Opus 4.7 was released April 16, 2026,” it means a specific set of weights, whose training finished some time before that date, was made available on it. Those weights do not change as users chat with the model.
Modern frontier training happens in three stages, each with different goals and different data. You will not train a frontier model yourself (each run costs tens of millions of dollars and takes months on thousands of specialized chips), but every prompting technique in Part II reaches into one of these three stages. Knowing the shape of each helps you figure out which prior you are pushing against when a prompt does not work.
3.2.1 Pre-training
The first stage is the expensive one. A transformer is initialized with random weights and shown a massive corpus of text, trillions of tokens drawn from the web, books, code, scientific papers, and whatever else the lab can legally and practically assemble. At each position in each document, the model is asked to predict the next token. When it gets the prediction wrong, gradient descent (fed by backpropagation) adjusts every weight a tiny amount to make the correct answer more likely next time. Repeat over trillions of tokens and the model gradually builds up statistical patterns of language, facts, and reasoning-shaped behavior.
Three words get thrown around in descriptions of this process:
- Loss function. A number that measures how wrong the prediction was. For language models, the loss is usually cross-entropy: roughly, “how different is the model’s probability distribution from a distribution that gave all the mass to the correct next token?” Lower loss means better prediction.
- Gradient descent. The algorithm that turns a loss value into weight adjustments. Compute how each weight contributed to the error. Nudge each weight in the direction that would have reduced the error. Repeat.
- Backpropagation. The technique for computing those per-weight contributions efficiently, by sending error signals backward through the model’s layers.
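The three terms can be seen working together on a deliberately tiny problem. This is a sketch, not a training loop anyone would ship: the "model" is three weights used directly as logits for a single context, and the learning rate and step count are arbitrary. For softmax followed by cross-entropy, the gradient with respect to each logit has a known closed form (probability minus one for the correct token, probability otherwise), which stands in for backpropagation here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Loss: negative log of the probability assigned to the correct token."""
    return -math.log(probs[target])

weights = [0.0, 0.0, 0.0]   # toy model: the weights are the logits
target = 2                  # index of the correct next token
learning_rate = 0.5         # arbitrary choice for the example

losses = []
for step in range(20):
    probs = softmax(weights)
    losses.append(cross_entropy(probs, target))
    # Gradient of the loss w.r.t. each logit (what backpropagation would compute).
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Gradient descent: nudge each weight against its gradient.
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss starts at ln 3 (a uniform guess over three tokens) and falls on every step as probability mass shifts to the correct token.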
Those three terms define the entire act of machine learning. When you hear “the model was trained for two months on five trillion tokens,” what happened under the hood is that the model made roughly five trillion next-token predictions, the loss scored each one, and the weights were nudged once per batch of those predictions. A batch typically holds millions of tokens, so the nudges number roughly in the millions rather than the trillions. Five trillion tokens is in the same ballpark as every book in the Library of Congress, read end to end, once through.
Two results from pre-training shape how to think about the resulting model. Kaplan et al. (Kaplan et al. 2020) showed that loss falls as a smooth, predictable function of parameters, data, and compute: the scaling laws. Hoffmann et al.’s Chinchilla paper (Hoffmann et al. 2022) sharpened the picture. For a fixed compute budget there is an optimal ratio of model size to training tokens, and the useful shorthand is “about twenty training tokens per parameter.” Frontier training often exceeds that ratio when more data is available. The operator lesson: raw model size is not a quality signal on its own. A 70B model trained on a good dataset often beats a 400B model trained on a mediocre one.
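The twenty-tokens-per-parameter shorthand is easy to turn into arithmetic. A minimal sketch, assuming the ratio is exactly 20 (it is a rule of thumb) and using three illustrative model sizes:

```python
TOKENS_PER_PARAM = 20  # the "about twenty" shorthand; a rule of thumb, not a law

def chinchilla_tokens(num_params: float) -> float:
    """Rough compute-optimal training-set size for a model with num_params weights."""
    return TOKENS_PER_PARAM * num_params

for params in (7e9, 70e9, 400e9):
    tokens = chinchilla_tokens(params)
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T tokens")
```

By this yardstick a 70B model wants about 1.4 trillion training tokens, and a model trained on several times that has deliberately gone past the compute-optimal point.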
A warning worth applying here: when a lab announces a model “trained on N trillion tokens” as a capability claim, ask who benefits from the number being true, whether the tokens were useful or just available, and whether you have any independent way to check (Sagan 1995). Parameter counts and token counts are the marketing substrate of this industry. They are not capability measurements.
What comes out of pre-training is called a base model. It is good at completing text. Give it “The cat sat on the” and it confidently produces “ mat.” It is not yet good at following instructions. Prompt a raw base model with “Summarize the following article:” and it will often hand you back another plausible instruction, because in the training data, “Summarize the following article:” is more often the start of a list of instructions than the start of an actual summary. Paste the forty-page paper into a raw base model and you might get five bullets. You might also get a continuation of the paper in the same academic register, or a list of further questions about the paper. The base model does not yet know which of those outputs you want.
3.2.2 Supervised fine-tuning
The second stage, supervised fine-tuning (SFT), fixes the instruction-following problem. A team of humans assembles a dataset of instruction/response pairs, tens to hundreds of thousands of examples, each written or heavily edited by people who know what a good response looks like. The model is trained to produce the response whenever it sees the instruction.
SFT is cheaper than pre-training. Weeks rather than months. Thousands of GPU-hours rather than millions. Data quality dominates. You are demonstrating good behavior thousands of times in a concentrated dataset so the model learns to stay on task.
SFT is also the layer where most format behavior comes from: whether the model defaults to markdown, how it structures lists, whether it uses headings, when it drops into code blocks. That behavior was demonstrated into the model, not reasoned out from scratch. When your system prompt says “respond in JSON” and the model sometimes slips back into prose, you are watching SFT priors reassert themselves. Chapter 6 picks this up.
SFT is also why your assistant hands back five bullets when asked to summarize a paper. A base model does not know that bullets are the shape you want. An SFT’d model does, because it saw thousands of (paper, five-bullet-summary) pairs in the curated dataset.
3.2.3 Preference alignment
The third stage is where the picture gets plural. An SFT’d model follows instructions but does not reliably prefer the response a careful human would prefer. Given two plausible continuations, SFT does not teach which one is better. The fix is to train on preferences rather than demonstrations.
The original method is Reinforcement Learning from Human Feedback (RLHF) (Christiano et al. 2017), adapted to language models by the InstructGPT team (Ouyang et al. 2022). It is the reason ChatGPT worked the way it did at launch. The mechanics:
- Collect preference data. Humans see pairs of model responses to the same prompt and pick which one is better. This produces a dataset of “response A is better than response B for prompt X.”
- Train a reward model. A separate neural network learns to take a prompt and a response and output a number predicting how humans would rank that response.
- Fine-tune the language model against the reward model. The language model is trained using a reinforcement learning algorithm called PPO (Proximal Policy Optimization), which nudges the model toward outputs the reward model scores highly without letting it drift too far from the SFT starting point.
The full setup is genuinely complicated. You are training a reward model, stabilizing a reinforcement learning algorithm that likes to collapse, and watching for reward hacking: the model learning to exploit quirks in the reward model rather than improve.
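The reward-model step is usually trained with a pairwise logistic loss: the model should score the preferred response above the rejected one. A sketch under that assumption, with the two rewards passed in as plain numbers (in practice they come from a neural network) and a hypothetical function name:

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); this form avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

print(pairwise_reward_loss(2.0, 0.0))   # preferred response scored higher: small loss
print(pairwise_reward_loss(0.0, 0.0))   # no opinion: loss is ln 2
print(pairwise_reward_loss(0.0, 2.0))   # ranking reversed: large loss
```

The loss only sees the difference between the two scores, which is one reason reward values are meaningful as rankings rather than as absolute numbers.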
Two successors matter. Rafailov et al.’s Direct Preference Optimization (DPO) (Rafailov et al. 2023) skips the explicit reward model and the reinforcement learning loop entirely. The same preference signal gets baked into a single supervised-style objective whose optimum matches that of the KL-constrained RLHF objective. DPO is simpler, trains more stably, and has become the default at many labs for new projects.
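The DPO objective for one preference pair is short enough to write out. This sketch takes the four sequence log-probabilities as plain floats (in practice they are summed token log-probabilities from the policy being trained and from a frozen reference model); the function name, argument names, and the default beta of 0.1 are assumptions for the example.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # How much more (in log-probability) the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # Stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: loss is ln 2.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted mass toward the preferred response: loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

There is no reward network and no sampling loop in sight: the loss is an ordinary differentiable function of log-probabilities, which is where the training stability comes from.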
In parallel, Anthropic’s Constitutional AI (Bai et al. 2022) replaced most human preference labeling with AI feedback against a written “constitution” of principles (“be helpful,” “avoid harm,” “do not deceive”). A strong model ranks pairs of responses against the principles. This scales labeling, makes the alignment criteria auditable, and is how Claude’s preference alignment is produced.
These approaches solve overlapping but not identical problems. RLHF is the most battle-tested method at frontier scale. DPO is simpler, catches up fast, and is the more sensible starting point for a new project unless you have a specific reason (reward-model reuse, multi-objective optimization, online learning) to prefer RL. Constitutional AI is less about the algorithm and more about what the preferences encode. It stacks with either method.
For an extended treatment, Lambert’s Reinforcement Learning from Human Feedback book (Lambert 2026) is the current canonical reference.
Operational summary. By the time a model reaches you via an API, it has been pre-trained to know statistical patterns of text, SFT’d to follow instructions, and preference-aligned to produce responses a labeling process rewarded. Each stage is a lever someone else pulled before you got the model. Your prompting, retrieval, and tool-use decisions reach into those priors. System prompts lean on SFT (Chapter 6). Few-shot examples lean on in-context learning, which emerges primarily from pre-training (Chapter 7). Requests to refuse or abstain lean on preference alignment. Back to the paper summary: pre-training is why the words sound like a summary of a paper. SFT is why you got bullets instead of a paraphrase of the opening paragraph. Preference alignment is why the tone is helpful rather than contemptuous of a reader asking for a shortcut.
3.3 The inference pipeline end-to-end
Now the money section. Walk one prompt through the full pipeline, concretely.
The prompt is five words:
The cat sat on the
Six stages happen between your HTTP request and the first token that comes back. The forty-page paper is the same pipeline with a longer input.
3.3.1 1. Tokenization
The string enters the tokenizer, which runs its merge table over the bytes and produces a sequence of integer IDs. For GPT-style tokenizers, this specific prompt lands as roughly:
Every model family has its own tokenizer and its own IDs. What matters is that after this step, the prompt is no longer text. It is a list of integers, each pointing to a row in a large lookup table. This is where prompt-length accounting happens: “tokens” on a pricing sheet means these integers, not words (Sennrich et al. 2016; Karpathy 2024). The forty-page paper becomes roughly 25,000 integers at this step.
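To make "the prompt is a list of integers" concrete, here is a toy encoder. The vocabulary and its IDs are invented, and greedy longest-match lookup stands in for the real merge procedure, so treat the numbers as placeholders rather than what any actual tokenizer would emit.

```python
# Hypothetical vocabulary: real tokenizers have tens of thousands of entries and different IDs.
VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4,
         " ": 5, "T": 6, "h": 7, "e": 8}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text):
    """Greedy longest-match encoding (a simplification of applying BPE merges)."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

def decode(ids):
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("The cat sat on the")
print(ids, len(ids))
```

The length of `ids` is the number a pricing sheet would bill for, and `decode` shows that the round trip back to text is lossless.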
3.3.2 2. Embedding
Each token ID is looked up in a large matrix called the embedding table, a spreadsheet with one row per token in the vocabulary and a few thousand columns per row. For a model with a 128,000-token vocabulary and a hidden dimension [the size of the vectors the model works with internally] of 8,192, this table is a 128,000 × 8,192 matrix. Roughly a billion numbers on its own.
The lookup for our five tokens produces five vectors, each with 8,192 numbers:
These vectors are the model’s initial view of each token. They are dense (no zeros), learned during pre-training, and tokens with similar meanings end up with vectors close to each other in the 8,192-dimensional space. Positional information (which token came first) is added as well, either as additional learned vectors summed into the embeddings or through rotary position embeddings, which rotate pairs of coordinates in the query and key vectors inside each attention layer by an angle that depends on the token’s position. At this stage each vector represents the model’s opinion about this token on its own, before any cross-token reasoning has happened. The 25,000 tokens of the paper land here as 25,000 points in an 8,192-dimensional space.
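The lookup itself is plain indexing. A sketch with a tiny random table (5 tokens by 4 dimensions, values drawn from a seeded generator as a stand-in for learned weights), plus the size arithmetic for the 128,000 × 8,192 example from the text:

```python
import random

VOCAB_SIZE, HIDDEN = 5, 4            # toy sizes; the example in the text is 128,000 x 8,192
rng = random.Random(0)               # random numbers stand in for learned weights
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
                   for _ in range(VOCAB_SIZE)]

token_ids = [0, 1, 2, 3, 4]          # placeholder IDs for a five-token prompt
x = [embedding_table[t] for t in token_ids]   # one row per token: 5 vectors of length 4

params_in_real_table = 128_000 * 8_192
print(len(x), len(x[0]), f"{params_in_real_table:,}")
```

No arithmetic happens during the lookup; each token ID simply selects its row.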
3.3.3 3. The transformer stack
Now the heart of it. The embedding vectors flow through a stack of N transformer blocks. Frontier models typically have between 32 and 120 of them. Each block has two main components: self-attention and a feed-forward network (FFN), wrapped in residual connections and layer normalization that keep training stable but do not change the story.
Attention is the mechanism that makes this architecture work for language, so it is worth building intuition without the equations (Vaswani et al. 2017).
At each position in the sequence, the model produces three vectors: a query, a key, and a value. Each comes from multiplying the incoming hidden state by a learned projection matrix. Informally:
- The query is what this position is asking about.
- The key is what each other position offers.
- The value is the information each position would contribute if asked.
To compute the output for a given position, the model compares that position’s query against the key of every position up to and including itself, runs the match scores through softmax, and uses the results as weights on those positions’ value vectors. Softmax is the same “turn numbers into a probability distribution” operation we saw earlier.
The consequence: every position’s output is a mixture of information from every earlier position it attended to. By the time our five-token prompt reaches the top of the stack, the representation at the last position (“the”) has pulled in information from “The cat sat on the”. The role of “cat” as subject, “sat” as past-tense verb, “on” as a preposition expecting a location, all of it has been mixed in. The feed-forward network after attention then does per-position non-linear processing on that aggregated representation.
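For readers who prefer code to prose, here is single-head causal attention in plain Python. It is a sketch under simplifying assumptions: the query, key, and value vectors are given directly (the learned projection matrices are skipped), the numbers are made up, and there is one head and no batching. The division by the square root of the vector length follows Vaswani et al.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head causal attention over already-projected Q, K, V (lists of vectors)."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Compare this position's query with the key of every position up to t.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # The output is the weighted mixture of the matching value vectors.
        mixed = [sum(w * values[s][j] for s, w in enumerate(weights))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Made-up 2-D vectors for three positions.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = causal_attention(Q, K, V)
```

The first position can only see itself, so its output is its own value vector. The last position mixes all three, which is the "pulled in information from the whole prompt" behavior described above.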
For the paper summary, the same mechanism is what lets the model pull a sentence from page three and a sentence from page thirty-nine into the same final hidden state. When Chapter 2 described eyes flicking across the paper as the intuition for attention, this is the mechanism underneath. There is no separate “look back at the paper” module. Attention is the model looking back.
Modern models enrich this basic picture in ways that matter for Part V but not yet here. Grouped-query attention trims the key and value projections to save memory bandwidth. Multi-head latent attention compresses the key and value cache into a lower-dimensional latent space. Mixture-of-experts replaces the feed-forward network with a routed set of smaller expert networks, only a few of which activate per token. All keep the same interface: a hidden state goes in, a hidden state comes out, attention mixes across positions.
3.3.4 4. Final projection to logits
After the last block, the final hidden state at the last position (a vector of length 8,192) gets multiplied by an output matrix of shape [8192, 128000]. In many models this matrix is tied to the embedding table (same weights, transposed), which saves about a billion parameters at this size. The output is a vector with one real number per vocabulary entry, one score per possible next token:
After our prompt, the distribution concentrates on “ mat” and its neighbors (“ floor”, “ windowsill”, “ bed”, “ roof”). The model’s entire opinion about what comes next is this vector. No single token has been chosen yet.
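When the output matrix is tied to the embedding table, the projection amounts to scoring the final hidden state against every token's embedding row. A toy sketch with a three-token vocabulary and hidden size 2 (all numbers invented, tying assumed):

```python
def project_to_logits(hidden, embedding_table):
    """Tied output projection: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in embedding_table]

embedding_table = [[1.0, 0.0],   # token 0
                   [0.0, 1.0],   # token 1
                   [1.0, 1.0]]   # token 2
final_hidden = [0.5, 2.0]        # hidden state at the last position

logits = project_to_logits(final_hidden, embedding_table)
print(logits)                    # one raw score per vocabulary entry
```

The result has exactly as many entries as the vocabulary, whatever the hidden size, and it is still raw scores; softmax and sampling come next.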
3.3.5 5. Sampling
The distribution becomes a token via a sampler. The sampler is a policy, not part of the model. Common choices:
- Greedy decoding. Pick the highest-probability token. Deterministic, reliable for tasks with a single correct answer, prone to degenerate loops on open-ended generation (Holtzman et al. 2020).
- Temperature scaling. Divide the logits by T before softmax. T below 1 sharpens the distribution; T above 1 flattens it. T = 0 cannot be computed literally (it would divide by zero), so implementations treat it as greedy decoding, which is the limit as T approaches 0.
- Top-k. Truncate to the k most probable tokens, then sample from those.
- Top-p (nucleus sampling). Truncate to the smallest set of tokens whose cumulative probability exceeds p, then sample from that set (Holtzman et al. 2020). The canonical default for creative generation.
One token comes out.
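The four policies above compose into one small function. This is a sketch, not any serving stack's sampler: the function name `sample_next`, its parameters, the treatment of temperature 0 as greedy, and the "stop once cumulative probability reaches p" cutoff are choices made for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy: by convention, temperature 0 means "take the argmax".
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / temperature for x in logits])
    order = sorted(range(len(probs)), key=lambda i: -probs[i])  # most probable first
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

logits = [2.0, 1.3, 0.6, -0.1, -0.6]           # hypothetical scores
print(sample_next(logits, temperature=0))       # greedy
print(sample_next(logits, temperature=1.2, top_p=0.9, rng=random.Random(7)))
```

Passing a seeded `random.Random` makes the draw repeatable, which is the property evaluation harnesses rely on.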
3.3.6 6. The autoregressive loop
That token gets appended to the input sequence. The whole process runs again, now on six tokens instead of five, to produce the next one. And again, and again, until a stop token is generated or a maximum length is reached.
In Python-flavored pseudocode, the whole pipeline fits in about fifteen lines:
    def generate(prompt: str, max_new_tokens: int = 50, temperature: float = 1.0) -> str:
        token_ids = tokenizer.encode(prompt)            # 1. tokenize
        for _ in range(max_new_tokens):
            x = embedding_table[token_ids]              # 2. embed
            for block in transformer_blocks:            # 3. transformer stack
                x = block(x)                            #    (attention + FFN)
            logits = x[-1] @ output_weight.T            # 4. project last position
            probs = softmax(logits / temperature)       #    scores -> probabilities
            next_id = sample(probs)                     # 5. sample
            if next_id == STOP_TOKEN:
                break
            token_ids.append(next_id)                   # 6. loop
        return tokenizer.decode(token_ids)

Serving systems do not re-run steps 2 and 3 for the whole sequence on every iteration. The intermediate key and value vectors from every attention layer are cached across steps in the KV cache; only the newest token’s contribution is computed fresh. PagedAttention (Kwon et al. 2023) is the current state of the art for managing that cache in multi-tenant serving (Section 4.2). The semantics of the loop are exactly what the pseudocode shows: embed, stack, project, sample, append, repeat.
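The caching idea can be shown for one attention layer and one head. In this sketch the per-token query, key, and value vectors are made-up numbers handed in directly, and `KVCache` is a hypothetical class name; real systems cache per layer and per head, and manage the memory far more carefully.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

class KVCache:
    """Keys and values for one attention layer, kept across decode steps."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the newest token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Made-up (query, key, value) triples for a three-token sequence.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
          ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
          ([1.0, 1.0], [1.0, 1.0], [0.5, 0.5])]

cache = KVCache()
for q, k, v in tokens:
    cached_out = cache.step(q, k, v)

# Recomputing everything from scratch for the last position gives the same answer.
full_out = attend(tokens[-1][0], [t[1] for t in tokens], [t[2] for t in tokens])
```

The cached path does the same arithmetic for the newest position while skipping the work of rebuilding keys and values for every earlier token, which is why the cache changes cost but not output.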
Two consequences worth naming. Generation latency is typically dominated by the decode loop (steps 2 through 6, once per output token), not by the prefill (the first parallel pass over the prompt). Prefill is compute-bound; decode is memory-bandwidth-bound (Section 4.3). And the sampler is the only place where randomness enters by design. Fix the seed, fix the prompt, and everything is reproducible, up to the small floating-point differences that batched GPU kernels can introduce in production serving. That reproducibility is what makes evals possible (Chapter 16).
3.4 The context window
The last phrase of the one-sentence model is sequence of input tokens. How long can the sequence be?
The context window is the maximum number of tokens the model will accept in a single forward pass. It is fixed at training time. Frontier models in 2026 advertise windows of 200,000 tokens (roughly one novel, about the length of Crime and Punishment), 1,000,000 tokens (a modest bookshelf), and occasionally larger. Smaller open models typically live between 4,000 and 128,000, a long email to a short novella.
Everything you want the model to consider on a given call must fit in that window. System prompt, conversation history, retrieved documents, tool schemas, tool results, the current user turn, all of it, in tokens. Nothing outside the window enters the forward pass. Once the context for a request approaches the limit, say 90% of the window, each new turn soon has to displace something older, which is the source of “the model forgot” complaints on long sessions.
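Displacement is an application policy, and the simplest one drops the oldest turns first. A sketch of that policy, assuming token counts are already known for each turn; the function name, the output reserve, and the 8,000-token window are illustrative choices, not any vendor's behavior.

```python
def fit_to_window(system_tokens, turns, window, reserve_for_output):
    """Keep the newest turns that fit. `turns` is a list of (name, token_count), oldest first."""
    budget = window - system_tokens - reserve_for_output
    if budget < 0:
        raise ValueError("system prompt and output reserve alone exceed the window")
    kept, used = [], 0
    for name, n_tokens in reversed(turns):      # walk from newest to oldest
        if used + n_tokens > budget:
            break                               # this turn and everything older is dropped
        kept.append((name, n_tokens))
        used += n_tokens
    kept.reverse()                              # restore chronological order
    return kept, used

turns = [("turn 1", 3000), ("turn 2", 2500), ("turn 3", 1500), ("turn 4", 500)]
kept, used = fit_to_window(system_tokens=1000, turns=turns,
                           window=8000, reserve_for_output=1000)
print(kept, used)
```

Here the first turn no longer fits and silently disappears from what the model sees. From the user's side, that is the model "forgetting".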
Here is the uncomfortable part. Advertised context length is not the same as usable context length. Liu et al.’s Lost in the Middle (Liu et al. 2024) measured that retrieval accuracy drops sharply when the relevant information sits in the middle of a long context. Models reliably use the beginning. They reliably use the end. They unreliably use the middle. A model with a 1,000,000-token window may usefully retrieve from well under 100,000 tokens for any given question. We spend a full chapter on this in Part IV (Chapter 15).
Apply the baloney-detection kit here too (Sagan 1995). When a vendor announces a “1,000,000-token context window,” the claim is reproducible only in the narrow sense that the model accepts that many tokens without crashing. Whether the model uses that many tokens effectively is a different claim, rarely tested by the vendor, and often quietly failing in the middle. Context-window numbers on vendor pages are throughput limits, not quality guarantees.
There is also a systems reason context windows exist at all. Naive attention as specified in Vaswani et al. (Vaswani et al. 2017) is quadratic in sequence length. For a sequence of length n, each layer does n² work. That was tractable at n = 512 and painful at n = 32,000. FlashAttention (Dao et al. 2022) restructured the computation to be IO-aware: keep intermediate values in fast on-chip memory instead of constantly round-tripping to slower main memory. Exactly the same result, 2 to 4 times less memory traffic. FlashAttention-2 (Dao 2023) pushed further. Without this line of systems research, the 1,000,000-token windows you see today would not exist.
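The quadratic growth is worth seeing as numbers. This sketch counts query-key comparisons per attention layer as n × n, ignoring constant factors, head counts, and the roughly one-half saving from causal masking; the three lengths are the ones mentioned in the text.

```python
def attention_pairs(n: int) -> int:
    """Query-key comparisons in one naive attention layer for a sequence of length n."""
    return n * n

base = attention_pairs(512)
for n in (512, 32_000, 1_000_000):
    pairs = attention_pairs(n)
    print(f"n = {n:>9,}: {pairs:>22,} pairs ({pairs / base:,.0f}x the n = 512 cost)")
```

Going from 512 to 32,000 tokens multiplies the sequence length by about 62 and the attention work by about 3,900, which is why long contexts needed systems work and not just bigger machines.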
Treat the context window as the working memory the model has for this one request, and treat the advertised length as an upper bound rather than a usable budget. Measure effective context on your own workload, because Lost in the Middle is a property of the architecture still showing up in 2026 evaluations.
3.5 Why these pieces matter for operators
Tokenization explains why models miscount letters. When a model confidently answers that “strawberry” has two R’s, it has not made an arithmetic error. It was asked a question about characters by a system whose entire input substrate is tokens. The information needed to answer is not in its input. Chapter 7 shows how to structure tasks so token-level failures do not masquerade as reasoning failures. A one-character reformulation, or asking the model to spell the word first, recovers most of the accuracy.
Attention and context explain why things get lost. Lost in the Middle (Liu et al. 2024) and its cousins (attention dilution, position-bias in retrieved chunks, noisy-context degradation on RAG) all come from how attention mixes information across positions. Once you have seen the mechanism, the failure modes in Chapter 15 are predictable and the mitigations (reranking, windowed retrieval, placing critical content at the ends of the context) become obvious.
Sampling explains why temperature 0.7 vs 1.2 changes output. The sampler operates on the distribution the model produced. Low temperature collapses it onto the mode; high temperature spreads it across the tails. The right choice is workload-dependent: determinism for structured extraction, exploration for creative generation, something between for conversational assistants. Chapter 9 gives measured defaults. The model underneath is always outputting the same distribution. You are choosing how to draw from it.
Training stages explain why prompts land differently. System prompts work because SFT conditioned the model on instruction-following. They work less well when they push against preference-alignment behavior the model was trained to hold. Few-shot examples work because in-context learning is a pre-training emergent property. When a prompt is not working, the diagnostic question is which stage’s prior you are fighting. Chapter 6 and Chapter 7 are built around that question.
Statelessness explains why your “chatbot” is an application, not a model. Section 1.1 argued this from the outside. This chapter showed it from the inside. There is no state between requests because the pipeline takes a token sequence in and returns a distribution. Everything that looks like memory (conversation history, retrieved documents, tool results) is the application staging context into the next forward pass. That is the whole design space for memory systems (Chapter 14), retrieval (Chapter 10), and tool use (Chapter 12).
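A few lines of application code make the point. In this sketch `fake_model` is a stand-in for the whole pipeline (it just reports how many whitespace-separated words it was given), and `ChatApp` is a hypothetical wrapper; what matters is where the conversation history lives.

```python
def fake_model(prompt: str) -> str:
    """Stand-in for forward pass plus sampling: a pure function of its input."""
    return f"(reply to {len(prompt.split())} tokens)"

class ChatApp:
    """The application, not the model, holds the conversation."""
    def __init__(self, system_prompt: str):
        self.history = [("system", system_prompt)]

    def send(self, user_text: str) -> str:
        self.history.append(("user", user_text))
        # Every call re-sends the entire history as one input sequence.
        prompt = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = fake_model(prompt)
        self.history.append(("assistant", reply))
        return reply

app = ChatApp("You are terse.")
first = app.send("Hi")
second = app.send("And again?")
print(first)
print(second)
```

The second call sees more tokens than the first only because the application put them there. Delete `self.history` and the "memory" is gone, with no change to the model.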
Take the running example from Section 1.1. You paste a forty-page research paper into Claude and ask for five bullets. Every piece of this chapter is in that one call. Tokenization turns the forty pages into roughly 25,000 integers. The embedding table lands them as points in an 8,192-dimensional space. The transformer stack mixes those points across positions, sixty or so layers deep, so that by the top of the stack the hidden state at the last position has attended to sentences from pages three, seventeen, and thirty-nine at once. The final projection scores all 128,000 vocabulary tokens. The sampler picks one. It gets appended. The loop runs again on 25,001 tokens. Five bullets later, a stop token comes out. Pre-training is why the words sound like a paper summary. SFT is why you got bullets. Preference alignment is why the tone is helpful. Every knob you will touch in the rest of the book (prompt wording, system prompt, tool calls, retrieval, sampler settings, model choice, quantization) shapes one specific stage of that pipeline.