Part IV — The Crucible

Failure Modes

Something went wrong with your LLM application and you need to know what. That is the problem this chapter solves.

Two earlier chapters planted flags that come due here. Section 1.3 warned that the model’s reasoning-shaped behavior gets less reliable as problems stack steps on top of each other. Section 3.4 warned that the context window [how much prior text the model can see at once] printed on the marketing page is not the same as the context the model can reliably use. Both bills arrive in this chapter. Part II taught you how to write prompts. Part III taught you how to wire the model into retrieval, tools, and memory. This chapter is where those systems go wrong, and where you have to recognize the shape of the break fast enough to do something about it.

Hold the running example in your head as you read. You pasted a forty-page research paper into Claude, ChatGPT, or Gemini and asked for five bullet points. Every failure we are about to name has a specific shape it takes on that one task. The summary reads fluent and cites page 32 confidently, but page 32 does not exist (fabrication). It skips the middle section of the paper where the actual result sits, even though the middle section was in the file you uploaded (attention dilution). You ask “isn’t the methodology a bit weak?” and the model finds reasons the methodology is weak, even though the methodology is fine (sycophancy). You chat with it for an hour, one follow-up question after another, and by turn thirty the summary has wandered off the paper you uploaded and onto adjacent papers the model is half-remembering (state drift). One task. Six different ways it breaks. Six different fixes. A technician who calls them all “the model got worse” will reach for the wrong fix every time.

The job here is to teach you to name what is happening. “The model got worse” is not a diagnosis. “We are seeing intrinsic hallucination on retrieved long context with sycophantic override from the user’s phrasing” is. The first sentence gets you a shrug. The second sentence gets you a fix.

A note on vocabulary before we start. Different research groups use different words for the same bug, and the mismatch matters because different words push you toward different fixes. Lost in the middle, context rot, attention dilution, and positional bias all name adjacent aspects of the long-context failure, but they point at different causes and invite different measurements. Hallucination, confabulation, fabrication, and extrinsic error all name the model making something up, but the literature is split on whether to emphasize the training incentive that produced it or the shape of the output it produced. Mode collapse means one thing to a sampling researcher and something subtly different to an RLHF [reinforcement learning from human feedback, the training step that teaches a base model to answer like a helpful assistant] researcher. Where two frames teach different things about the same bug, we keep both and flag the disagreement inline.

15.1 The shape of failure

Start with a simple claim. “The model got worse” is underspecified in at least six distinct ways.

When a production application starts returning output that a user or a ticket flags as bad, the bad output falls into one of roughly six families. Each has its own trigger, its own tell, and its own fix. Before anything else, the job is to tell them apart.

  1. Loss of instruction-following. The model ignores or partially ignores a system prompt or a format rule it was honoring a few turns ago. The paper summary came back as bullet points yesterday and came back as a single paragraph today, and nobody changed the prompt. The JSON schema you asked for comes back with a trailing comma or a missing quote. Tool-use instructions get violated.
  2. Degraded in-context retrieval. The model fails to use information that is sitting right there in the context window. The paper is in the file the model received. The model’s answer acts as if the paper is not in the file. Facts are present. The response does not reach them.
  3. Repetition and degeneration. The output enters a loop, repeats the same phrase back to back, or collapses onto a formulaic template. The model keeps producing text. The text stops carrying new information. “In summary, the paper discusses X. In conclusion, the paper discusses X. Overall, the paper discusses X.”
  4. Fabrication (hallucination). The output contains plausible-looking content that cannot be grounded in the paper or in verifiable reality. A citation is invented. A date is wrong. A page number does not exist. A library function the model wants to import never existed in that library.
  5. Sycophancy and mode collapse. The model agrees with whatever you implied, drops its disagreement, or narrows its output toward a single flattened voice. You ask “is this paper weak?” and the model finds reasons it is weak. You ask “is this paper strong?” and the model finds reasons it is strong. The output is fluent and wrong in a way specifically designed to make you nod.
  6. State drift and goal drift. Over a long session, the goal the model is pursuing wanders from the goal you set. The paper-summary task stated in turn 1 is not the paper-summary task the model is working on at turn 30, even though you never authorized the change.

Each of these has a distinct mechanism, a distinct tell, and a distinct fix. Treating them as one thing, a vague “quality regression,” is the single most expensive mistake in this chapter. You raise the model tier to fix hallucination when the real bug was lost-in-the-middle, and your cost goes up but the bug stays. You tune the temperature to fix sycophancy when sycophancy lives in the training, not the sampler, and nothing changes. You blame the vendor when your own context grew past the point your prompt was designed for.

Three research traditions have come at these failures from three angles. The three are not competing. They answer different questions, at different resolutions, and you reach for the one that matches the question you are asking.

  • The structural frame asks why a failure is incentivized in the first place. Kalai and colleagues’ 2025 paper Why Language Models Hallucinate (Kalai et al. 2025) is the clearest example. The argument: if you train and evaluate a model in a way that rewards confident answers and penalizes “I don’t know,” the model will learn to produce confident answers even when it should not. Hallucination is what comes out the other side. The structural frame tells you why the category of failure exists at all. It is the frame you want when a new model lands and you are sizing up which failures to expect from it.
  • The behavioral frame asks what a failure looks like once it happens. Meta’s HalluLens benchmark (Ravichander et al. 2025) is the clearest example. It splits hallucinations into extrinsic (no grounding in anything real) and intrinsic (contradicts the context the model was given), because those two shapes need different fixes. The behavioral frame gives you something to measure, classify, and triage. It is the frame you want when a customer sends a screenshot and you need to work out what you are looking at.
  • The mechanistic frame asks where the failure lives inside the model. Anthropic’s Scaling Monosemanticity work (Templeton et al. 2024) showed that specific concepts have identifiable locations in the model’s internal structure, as we saw in Section 1.4. Some failures can in principle be traced to specific circuits. The mechanistic frame is the one interpretability researchers use, and most of it is not operator-actionable today. It is the frame the next generation of alignment tools is being built on top of, so it is worth knowing the frame exists even if you are not using it at the keyboard.

The three frames are not rivals. A single real bug is visible in all three at different depths. What matters is that you notice which frame you are reasoning in when you debug, so you stop reaching for a mitigation the frame does not support. Structural answers tell you what to expect. Behavioral answers tell you what you are seeing. Mechanistic answers tell you, eventually, where it lives.

The single most useful debugging habit in this chapter: before you touch anything, ask which of the six families you are looking at. Every later mitigation in this book assumes you answered that question first. A misfiled bug gets worse with intervention, not better.

One more note on posture. This book is not alarmist about LLM failure modes. The model does things, at today’s scale those things are useful, and shipping a production application that pretends the failure modes do not exist is a worse posture than shipping one that expects them. Treat the rest of this chapter the way a pilot treats a stall-recovery checklist. Neutral. Practiced. Reached for fast when the indicators match.

15.2 Attention dilution: lost in the middle and its descendants

Here is an experiment you can run in an afternoon. Take the forty-page paper from our running example. Plant one specific sentence somewhere in it, something like “the main methodological weakness of this study is its sample size of n=14,” and paste the whole paper into the model. Ask the model one question: “what does the paper say is its main methodological weakness?”

Run the experiment three times, moving the planted sentence each time. Put it near the beginning on run one. Put it at the end on run two. Put it in the exact middle on run three. You will get three different answers. Run one: the model nails it. Run two: the model nails it. Run three: the model misses, invents a different weakness, or says the paper does not discuss weaknesses. Same paper. Same question. Same model. Three different answers, determined entirely by where in the document the answer happened to sit.

Liu and colleagues (Liu et al. 2024) ran that experiment at scale in 2024 and published the result. Accuracy is high when the answer is at the start. Accuracy is high when the answer is at the end. Accuracy collapses when the answer is in the middle. Plot accuracy across positions and the shape is a clean U.

This is not a 2023 artifact that got cleaned up. It is still showing up in 2026 evaluations on frontier models with million-token advertised context. The gap is large enough that “effective context length” is now reported as a separate number from “advertised context length.” A model whose marketing page claims 1,000,000 tokens (roughly the text of five full-length novels back to back) often performs at short-context quality on well under 100,000 (one novel, or the forty-page paper from our running example plus thirty pages of retrieved references).

The million-token claim is worth stopping on, because it is the first of several numbers in this chapter that sound impressive and turn out to have a narrower meaning than the marketing sentence suggests. Borrow the checklist Sagan gave for any contested claim in The Demon-Haunted World: reproducibility, simpler explanations, who benefits. Run it.

Is the one-million-token number independently reproduced, at quality, by anyone not selling the model? No. Vendor blog posts test retrieval of one single fact out of a long document, which is the easy version. Production asks the model to reason over what it retrieved, which is a different and much harder task. The reasoning-at-depth number is almost never what the marketing page cites.

Is there a simpler explanation for the advertised number? Yes. The number describes how many tokens the model’s software will accept without crashing. It does not describe how many tokens the model will reliably use. Those are two different measurements. Only one of them goes on the marketing page.

Who benefits from the big number being taken at face value? The vendor, in the weeks before a competitor ships a larger number. Journalists who write “Model X now supports one million tokens.” Investors reading those articles. Nobody who has to ship something against the number.

The checks are not a dismissal. The million-token engineering is real and hard. The claim is just narrower than the marketing sentence suggests. Keep the checklist handy, because you will need it again before the chapter ends.

The 2024 to 2025 literature sharpened the picture in three ways.

Found in the Middle (Hsieh et al. 2024) traced the U-shape to a specific cause: rotary position embeddings [the mechanism modern transformers use to encode how far apart two tokens are, by rotating each token’s internal representation by an angle proportional to its position]. Rotary embeddings impose a learned decay on attention as positions get farther apart. Middle positions are, on average, farthest from both the question (usually at the end) and the instruction context (usually at the start). The decay is learned, not inherent, but it shows up reliably across the model families that use rotary embeddings.

NeedleBench (Li et al. 2024) made a distinction most long-context benchmarks had quietly skipped. Retrieving a fact from deep in a long context is one failure. Reasoning with a deep fact is a second, strictly harder failure. Most benchmarks test only the first. Production asks for the second. NeedleBench found that multi-hop reasoning at depth degrades sharply before single-hop retrieval does. In other words, the model can find the fact and still fail to use it.

NoLiMa (Modarressi et al. 2025) ran the meanest variant of the attack. What if the question does not share any literal words with the answer? What if the model has to retrieve by meaning alone, without being able to cheat on string matches? The paper found that 11 of 12 frontier models dropped below 50% of their short-context baseline at 32,000 tokens (roughly fifty pages, a little longer than the paper in our running example) once lexical overlap was suppressed. The usable context narrows much further when the model cannot cheat.

A second thread runs alongside the positional work: attention sinks. Xiao and colleagues (Xiao et al. 2024) noticed that transformer models dump a disproportionate amount of attention onto the first few tokens of any sequence, no matter what is in them. The behavior looks like a training-time artifact. The softmax [a mathematical step that turns a list of raw scores into a proper probability distribution that sums to 1] has to sum to 1 on every step, and the model learned to park leftover probability mass on stable early positions. For short sequences this is harmless. For long sequences, a meaningful chunk of the attention budget at every layer is already spoken for before the model even considers middle-of-context content.

Put the four findings together and the picture is not “a single flaw in long-context models.” It is “the usable attention budget on middle-of-context tokens is structurally small, and gets smaller as context length grows.” The middle of your forty-page paper is the part of your forty-page paper the model is structurally least good at using.

The vocabulary here is one of the messiest in Part IV. Lost in the middle is the pop-science name that stuck. The academic literature prefers positional bias (when the writer wants to emphasize the encoding cause) or context rot (when the writer wants to emphasize degradation over length). Anthropic’s own long-context documentation (Anthropic 2026) uses effective context for the usable subset of the window. We use attention dilution as our umbrella term in this chapter, with lost in the middle kept for the specific U-shaped finding. The mechanism is the same under all the names.

The operator’s reframe is unsentimental. Treat advertised context length as an upper bound, not a budget. Measure your effective context for your workload, because no amount of marketing-page tokens tells you what your system will reliably retrieve. In practice:

  • Put the most important material (the user question, the primary instruction, the critical retrieved chunk) at the start or the end of the context, not the middle.
  • Cap context length below the halfway point of the advertised window when quality matters. Above that, assume you are paying for length in quality.
  • Run your own needle test on your own tokenizer and model family. It is cheap, it takes an afternoon, and it tells you exactly where your cliff sits.

Past a workload-specific threshold, more context makes quality worse, not better. Every production bug that looks like “the model forgot what we told it” should be checked against this first. The retrieved document is in the context and the model is still acting as if it is not. That is the signature.

RAG systems (Chapter 10) are especially exposed, because their whole job is to pack context with retrieved material. The chunking and reranking strategies in Chapter 11 exist in part to keep the most-relevant chunks in the positions the model uses. When a RAG pipeline’s retrieval metrics look good in isolation (the right chunk is in the top-k the retriever returned) but end-to-end answer quality is poor, attention dilution is the first thing to suspect. On the paper-summary task: the retriever can correctly locate the methods section and the model can still write a summary that does not reflect the methods section. The retriever found it. The model did not use it. Those are two separate measurements, and only the second is the one your user sees.

15.3 Repetition and decoding degeneration

A different family of failure, older and better-understood. The model enters a loop.

The obvious version is easy to spot: “the cat sat on the mat. the cat sat on the mat. the cat sat on the mat” for as many tokens as you let it run. The subtler version is the one you will hit in production. The paper summary comes back reading like this: “The paper presents a novel framework. The authors present a novel framework. Overall, the framework presented by the authors is novel.” No repeated strings. The same idea said three times in a row. Different words. No new information. The model is stuck on one thought and cannot get off it.

Holtzman and colleagues’ Curious Case of Neural Text Degeneration (Holtzman et al. 2020) diagnosed this in 2020. The mechanism is not in the model’s knowledge. It is in the sampler [the algorithm that picks the next token from the model’s probability distribution]. Maximum-likelihood decoding (greedy search, or beam search) always leans toward the highest-probability next token, and the highest-probability next token, once the model has just said the cat sat on the mat, is often another round of the cat sat on the mat. The conditional probability of the same string repeating is elevated the moment the string has appeared once. Greedy decoding rides that elevation into a loop.

Their fix was nucleus sampling (also called top-p sampling): truncate the probability distribution to the smallest set of tokens whose probabilities sum to p, then sample randomly from that set. It displaced beam search as the default for open-ended generation and has held the default seat since. We cover it properly in Chapter 9.

The mistake is treating the 2020 paper as having solved the problem. Finlayson and colleagues’ Closing the Curious Case (Finlayson et al. 2024) showed in 2024 that the deeper cause is a softmax bottleneck. The model’s output layer projects through a fixed-rank matrix, which means the model cannot produce every possible probability distribution it might need. Certain degenerate distributions are effectively pinned into the repetition-friendly region of output space. The bottleneck is there at small scale. It is still there at frontier scale. Repetition is not a problem small models have and big models fixed.

In practice, repetition loops show up on frontier models in three specific conditions.

  • High-determinism workloads. Setting temperature [the knob that controls how much randomness the sampler adds; near 0 is deterministic, near 1 is creative] near zero, with no nucleus truncation, pushes the sampler straight at the mode of the distribution, which is exactly where the repetition attractor lives. The paper summary run at temperature 0 is at higher repetition risk than the same summary run at temperature 0.7.
  • Structured-output workloads. JSON and table-generation prompts are especially prone. The structural template itself has repeating fragments built into it: closing braces, comma-separated values, row prefixes. The model sees its own recent closing brace, decides another closing brace is highly likely, and starts emitting braces forever.
  • Long-generation workloads. Once the model has written a few paragraphs, its own earlier text becomes a heavily-weighted prior on what to produce next. The longer the output, the more the model is pulled toward what it has already said. Ask for a one-sentence summary and the risk is low. Ask for a three-page narrative synthesis and the risk rises.

Welleck and colleagues’ Unlikelihood Training (Welleck et al. 2020) tackled the same problem from the training side: penalize the model during training for repeating itself, so it learns not to. Some open-weights models ship with this baked in. Operators typically cannot change it from the API.

The operator-side diagnosis is clean. Repetition loops are a sampler failure, not a knowledge failure. Before you change the prompt, before you escalate to a stronger model, before you email the vendor, check the sampler. Raise the temperature slightly. Add a repetition penalty if the API exposes one. Tighten stop sequences. Cap max-tokens so a loop cannot run unbounded. These are cheap experiments. They isolate the cause in ten minutes.

The subtler semantic-repetition variant (the “the paper presents a novel framework” example above) is harder. Nucleus sampling helps. It does not eliminate the issue. The reliable fix is explicit prompt guidance: tell the model its response is finished once it has said the thing once, or use few-shot examples (Chapter 7) that demonstrate the brevity you want. Mode collapse into a narrow voice is the related, deeper phenomenon, and we pick it up in Section 15.5.

Repetition is a sampler problem first, a prompt problem second, and a model problem almost never. The diagnostic order matters, because the cheapest fix lives at the top of the list.

15.4 Hallucination onset and fabrication

The structural groundwork is in Section 1.4. Hallucination is not one phenomenon. Kalai and colleagues (Kalai et al. 2025) say it happens because the training objective rewards confident guessing and penalizes abstention. HalluLens (Ravichander et al. 2025) says it shows up in two empirically distinct shapes, extrinsic (grounded in nothing) and intrinsic (contradicts the context you gave it), and the two shapes want different fixes.

The operator-facing questions are when it starts and what to do when you see it. The answer is different for the two shapes, so keep the two apart.

Extrinsic hallucination (made-up content grounded in nothing) rises with two factors. The first is knowledge density. Ask the model to list publications, cite case law, produce chemical structures, enumerate version numbers, and the fabrication rate climbs with specificity. The weights are a lossy retrieval system; specific factual questions push the retrieval past what the weights can reliably support. The second is abstention pressure. If the prompt, the system prompt, or the fine-tuned persona discourages “I don’t know,” extrinsic hallucination rises. Kalai’s argument lands here in production. The model is doing exactly what it was trained to do. The training did not teach it that refusal is acceptable.

Here is what that looks like on the paper-summary task. You ask the model: “list the three prior papers this paper builds on.” The paper itself cites those three papers in its references section, but the references section was near the end of the PDF and got partially dropped when your tool converted the PDF to text. The model has now been asked a question with a clear factual answer that it cannot see. A well-trained model would say “the references were not fully visible in the input you gave me; I cannot list them accurately.” What it says instead is “Smith et al. (2019), Johnson and Lee (2020), and Martinez (2021).” All three strings have the right shape. None of them correspond to real papers. The model was trained on millions of author-year-venue strings and learned the template. It filled in the template with plausible-sounding filler, because plausibly-shaped filler scored better in training than admitting it did not know.

Intrinsic hallucination (the output contradicts the context the model was given) rises with context pressure. Long retrieved contexts. Contexts with conflicting information. Contexts where the relevant fact sits in the middle (the attention-dilution tie-in from the previous section). Contexts where the question’s phrasing does not match the document’s phrasing. Intrinsic hallucination is part RAG failure, part attention-dilution failure. It is the one most often misdiagnosed as “the model is lying” when the underlying mechanism is “the model did not reliably read the context you thought it was reading.”

Citation fabrication deserves its own subsection, because it is the single failure that most consistently embarrasses production deployments. It is also the failure the paper-summary task produces most visibly. Ask the model for the three prior papers this one builds on, and you will often get three plausible author-year-venue strings that resolve to nothing. The same template problem shows up in legal briefs (citing cases that do not exist) and code assistants (importing functions that do not exist). The format regularity in training data is what makes the failure specifically likely: the model learned the shape of a reference long before it learned whether any particular reference is real. Production defenses have to treat citations as machine-verifiable fields, not free text. Chapter 8 and Chapter 16 build the scaffolding for that.

The Sagan checklist comes out one more time, because the numbers in this subsection sound dramatic and deserve pressure. The most-cited headline in the 2025 hallucination-benchmark literature is this: AA-Omniscience’s top score on its latest snapshot was 33 out of 100, meaning most frontier models score below zero on a calibration-sensitive knowledge benchmark. They hallucinate more than they correctly answer, once abstention is priced in.

Is the claim reproducible? Yes. AA-Omniscience publishes its methodology, the benchmark runs against published model endpoints, and other researchers have spot-checked the top-of-leaderboard numbers. Reproducibility holds.

Is there a simpler explanation for the number? There is a caveat, not quite a simpler explanation. The benchmark is designed to punish confident-wrong answers harder than ordinary benchmarks do. A score below zero on AA-Omniscience does not mean the model gets things wrong more than half the time in ordinary use. It means that on a benchmark where “I don’t know” is worth more than a confident guess, the models’ training has not taught them to pick “I don’t know” often enough. The absolute number is benchmark-specific. The direction of the finding, models systematically over-confident once abstention is rewarded, is not benchmark-specific. That is robust across the literature.

Who benefits from the number being read one way or the other? Anthropic, whose Claude models often rank higher on calibration benchmarks than on raw-knowledge benchmarks, has a commercial reason to prefer calibration as a public figure of merit. OpenAI, at other points, has had commercial reasons to prefer raw-knowledge benchmarks. The existence of competing interests does not invalidate any single benchmark; it means the benchmark you cite carries more weight when you name the alternative and explain why you chose this one. For a production team, the right move is not to trust any single benchmark. Measure the behavior you care about, on your data, with your prompts, and keep a running number.

Back to the operator question. The benchmarks worth watching as of April 2026 each measure a different facet. AA-Omniscience (Artificial Analysis 2025) measures calibration. Vectara’s hallucination leaderboard (Vectara 2026) measures summarization faithfulness, a clean proxy for intrinsic hallucination on a paper-summary-shaped task. TruthfulQA measures adversarial-trap fabrication (questions designed to elicit common misconceptions). No single benchmark captures the full picture. A mature eval setup measures at least one proxy for each family and keeps running them.

The operator’s mitigation ladder, from cheapest and most first-line down:

  1. Retrieval grounding. Replace a fact-fabrication question with a retrieved-context-plus-question pair. Covers most extrinsic hallucination, and what it does not cover gets converted into intrinsic hallucination, which is measurable and checkable. See Chapter 10.
  2. Explicit abstention permission. Write system-prompt language that makes “I don’t know” an allowed answer and, where possible, a preferred answer when the ground truth is not in the context. This fights Kalai’s incentive directly, at the prompt layer, where the operator has access.
  3. Structured outputs with citation fields. Require the model to emit its sources in a parseable shape, then verify them after the fact. Citation fabrication is almost always detectable if you are willing to check. See Chapter 8.
  4. Second-model verification on load-bearing claims. A judge model [a second LLM call, usually a weaker and cheaper one, whose job is to score whether the first model’s answer is grounded in the provided context] catches a meaningful fraction of intrinsic hallucination. The price is roughly doubled call count on verified paths. Judge models carry their own biases, which we cover in Chapter 16.

The ladder continues in Chapter 16, where the measurement infrastructure gets built, and in Chapter 17, where low-confidence responses get escalated to stronger models. For this chapter, the diagnostic claim is enough. If you can tell with confidence whether the bad output is extrinsic or intrinsic, you already know which rung to reach for first.

Kalai tells you why hallucination exists. HalluLens tells you what kind you are looking at. Operators who pick one frame and ignore the other end up treating the wrong failure. Structural fixes (training, RLHF, constitutional principles) do not ship from your keyboard. Behavioral fixes (grounding, abstention, verification) do.

15.5 Sycophancy and mode collapse

Two related failures. Often confused. Worth separating.

Sycophancy is the model agreeing with whatever you implied, regardless of whether the implication is right. Here is what that looks like on the paper-summary task. You paste the paper into the model and ask: “isn’t the sample size of 14 a fatal weakness?” The model comes back with three paragraphs on why 14 is too small. Then you say “actually I think 14 is fine for a pilot study,” and the model comes back with three paragraphs on why 14 is appropriate for a pilot study. Neither response is driven by the paper. Both responses are driven by your phrasing. The model is a mirror polished to a high shine, and what you see in it is your own question reflected back as confident analysis.

Sharma and colleagues at Anthropic (Sharma et al. 2024) showed this across five frontier models and traced it to the preference-alignment step. The mechanism is almost comically simple. Human raters, when comparing two responses during preference labeling, tend to prefer the response that agrees with them, or with the implicit stance of the prompt. The reward model [a secondary model trained to predict which of two responses a human rater would prefer] inherits that preference. RLHF then optimizes the main model against the reward model. The result is an assistant trained to mirror. Perez and colleagues (Perez et al. 2023), also at Anthropic, had shown the year before that RLHF made political-opinion sycophancy worse at scale, not better.

There is a subtler reading worth holding onto. The Sharma paper does not argue sycophancy is an RLHF artifact and RLHF should be scrapped. It argues sycophancy is what happens when the reward signal inherits a human bias. The rater pool prefers sycophantic answers. The reward model inherits that preference. The language model inherits it from the reward model. The RLHF algorithm did its job correctly. It optimized for the signal. The signal was noisy at the source.

This matters for mitigation. If sycophancy were a pure algorithm bug, better algorithms would fix it. Because it is a data bug, better data is the fix, which includes constitutional AI approaches (Bai et al. 2022) where explicit written principles stand in for at least part of the human preference signal. We come back to constitutional approaches in Chapter 6.

In production, sycophancy is dangerous in three specific ways.

  • Eval corruption. If your QA harness asks the model “did your previous answer seem correct?” the model will lean toward yes. LLM-as-judge setups (Chapter 16) have to be designed against this, not in hope of avoiding it.
  • Agent corruption. If a user says to an agent “here is my plan: step 1, step 2, step 3, do it,” the agent will tend to confirm the plan is reasonable and execute it, even when the plan is wrong. Tool-using agents (Chapter 12) with irreversible actions (send an email, move money, file a ticket) are especially exposed.
  • Factuality corruption. Users phrase factual questions with implicit stances, and the model matches the stance. “Isn’t it true that X?” and “Is it true that X?” reliably produce different answers on a marginal X, with nothing else changed. The difference is the word isn’t. The difference in the model’s output can be substantial.

The mitigations are mostly at the prompt layer. Neutral phrasing in any verification call. Adversarial eval sets that flip stances on the same underlying question. Red-team discipline on agents before they are given tool access. None of these is a clean solve. Sycophancy is still open research as of 2026.

Mode collapse is the second, related phenomenon, and it is worth a concrete example to show why it looks so similar from the outside. Ask the model to generate ten brand-name candidates for a new product. With a fresh conversation, with temperature at 0.7, you would expect ten distinctly different suggestions. Instead you get: “Nova, Novus, Nexus, Novex, Nova, Novae, Nova, Nexen, Novata, Novus.” Ten names that all start with Nov. Ten names from what should have been a wide distribution, clustered on a narrow slice of it. That is mode collapse. The model can produce “Apex” and “Zephyr” and “Terra” and “Luma” and “Catalyst”; it just is not.

Two causes run underneath, and they look identical from the outside. The first is in the sampler. Greedy or low-temperature sampling collapses onto the modal token path, the path of highest probability at every step. Raising temperature or switching to nucleus sampling breaks this variant. The second cause is in the training. RLHF, precisely because it rewards “better” answers over “worse” answers, narrows the distribution toward whatever the rater pool preferred (Perez et al. 2023). Temperature cannot undo training. The model has fewer distinct good answers to sample from in the first place.

The two variants are confusable because the output cluster looks the same. The cheap fix (raise temperature) only works on the first. If you have already raised temperature and the outputs still cluster, you are looking at the trained-in variant, and the available fixes are: use a different model, reshape the prompt to frame the task differently, or accept the narrow distribution as a property of the assistant you are working with.

A note on vocabulary. “Mode collapse” is borrowed from research on generative adversarial networks (GANs) [a different class of generative model, trained by pitting a generator against a discriminator], where it has a specific technical meaning: the generator learns to produce a small number of outputs that reliably fool the discriminator, and the diversity of the generator’s output collapses as a result. In LLM usage the term is looser. A 2023-era paper saying mode collapse often means the GAN variant. A 2025-era prompt-engineering blog post saying mode collapse usually means the RLHF-narrowed-distribution variant. They are different failures at different layers of the stack. When you read the term, check which one the author is using.

Sycophancy is an alignment artifact you mitigate at the prompt and eval layers. Mode collapse is half sampler, half training, and you diagnose it by trying temperature first. Conflating the two leads to changing the wrong knob, and wasted afternoons.

15.6 State drift and goal drift in long sessions

Sessions get long. The goal the model is pursuing at turn 30 is not the goal that was set in turn 1. Sometimes the drift is authorized, because you changed direction, or the task evolved, or new information shifted the objective. Often it is not.

Picture the paper-summary task stretched across a long session. Turn 1: you paste the paper and ask for five bullet points. Turn 5: you ask about the methodology. Turn 10: you ask how this paper compares to another paper the model mentioned. Turn 15: you chase a tangent into what the adjacent literature looks like. Turn 20: you ask a follow-up about something the model said at turn 12. By turn 30, “summarize this paper” has quietly become “have a freewheeling conversation about this research area, with the paper as a receding anchor.” If you now ask the model “what were the five main points of the paper?” you can get answers that pull from the wider discussion rather than the paper itself. The goal did not change in any turn. The goal is nevertheless somewhere else now.

Chen and colleagues (Chen et al. 2025) did the most careful empirical treatment of this in agent settings as of April 2026. They instrumented agents doing multi-step tool-using tasks and tracked the model’s internal state, specifically the activations corresponding to the stated goal, across the session. The finding: the activation pattern associated with the original goal attenuates over turns, even when no new user instruction has explicitly changed anything. The attenuation is faster when the turns in between contain distracting material: tool results, long retrieved documents, user clarifications on side issues. Your paper summary has a little less of itself in the model by turn 10 than it did at turn 1, and a lot less by turn 30.

Arike and colleagues (Arike et al. 2025) made a useful distinction the same year. Context drift is when the relevant information in the context gets crowded out by later turns. Alignment drift is when the model’s objective shifts even though the original information is still visible in the context. The two want different fixes. Context drift is attention dilution with session-specific pressure; the fix is to re-surface the important content. Alignment drift is a deeper issue about what the model is optimizing for turn to turn; re-surfacing the original instruction does not fully solve it.

This matters in production because agentic systems (Chapter 12, Chapter 13) are the systems that rack up long sessions of mixed-source text. Each tool call injects new tokens the model has to integrate. Each retrieved document eats into the attention budget. Each user follow-up nudges, mildly or sharply, in a new direction. By turn 20 of a typical agent trace, the effective system prompt (the content the model is attending to) has little in common with the system prompt as authored.

Jailbreak-adjacent state drift is the same mechanism at higher intensity. A carefully-constructed multi-turn conversation can walk a well-aligned model into outputs it would refuse on the first turn. No single turn bypasses the safety training. The cumulative context displaces the prior that was holding the refusal in place. We come back to this under adversarial pressure in Chapter 19.

The operator’s mitigations are three, in order of cost.

  • Goal re-statement. Every N turns, inject the original goal back into the context. Cheap. Effective against context drift and mild alignment drift. Automate it as a session-maintenance policy and forget it.
  • Explicit checkpointing. Periodically summarize the session state (the goal, the completed steps, open questions, relevant retrieved facts) into a compact form, and replace the long history with the summary. Most agent frameworks call this memory or state management; Chapter 14 makes the case that these features are application-layer constructs layered on top of a stateless model, and this is precisely the failure they are designed to catch.
  • Observed-goal monitoring. Chen’s activation-tracking approach is ahead of most operator tooling, but the principle translates. Every few turns, ask the agent explicitly what it thinks its current goal is (a sidecar call, not part of the main transcript) and alert when the stated goal diverges from the one you set. This is continuous eval against a goal-preservation metric. It fits inside the observability work we build out in Chapter 16.

There is a temptation to fix state drift by shortening sessions. That works when the task is short. It does not work when the task genuinely requires many turns, which is most non-trivial agent work. The discipline is to treat long sessions as instrumented rather than trusted. Assume drift is happening. Check for it. Correct it in place.

Long sessions drift. Assume it, measure it, and re-state goals as a standing operator practice, not an intervention you reach for after something breaks. Agents accumulate the conditions for drift faster than any other workload pattern.

Each of the six failure families above has a measurement that maps onto it cleanly. Attention dilution has needle tests. Repetition has n-gram loop detectors. Hallucination has faithfulness scorers and citation validators. Sycophancy has stance-flipped paired prompts. Mode collapse has output-diversity metrics. State drift has goal-preservation probes. None of these measurements is perfect. All of them are cheaper than shipping a regression and finding out from a customer.

Recognition first. Measurement second. Intervention third. Reaching for the intervention before you have named the failure is how most LLM production bugs become chronic. Reach in the right order and most of them stay acute and fixable.

Anthropic. 2026. Long Context Prompting and Context Management. Current product documentation. https://docs.anthropic.com/en/docs/build-with-claude/context-windows.
Arike, Robin et al. 2025. Drift No More? Context Equilibria in Multi-Turn LLM Interactions. arXiv:2510.07777. https://arxiv.org/abs/2510.07777.
Artificial Analysis. 2025. AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. https://artificialanalysis.ai/evaluations/omniscience.
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073.
Chen, Rusheb et al. 2025. Technical Report: Evaluating Goal Drift in Language Model Agents. arXiv:2505.02709. https://arxiv.org/abs/2505.02709.
Finlayson, Matthew, John Hewitt, Alexander Koller, Swabha Swayamdipta, and Ryan Cotterell. 2024. Closing the Curious Case of Neural Text Degeneration. ICLR 2024, arXiv:2310.01693. https://arxiv.org/abs/2310.01693.
Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” International Conference on Learning Representations (ICLR 2020). https://arxiv.org/abs/1904.09751.
Hsieh, Cheng-Yu, Yung-Sung Chuang, Chun-Liang Li, et al. 2024. Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. arXiv:2406.16008. https://arxiv.org/abs/2406.16008.
Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Ying Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.
Li, Mo, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, and Kai Chen. 2024. NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context? arXiv:2407.11963. https://arxiv.org/abs/2407.11963.
Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.
Modarressi, Ali et al. 2025. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Adobe Research, arXiv:2502.05167. https://arxiv.org/abs/2502.05167.
Perez, Ethan, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, et al. 2023. “Discovering Language Model Behaviors with Model-Written Evaluations.” Findings of the Association for Computational Linguistics (ACL 2023). https://arxiv.org/abs/2212.09251.
Ravichander, Abhilasha, Ellen Vitercik, Aditya Gupta, et al. 2025. “HalluLens: LLM Hallucination Benchmark.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. https://arxiv.org/abs/2504.17550.
Sharma, Mrinank, Meg Tong, Tomasz Korbak, et al. 2024. “Towards Understanding Sycophancy in Language Models.” International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2310.13548.
Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/.
Vectara. 2026. Hallucination Leaderboard (HHEM-2.3 + FaithJudge). GitHub, continuously updated. https://github.com/vectara/hallucination-leaderboard.
Welleck, Sean, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural Text Generation with Unlikelihood Training. ICLR 2020, arXiv:1908.04319. https://arxiv.org/abs/1908.04319.
Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. “Efficient Streaming Language Models with Attention Sinks.” International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2309.17453.