What LLMs Aren’t | Working with Large Language Models | Synoros

If you’ve ever used your phone’s autocomplete, you’ll know it usually goes something like this. Type “I’ll be there” and the keyboard suggests “soon” or “in five minutes” above the keys. Tap the middle suggestion and it gets added to your message. Keep tapping and the phone will write a full sentence for you, one word at a time, each word picked by guessing what is most likely to come next.

That is an LLM.

Not metaphorically. That is the actual mechanism. You type a prompt, the model reads it, picks the most likely next word, writes it down, and reads the whole thing again (your prompt plus the word it just wrote) to pick the next word. It does this hundreds or thousands of times per response.
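The loop just described can be sketched in a few lines. This is a toy under loud assumptions, not how any production model is built: the `NEXT_WORD` table is a made-up stand-in for the model, and it looks only at the last word, whereas a real LLM reads the entire text so far on every step.

```python
# Toy "autocomplete in a loop". NEXT_WORD is a hypothetical lookup table
# standing in for the model. A real LLM scores a huge vocabulary of tokens
# at each step and conditions on the whole text, not just the last word.
NEXT_WORD = {
    "I'll": {"be": 0.9, "see": 0.1},
    "be": {"there": 0.7, "late": 0.3},
    "there": {"soon": 0.6, "tomorrow": 0.4},
}


def autocomplete(prompt: str, max_words: int = 10) -> str:
    words = prompt.split()
    for _ in range(max_words):
        candidates = NEXT_WORD.get(words[-1])
        if not candidates:  # nothing known for this word: stop
            break
        # Pick the most likely next word and append it to the text.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


print(autocomplete("I'll"))  # I'll be there soon
```

Each pass through the loop appends one word and then starts over with the longer text, which is the whole trick.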

The differences between your phone’s autocomplete and Claude, ChatGPT, or Gemini are not differences in what they do. They are differences of scale:

Your phone picks from a handful of likely next words. Claude picks from roughly 100,000 pieces of text (called tokens [a chunk of text, usually a word or part of a word]) on every step.
Your phone learned from what you personally type. Claude learned from a big fraction of the internet plus a lot of books.
Your phone can see the last sentence. Claude can see around 200,000 tokens of prior text at once (very roughly 150,000 words), which is about the length of a long novel.

Same operation. Bigger machine. The entire rest of this book is about what that bigger machine can and cannot do, and why the people selling it to you have every reason to describe it as something more dramatic than an autocomplete in a loop.

Four misconceptions need clearing first. Each one sticks because the chat interface is designed to encourage it. None of them are the end user’s fault, but all of them will get in the way.

Here is the task this book will follow end to end. You have a forty-page research paper open on your screen. Maybe a friend forwarded it. Maybe your doctor mentioned a study and you want to check. Maybe you found it on arXiv and you are trying to decide whether to read the whole thing tonight. Whatever the reason, you want the five biggest things the paper says, without reading forty pages yourself. So you paste it into Claude, ChatGPT, or Gemini, and you ask.

1.1 Not a chatbot

The ChatGPT interface is a UI [user-interface, the visible part you type into] pattern. The thing underneath is not a chatbot. It is a stateless function [a function that keeps no memory of its past use] that reads a sequence of tokens and outputs its best guess at the next one.

Here is what is strange about that. Every “message” you send to Claude, ChatGPT, or Gemini gets resubmitted alongside every previous message in the conversation. The server remembers nothing about you. Not your last message. Not your last session. Not your name. Every API call [how the interface talks to the actual model] ships the whole transcript of the chat back and forth, and then the server throws it away.

Picture sending your third message in a chat. The actual data on the wire looks like this:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The full transcript so far, resent on every request.
messages = [
    {"role": "user", "content": "Can you summarize this paper?"},
    {"role": "assistant", "content": "Sure, paste the text."},
    {"role": "user", "content": "[40 pages of paper here]"},
    {"role": "assistant", "content": "Here are the key findings..."},
    {"role": "user", "content": "Now give me just the methodology."},
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=messages,
    max_tokens=500,
)

The model does not know it has been in conversation with you for three turns. It knows it has been handed some tokens of prior text, and it is being asked to continue them. Everything that looks like “memory” from the user side of the interface is the application quietly re-submitting old text on every request.
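To make the division of labor concrete, here is a minimal sketch of an application that keeps the transcript and resends it on every turn. The `stateless_model` function is a hypothetical stand-in for the real API call, invented for this illustration; it only reports what it was handed.

```python
# `stateless_model` is a hypothetical stand-in for the real API call.
# It sees only the messages passed in on this one call.
def stateless_model(messages: list[dict]) -> str:
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply based on {len(messages)} messages, {user_turns} from the user)"


class ChatApp:
    """The 'memory' lives here, in the application, not in the model."""

    def __init__(self) -> None:
        self.history: list[dict] = []

    def send(self, text: str) -> str:
        self.history.append({"role": "user", "content": text})
        reply = stateless_model(self.history)  # whole transcript, every time
        self.history.append({"role": "assistant", "content": reply})
        return reply


app = ChatApp()
app.send("Can you summarize this paper?")
app.send("[40 pages of paper here]")
print(app.send("Now give me just the methodology."))
# (reply based on 5 messages, 3 from the user)
```

Delete `self.history` and the "conversation" disappears, even though the model function has not changed at all.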

The conversational quality of chat interfaces is an illusion made by resubmitting every prior turn. Companies go to real engineering trouble to make this affordable. Serving systems cache the intermediate computation for text the model has already processed, and techniques such as PagedAttention (Kwon et al. 2023) manage that cache’s memory efficiently, so the whole conversation does not have to be recomputed from scratch on every turn. But none of that is the model remembering you. It is all scaffolding around a stateless function.

This has practical consequences for our paper-summary task. If the paper is so long that earlier turns get trimmed out to stay within the model’s context limit [how much prior text the model can see at once], the model will “forget” parts you have already discussed. If you upload the paper as a tool result but it does not get re-included in a follow-up request, the model will act as if it never saw the paper. Memory features shipping in ChatGPT, Claude, and Gemini through 2025 and 2026 are all product features that save pieces of your conversations to a separate database and quietly slip them back into the prompt on future calls. They are not the base model remembering. We come back to this in Chapter 14.
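Here is a rough sketch of how trimming can silently drop the paper. The one-token-per-word estimate and the `trim_to_budget` helper are assumptions made for illustration; real applications use a proper tokenizer and often smarter strategies than dropping the oldest turns.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption for illustration: about one token per word.
    return len(text.split())


def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))


messages = [
    {"role": "user", "content": "paper " * 50},  # stands in for the pasted paper
    {"role": "assistant", "content": "Here are the key findings."},
    {"role": "user", "content": "Now give me just the methodology."},
]
trimmed = trim_to_budget(messages, budget=20)
print([m["role"] for m in trimmed])  # ['assistant', 'user']
```

The follow-up question survives, but the paper it refers to is gone, so the model is asked for "the methodology" of a document it was never shown on this call.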

1.2 Not a personal assistant

The model does not learn from you. Once training is over, the model’s weights [the learned numbers that make it work] are frozen. Whatever “personalization” feels like it is happening inside a conversation is the model paying attention to what you just said, not the model learning anything new (Brown et al. 2020).

You can demonstrate this in fifteen seconds:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Call 1 (fresh API call, no history)
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,  # required by the API
    messages=[{"role": "user", "content": "My name is Kevin."}],
)
# Typical response: "Nice to meet you, Kevin."

# Call 2 (fresh API call, no history, no connection to Call 1)
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    messages=[{"role": "user", "content": "What's my name?"}],
)
# Typical response: "I don't know your name, could you tell me?"

The second call has no connection to the first. If you want the second call to know about the first, you bring the first call’s text back in the messages array [data structure holding the conversation] yourself.

The weights are fixed at the moment the model ships. Getting there takes three stages (Ouyang et al. 2022). A raw version of the model (called the base model) first reads an enormous pile of text and learns to autocomplete anything. Then a smaller round of training on example conversations written by humans teaches it the shape of a helpful answer. Then a final round uses human ratings of its outputs to nudge it toward the kinds of answers people actually prefer (Rafailov et al. 2023). After all that, the weights are frozen and the model is released. None of those training stages runs again while you are using the model.

What about “Claude memory” and “ChatGPT memory”? Those are product features built on top. The application detects something worth remembering (“Kevin prefers dense summaries”), saves it in a separate database, and quietly pulls it back out later to slip into the system prompt [the always-included instructions that sit above your message] on future calls. That pattern, save a relevant piece in a database and slip it into the prompt when needed, has a name in the field: retrieval-augmented generation (Chapter 10). The memory feature is that pattern with a friendly UI on top. At the model layer, every call is still stateless.
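The save-and-reinject pattern can be sketched as follows. Everything here is hypothetical: the in-memory dictionary stands in for a database, and `remember` and `build_request` are illustrative names, not any vendor's actual memory implementation.

```python
# Hypothetical memory layer. The store is an in-memory dict here; a real
# product would presumably use a database and choose what to recall.
memory_store: dict[str, list[str]] = {}


def remember(user_id: str, note: str) -> None:
    memory_store.setdefault(user_id, []).append(note)


def build_request(user_id: str, user_message: str) -> dict:
    """Assemble one stateless request with saved notes in the system prompt."""
    notes = memory_store.get(user_id, [])
    system = "You are a helpful assistant."
    if notes:
        system += "\nKnown about this user:\n" + "\n".join(f"- {n}" for n in notes)
    return {
        "system": system,
        "messages": [{"role": "user", "content": user_message}],
    }


remember("kevin", "Prefers dense summaries")
request = build_request("kevin", "Summarize the attached paper.")
print(request["system"])
```

The model receives the note as ordinary prompt text. From its side there is no difference between a "memory" and anything else the application chose to paste in.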

Everything that looks like “the model knows me” is scaffolding. The model knows exactly what was in this one request, and nothing more. That is why system prompts (Chapter 6), retrieval (Chapter 10), and memory tools (Chapter 14) all matter: they are how you stage useful context for a fundamentally stateless function.

This also explains a common complaint: LLMs seem to “forget” things partway through a long thread. They have not forgotten. They were never told. Something on the client side got truncated to fit the context window, or a tool result failed to make it back into the next turn. Debugging that means looking at what was actually in the messages array for the failing call, not interrogating the model about why it is suddenly confused.

1.3 Not a reasoning engine

When an LLM produces a chain of step-by-step “thinking” that solves a word problem, something happens that resembles reasoning. The model generates intermediate work, uses it, and produces a final answer. Chain-of-thought prompting (Wei et al. 2022) measurably improves accuracy on arithmetic and logic tasks. The result is real.

Here is what that actually looks like in practice. Imagine a word problem with five small steps:

“A bakery sold 120 muffins on Monday. On Tuesday they sold 30% more than Monday. On Wednesday, half as many as Tuesday. On Thursday, 20 fewer than Wednesday. On Friday, the same as Monday. How many muffins did they sell in total over the five days?”

Without chain-of-thought, the model is handed the question and starts guessing the next word immediately, aiming straight at a final number. It gets one shot.

With chain-of-thought, the model is prompted to write its work down first. The same autocomplete loop from the start of this chapter runs; each new word gets picked by looking at everything already written. But now what gets written is scratch work that stays on the page:

It sees the question and writes “Monday: 120 muffins.”
It sees the question plus “Monday: 120 muffins” and writes “Tuesday: 30% more than 120 is 156.”
It sees the question plus those two lines and writes “Wednesday: half of 156 is 78.”
It continues line by line for Thursday and Friday, then writes the final total at the end: “Total: 120 + 156 + 78 + 58 + 120 = 532.”

There is no separate “reasoning module” anywhere in this. Every line comes out of the same next-word-guessing loop from the opening of the chapter. The only difference is that the model’s own earlier outputs get fed back in as input, so the arithmetic it wrote three lines ago becomes context that the next line’s guesses can rely on. Chain-of-thought is autocomplete writing its own scratch paper, then reading it back.
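For reference, the scratch work for the bakery problem can be checked directly. The sketch below simply redoes the arithmetic in ordinary code, where each line depends on the one before it, the same dependency structure the written-out steps have.

```python
monday = 120
tuesday = monday * 130 // 100   # 30% more than Monday -> 156
wednesday = tuesday // 2        # half as many as Tuesday -> 78
thursday = wednesday - 20       # 20 fewer than Wednesday -> 58
friday = monday                 # same as Monday -> 120

total = monday + tuesday + wednesday + thursday + friday
print(total)  # 532
```

The difference is that this code computes each value, while the model predicts text that looks like each value. When the prediction is right, the two are indistinguishable on the page.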

But “resembles” is doing work. In 2023, a study called Faith and Fate (Dziri et al. 2023) tested models on tasks that break down into many small sub-steps, each of which the models could handle on its own: multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. As the number of steps that had to be chained together grew, accuracy collapsed. Models failed in systematic ways that looked like sophisticated pattern-matching against similar examples from training, not real step-by-step deduction. The following year, GSM-Symbolic (Mirzadeh et al. 2025, first posted in 2024) ran an even simpler experiment: take a standard math word problem, slide a single irrelevant sentence into it (something like “five of the apples were green”), and watch accuracy drop by up to 65% across the state-of-the-art models tested. A system doing real symbolic reasoning would not care about a sentence that does not affect the answer. These models do.
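The irrelevant-sentence test is easy to sketch. The harness below is an illustration only: `add_distractor` and `accuracy` are hypothetical helpers, and `naive_solver` is a deliberately brittle toy that adds up every number it sees. It is not a model of how LLMs fail, just a solver that matches surface patterns instead of reading the question.

```python
import re
from typing import Callable


def add_distractor(problem: str, distractor: str) -> str:
    """Insert an irrelevant sentence just before the final question."""
    body, sep, question = problem.rpartition(". ")
    if not sep:  # single-sentence problem: put the distractor first
        return f"{distractor} {problem}"
    return f"{body}. {distractor} {question}"


def naive_solver(problem: str) -> int:
    # Deliberately brittle toy: sums every number in the text.
    return sum(int(n) for n in re.findall(r"\d+", problem))


def accuracy(solver: Callable[[str], int], cases: list[tuple[str, int]]) -> float:
    return sum(solver(q) == a for q, a in cases) / len(cases)


cases = [
    ("A farmer picks 40 apples on Monday and 30 on Tuesday. "
     "How many apples did the farmer pick in total?", 70),
    ("A shop sells 12 pens in the morning and 8 in the afternoon. "
     "How many pens did it sell in total?", 20),
]
distractor = "Note that 5 of the items were green."
perturbed = [(add_distractor(q, distractor), a) for q, a in cases]

print(accuracy(naive_solver, cases), accuracy(naive_solver, perturbed))  # 1.0 0.0
```

The correct answers are identical before and after the perturbation, so any drop in accuracy is attributable to the solver reacting to text that should not matter.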

The 2025 literature tightened the picture again. Apple’s The Illusion of Thinking (Shojaee et al. 2025) found that large reasoning models [the newer class of models, like OpenAI’s o3-mini and DeepSeek-R1, that write long internal scratch work before producing an answer] show a sharp accuracy cliff as tasks get more complex. They stay strong up to a threshold, then fall off. The paper attracted an immediate rebuttal (Lawsen 2025) arguing the cliff was not real. The models, the rebuttal said, were running into a limit on how many words they were allowed to output, and getting judged as failing when they had actually been cut off mid-solution. The rebuttal’s author has since walked back parts of it, but the back-and-forth is itself instructive. It shows how much “LLM reasoning research” is measuring the setup around the model rather than the model itself.
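The measurement issue at the heart of that exchange can be illustrated with a small grading sketch. The `Attempt` record and its `hit_output_limit` flag are hypothetical, and substring matching is a simplification of real answer checking. The point is only that a grader with two buckets (right, wrong) and a grader with three (right, wrong, cut off) can tell different stories about the same outputs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Attempt:
    output: str
    expected: str
    hit_output_limit: bool  # hypothetical flag: generation stopped at the cap


def grade(attempt: Attempt) -> str:
    # Substring matching is a simplification; real graders parse the answer.
    if attempt.expected in attempt.output:
        return "correct"
    if attempt.hit_output_limit:
        return "truncated"  # cut off mid-solution, not necessarily wrong
    return "wrong"


attempts = [
    Attempt("Step 1, step 2, final answer: 532", "532", False),
    Attempt("Step 1, step 2, step", "532", True),
    Attempt("Final answer: 512", "532", False),
]
print(Counter(grade(a) for a in attempts))
```

A two-bucket grader would report one success and two failures here. Whether the truncated attempt counts as a reasoning failure is exactly the kind of setup decision the debate turned on.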

The honest read of this literature, as of April 2026, is this:

LLMs produce reasoning-shaped behavior whose reliability degrades as problems get longer and require more steps that depend on each other, or when the input looks very different from anything the model saw during training. Chain-of-thought helps for a while, but it cannot eliminate the failure regime entirely.

Why the disagreement teaches. Dziri and Mirzadeh locate the failures inside the model’s inference [the computation the model runs when given an input]. Apple’s work locates them in how that inference scales to harder problems. The Lawsen rebuttal locates them in the evaluation methodology. All three can be correct at different layers of what is actually happening. That “LLM reasoning” can fail at the internal-computation layer, the scaling-to-harder-problems layer, AND the measurement layer tells you the thing we are calling reasoning is not one thing.

Back to the paper summary. Hand the model a paper whose argument has several steps that build on each other, something like a formal proof, or any paper where each section depends on the one before it. The model will come back with a fluent, confident summary. The sentences will read like a real summary. They will use the right technical words in the right order.

If you check whether the summary actually traced the argument, though, you often find it did not. What the model produced was a summary shaped like summaries of papers like this one. In training, it learned that when you summarize a proof, you write things like “the authors establish X by first proving A, then combining A with B.” It can generate that sentence whether or not it actually worked out how A and B combine to yield X. The words come out right. The reasoning behind them may or may not exist.

For most papers, this is fine. Your summary is a rough sketch you will double-check anyway. For papers where the specific steps matter (a proof you are going to cite, a method you are going to reproduce), it is not fine. You need to know which kind of paper you handed it.

This is not a dunk on LLMs. They do something, and at today’s scale that something is genuinely useful. But “reasoning engine” suggests the kind of deliberate, step-by-step logic a proof checker runs, or a human mathematician works through on paper. That is not the mechanism. Treating it as the mechanism leads to specific blind spots: the model will confidently solve one geometry problem and confidently botch the near-identical next one. We spend real time on that failure mode in Chapter 15.

1.4 Not a database of facts

Ask a frontier model a factual question. It answers. Often correctly. Sometimes confidently wrong. That has led to the intuition that there is some kind of fact-store inside the model, and hallucination [the model saying confidently wrong things] is just a bug in the lookup.

The reality is stranger.

A 2025 OpenAI and Georgia Tech paper titled Why Language Models Hallucinate (Kalai et al. 2025) frames hallucination as a predictable consequence of how models are trained. During training and during standard evaluation, models are rewarded for producing plausible-looking outputs. Saying “I don’t know” earns no credit, while a confident guess at least has a chance of being marked correct. So the training incentive quietly tilts the model toward confident guesses on questions it cannot actually answer reliably. The hallucination is not a lookup bug. It is structural. It falls out of the loss function [the numerical score that guides what the model learns during training] and of how the results are graded.
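The incentive can be shown with a few lines of arithmetic. This is a simplified scoring rule chosen for illustration, not the paper's formal analysis: a right answer earns 1 point, a wrong answer costs `wrong_penalty` points, and abstaining earns 0.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of guessing: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


ABSTAIN = 0.0  # "I don't know" scores zero under both schemes

for p in (0.1, 0.3, 0.5):
    binary = expected_score(p, wrong_penalty=0.0)     # wrong answers cost nothing
    penalized = expected_score(p, wrong_penalty=1.0)  # wrong answers cost a point
    print(f"p={p:.1f}  guess(binary)={binary:+.2f}  guess(penalized)={penalized:+.2f}")
```

With no penalty, guessing beats abstaining at every confidence level above zero, so a system tuned against that grading learns to guess. Once wrong answers cost a point, guessing only pays when the model is right more than half the time.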

Meta FAIR’s HalluLens benchmark (Ravichander et al. 2025) comes at it differently. Instead of explaining where hallucinations come from, it classifies the shapes they take:

Extrinsic hallucinations: outputs that cannot be grounded in any real source. Fabricating a paper citation when asked about real literature is the classic case.
Intrinsic hallucinations: outputs that directly contradict the context the model was given. You upload a paper that clearly states X, ask the model to summarize it, and the summary says not-X.

Both framings are useful, and they answer different questions. Kalai explains why hallucination happens. HalluLens describes what it looks like. The Kalai frame applies equally to intrinsic and extrinsic cases (the training incentives do not distinguish). The HalluLens taxonomy applies to any generative system regardless of why it fails.

That both framings work tells us something important: hallucination is not one phenomenon. There is a cause, the push from training incentives that tilts the model toward making things up (Kalai), and there are at least two distinct kinds of making-up (HalluLens), each calling for a different fix. A book that picked one frame and dismissed the other would let you believe hallucination is a single problem with a single fix. It is not.

There is a third angle worth naming. Anthropic’s Scaling Monosemanticity work (Templeton et al. 2024) found that many concepts correspond to identifiable internal features [patterns of activity inside the model that respond to one specific concept, like “the Golden Gate Bridge” or “sycophancy”] rather than being smeared evenly across the weights. So the picture “facts are just statistical residue” is too flat. The better statement is:

Pulling a fact out of the weights is imperfect and tangled. There is no database inside the model that you can query. But the internal structure is richer than uniform smear. Some concepts and facts have identifiable locations in the model, and interpretability research [the subfield that tries to figure out what is happening inside a trained model] is only in the early days of mapping them.

Both can be true. The model is not a database, but it is not only a statistical blur either.

Practical upshot for the paper-summary task. Treat the weights as an unreliable oracle. If a fact matters to your summary (a specific number, a citation, a named method), pull it from the paper itself at summary time (Chapter 10) or have the model quote the passage and cite a page number you can check. Do not let the model reach into its weights for something it could read off the page. Interpretability research is fascinating and will matter for the next generation of systems; for today, ground anything load-bearing.
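One cheap grounding check is to confirm that every number in the summary appears somewhere in the source. The sketch below is a minimal illustration with a hypothetical helper name; it catches only invented or altered numbers, and it says nothing about whether a number is attached to the right claim.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def ungrounded_numbers(summary: str, source: str) -> list[str]:
    """Return numbers that appear in the summary but nowhere in the source."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(summary) if n not in source_numbers]


source = "We enrolled 412 patients. The treatment group improved by 12.5% over baseline."
summary = "The study enrolled 412 patients and reports a 15% improvement."

print(ungrounded_numbers(summary, source))  # ['15%']
```

Anything this check flags is worth tracing back to the page by hand before you repeat it.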

Benchmarks tracking hallucination rates update constantly. AA-Omniscience (Artificial Analysis 2025), Vectara’s leaderboard (Vectara 2026), and TruthfulQA measure different things, respectively: breadth of knowledge across fields, whether summaries faithfully reflect the source, and trick questions designed to trip the model up. As of April 2026, AA-Omniscience’s top score is 33 on an index that runs from -100 to 100. On a benchmark that rewards “I don’t know” and penalizes confident wrong answers, most frontier models score below zero. They give more wrong answers than right ones. That is the current state of the art.

1.5 What it is, instead

So: not a chatbot, not a personal assistant, not a reasoning engine, not a database of facts. What an LLM actually is can be stated in one sentence you already know:

An LLM is your phone’s autocomplete, at massive scale, guessing the next word in a loop.

Every part of that sentence does real work. Autocomplete: a function [a machine that takes an input and produces an output] whose input is tokens and whose output is a probability distribution [a ranking of how likely each possible next token is] over what comes next. Scale: a much bigger vocabulary, a much bigger training set, a much bigger context. Loop: pick a token, append it, run the whole thing again.
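The "probability distribution" step can be made concrete with a small sketch. The raw scores below are invented for illustration, and the three-token vocabulary is a stand-in for a real one. The softmax function itself is the standard way of turning raw scores into probabilities that sum to one.

```python
import math
import random


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical raw scores for three candidate next tokens.
scores = {" soon": 2.0, " tomorrow": 1.0, " never": -1.0}
probs = softmax(scores)

greedy = max(probs, key=probs.get)  # always take the top token
sampled = random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print({tok: round(p, 3) for tok, p in probs.items()})
print("greedy:", greedy, "| sampled:", sampled)
```

Many deployed systems sample from the distribution rather than always taking the top token, which is one reason the same prompt can produce different responses on different runs.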

The four misconceptions we cleared here are not random cultural noise. They stick because the product layer (the chat bubble, the memory feature, the confident voice) is built deliberately on top of a mechanism that is none of those things. Once you see the mechanism, the user experience becomes a different object: a carefully designed wrapper that feels humanlike, layered over a powerful but alien thing.

Artificial Analysis. 2025. AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. https://artificialanalysis.ai/evaluations/omniscience.

Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165.

Dziri, Nouha, Ximing Lu, Melanie Sclar, et al. 2023. “Faith and Fate: Limits of Transformers on Compositionality.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2305.18654.

Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.

Lawsen, A. 2025. The Illusion of the Illusion of Thinking. arXiv:2506.09250. https://arxiv.org/abs/2506.09250.

Mirzadeh, Iman, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” International Conference on Learning Representations (ICLR 2025). https://arxiv.org/abs/2410.05229.

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2305.18290.

Ravichander, Abhilasha, Ellen Vitercik, Aditya Gupta, et al. 2025. “HalluLens: LLM Hallucination Benchmark.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. https://arxiv.org/abs/2504.17550.

Shojaee, Parshin, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. 2025. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML Research, arXiv:2506.06941. https://arxiv.org/abs/2506.06941.

Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/.

Vectara. 2026. Hallucination Leaderboard (HHEM-2.3 + FaithJudge). GitHub, continuously updated. https://github.com/vectara/hallucination-leaderboard.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2201.11903.