Memory Systems | Working with Large Language Models | Synoros

The assistant does not remember you. A wrapper pastes your history back in.

That is the whole chapter in one sentence. Section 1.2 made the claim; the mechanism chapter showed why. A forward pass takes a sequence of tokens and returns a guess at the next one. Nothing stays behind. When you send your thirty-seventh message in a long conversation, the server was handed the full transcript again, all thirty-six earlier turns prepended to the new one, and it will throw the whole bundle away the moment it finishes typing its reply.

And yet every major LLM product in April 2026 advertises “memory.” ChatGPT remembers the coffee shop you asked about last Tuesday. Claude pulls up the codebase you described three weeks ago. Gemini holds two million tokens of context across calls, roughly four copies of War and Peace stacked end to end. The marketing copy uses the word “memory” with a straight face. This chapter reconciles those two facts.

The reconciliation is not subtle. All of that apparent memory is a piece of software sitting outside the model, picking things out of a store somewhere and pasting them into each new prompt before the model sees it. The interesting question is which piece of software, at which layer of the stack, on whose behalf, storing what, where. The word “memory” in LLM product discourse covers at least five distinct systems, built at different layers on different assumptions. A reader who hears “the model remembers” and does not ask which layer cannot debug memory bugs, cannot budget for them, and will be surprised when each one fails in its own particular way.

Hold on to the running example. You are summarizing that forty-page research paper. After three sessions, you want the assistant to remember which paper, which bits you already pulled out, and which sections you flagged as suspect. That single expectation touches four of the five systems we are about to name. Every time the chat window seems to “remember” your paper across sessions, some piece of software outside the model picked something and pasted it in.

14.1 Five things the word “memory” means

Pin the layer before using the word. We build the taxonomy from the inside out: start at the layer closest to the forward pass, move outward to what the user sees in the chat bubble.

Layer 1. The KV cache. The KV cache [key-value cache, a buffer inside the serving system that stores intermediate attention numbers for the current request so the model does not recompute them from scratch on every new token] lives and dies inside a single request. When the model processes your prompt, each attention layer produces a pair of vectors at every position, one called the key, one called the value. Generating the next token needs to attend over every position in the prompt so far; without a cache, that means re-running the whole attention computation from scratch on every new token. With a cache, the numbers from earlier positions stay in GPU memory and the model only computes the new column. PagedAttention (Kwon et al. 2023) is the canonical way to manage this cache in a multi-tenant server. When the request ends, the cache gets reclaimed. This is the shortest-lived “memory” in the stack, and it is not memory in the user sense at all. It is bookkeeping inside one forward pass. We flag it here because the word “cache” gets overloaded in the next few sections and it helps to keep the layers separate.

Hand the model your forty-page paper and ask for a summary. The KV cache fills up as the model reads the paper into its context. A second or two later, the summary comes back. Then the cache is gone. If you open a new chat two minutes later and paste the paper again, the cache starts from zero. Nothing at this layer crosses requests.

Layer 2. Context-window stuffing. The classical chat pattern: keep every prior turn in an array, send the whole array again on every request, let the model condition on it via in-context learning [the behavior where a model adapts to patterns in the current prompt without anyone changing the weights] (Brown et al. 2020). This is what the ChatGPT interface has been doing since day one. From the user’s seat, it looks like the assistant remembers what you said three turns ago. From the model’s seat, it is being handed a longer prompt than last time, the same as if you had typed all of that fresh. The mechanism is the one Section 1.1 took apart. Still, it is a cheap and effective way to fake continuity for as long as one session lasts. Our forty-page paper, once it is in the transcript, comes back around on every new turn as long as the session holds.

Layer 3. Retrieval-as-memory. When a conversation grows past what you want to resubmit on every turn (too expensive, too slow, or no longer fits in the context window), you switch from “send everything” to “send what seems relevant.” The operator turns past turns into embeddings [short lists of numbers that place a chunk of text at a point in meaning-space], stores them in a vector index [a database that ranks stored chunks by how close their numbers are to a query chunk’s numbers], embeds the new user turn as another point, and pulls the top few closest past snippets into the prompt. This is the same retrieval-augmented generation pipeline from Chapter 10, applied to a conversation history instead of a document corpus. Same mechanism, different source of truth. For the paper workflow: after a week of sessions, instead of resubmitting every prior turn, the wrapper pulls back the three or four past turns whose embeddings sit closest to whatever you just asked.

Layer 4. External store with agent-driven read and write. The model does not retrieve. The model writes to a store (summaries, notes, facts) and later reads from it through tool calls. The store lives in a database, a file, or a structured document. The harness exposes read and write as tools the model can invoke the way it invokes a calculator. MemGPT (Packer et al. 2023) is the canonical research framing, built around the metaphor of a virtual memory system with pages swapped in and out of a main context by the model itself. In the paper workflow, this looks like the model choosing to save a note (“section 4’s proof sketch skips a case”) when it notices something worth saving, and choosing to read that note back when a later question touches section 4.

Layer 5. Vendor product memory. ChatGPT memory, Claude memory, Gemini context caching. Each of these is a product-layer feature sitting on top of layer 3 or layer 4 or a hybrid. The user experience is “the assistant knows things about me across sessions.” The implementation is some combination of a user-specific data store, a retrieval step, and an injection step that happens before the model sees your message. Each vendor placed a different architectural bet on what goes in the store, who can see it, who can edit it, and when it gets pulled into the forward pass. We spend most of this chapter on layer 5 because it is where the pluralism matters.

Name the layer before using the word. “The model remembered” is operator-hostile phrasing. “The harness retrieved three entries from the user’s memory store and appended them to the system prompt” is the phrasing you want. The first hides the bug. The second tells you which piece to inspect.

These five layers are not competing theories. They are cooperating pieces of the same stack. A production chat product typically runs layer 1 inside its serving engine, uses layer 2 for the current session, layer 3 for session-spanning history, and layer 5 for explicit user-memory features. Layer 4 shows up when operators want the model to actively curate what is remembered, which is a different workload from “remember the conversation so far.”

What they share is the thing we will keep returning to: the model itself is stateless. Every layer is a different pattern for putting tokens into the next forward pass. The differences are about which tokens, chosen by whom, selected how, owned by whom. Ownership and inspectability are doing more work in the 2026 landscape than any of the engineering details.

14.2 Why continuity is faked, not built in

This section is the operational restatement of Section 1.2. Chapter 1 argued the point from the outside, in how API calls look on the wire. The cognitive-model chapter argued it from the inside, in the function that runs when the server handles a request. Here we turn it into an engineering claim for this chapter: there is no hook anywhere inside a transformer for persisting state between requests, and every production “memory” feature is a layer outside the forward pass, staging tokens.

The weights are frozen after training (Ouyang et al. 2022). Nothing you say to the model in February 2026 nudges any weight. In-context learning (Brown et al. 2020) lets the model behave as if it learned from the current conversation, but that behavior lasts exactly as long as the tokens representing that conversation sit in the context. Drop those tokens from the next request and the learning is gone. The model does not notice it is gone. It has nothing to notice with.

Picture what this means for the paper summary. You spent three sessions building up a rich understanding with the assistant: which sections to trust, which to double-check, which to skip. None of that stuck to the model. It all lived in the transcript. If the wrapper trims the transcript to fit the context budget next Tuesday, the understanding evaporates along with the tokens. Not because the model forgot. Because the tokens were never the model. They were the model’s only input.

So when a vendor ships a feature called “memory,” you can ask five concrete engineering questions and fully characterize the architecture:

What goes into the store, and when?
What format is the store?
Who can read it?
How is it selected for injection into the next forward pass?
Who can edit or delete it?

Every memory system in this chapter is describable as a specific set of answers to those five questions. The different vendor bets are different answers.

14.3 Scratchpads and agent-driven memory

MemGPT (Packer et al. 2023) reframed the problem as an analogy to operating-system virtual memory. The model has a fast but small “main context” (the current prompt) and a slow but large “archival store” (an external database). When the main context fills up, older material gets paged out to the archive. When the model needs something from the archive, it issues a read. The harness hides the paging mechanics behind tools the model sees as ordinary function calls. Write and read, save and recall, the same verbs you would use about a notebook.

Here is what that looks like in practice. Say you hand a MemGPT-style agent a stack of thirty research papers and ask it to compare methodologies across them. Session one, paper three: the model reads the paper, decides one of its design choices is relevant for cross-comparison, and issues memory.save("paper-03", "uses bootstrap resampling with n=1000 on a non-stationary time series"). Session four, paper eighteen: a different paper, different methods, and the model (prompted to check for conflicts with what it has seen before) issues memory.recall("bootstrap on time series"). The harness runs a similarity search over the saved notes and hands back the entry about paper three. The model now has, in its current context, exactly the prior note it needs to flag that paper eighteen is doing the same questionable thing paper three did. No “memory module” anywhere in this. The store is a database. The retrieval is similarity search. The reasoning happens in the model’s next forward pass over a prompt that now includes both notes.

The authors later productized the method as Letta (letta.com). The 2023 paper is the research citation; Letta is what operators run in 2026 if they want this pattern off the shelf. The rebrand matters for literature searches, because the research corpus mixes both names: MemGPT as a research idea, Letta as an active product.

Two post-MemGPT papers extend the pattern along different axes. A-MEM (Xu et al. 2025) organizes the store as a zettelkasten [a note-taking system, popularized by the sociologist Niklas Luhmann, where every note is a small atomic unit with explicit links to related notes], with links between notes that the model itself writes and revises. The intuition: a flat store asks the model to do all the retrieval work at read time; a linked store lets the model do some of that work when it writes, by building relations as they become visible. Mem0 (Chhikara and Others 2025) takes a more production-engineering angle, layering extraction, consolidation, and retrieval steps with a graph variant for more complex relational queries. Mem0’s paper reports large token savings against context-stuffing and a 26 percent quality improvement against a baseline OpenAI memory configuration, measured with an LLM-as-judge.

A note on that last number. Mem0 is a commercial product pitched by a company with reasons to report favorable numbers, and the evaluation uses a model-as-judge on their own benchmark. Independent replication on a neutral benchmark is not yet in the literature as of April 2026. Run Sagan’s baloney-detection questions on the 26 percent: is it reproducible (not yet), does a simpler explanation fit (maybe the baseline was weak), who benefits from it being true (the company publishing the paper). None of those checks kill the claim. All three demand we hold it loosely. We cite Mem0 because it is the most visible production-grade implementation and because the paper documents a specific pipeline honestly, not because the 26 percent figure is the settled comparison. We do not fully know how much of that headline number would survive an outside run. We suspect not all of it.

Vendor-aligned memory papers deserve a second look at the benchmark. When the authors also sell the product, the burden of independent replication shifts to the reader. Treat self-reported numbers as upper bounds and discount them until someone without skin in the game runs the same experiment on a neutral corpus.

Scratchpad-style memory is not the default in chat products, and there is a reason. It asks the model to manage memory deliberately, which means extra tool calls, extra round trips, and a harness that enforces the schema of the store. For a chatbot where the average user never explicitly says “remember this,” the model ends up doing memory-management work the user never asked for, and the tool calls themselves cost tokens. Scratchpads shine in agent workloads where the model runs for hours over a continuous task, which is why Letta’s customer base skews toward autonomous agents rather than consumer chat.

Apply this to the paper-summary task. If the chat session is a single sitting with one paper, you do not need a scratchpad; resubmit the transcript. If a research assistant agent is reading thirty papers over a week, saving structured notes per paper and later pulling them back to compare methodologies, the scratchpad pattern earns its complexity. Same task, different cardinality, different memory pattern.

14.4 Vendor memory features, architecturally

The three major frontier vendors have shipped memory features that look similar in the marketing copy and differ sharply in the engineering. The differences matter because each one expresses a different bet about who should own persistent user state, how inspectable it should be, and what the operator’s relationship to that state looks like. We take each in turn and then step back to see what the pluralism teaches.

Keep the paper in mind while reading these three. The same concrete expectation, “the assistant should remember which sections of my paper I flagged as suspect three sessions ago,” produces three sharply different answers depending on which vendor you are talking to. That is the pedagogical payoff.

14.4.1 ChatGPT memory

OpenAI shipped ChatGPT memory in February 2024 (OpenAI 2024). The original version had two visible pieces: “Saved memories,” explicit items the user or the model had promoted to a persistent list, and a more diffuse “chat history” layer the assistant could reference in general terms. In April 2025, OpenAI upgraded the feature so ChatGPT could reference the contents of all past conversations for a logged-in user, not just the explicit saved items. In June 2025, the full feature reached the free tier.

Architecturally, ChatGPT memory is a vector-indexed store of user-specific content, queried by the current conversation’s embedding, with results pasted into the system prompt before the model sees the user’s current turn. The user can view a list of saved items in settings and delete individual entries. The broader “chat history” layer is not directly inspectable. What gets retrieved and pasted on any particular turn is opaque; the user sees the assistant’s behavior but not the staged context that produced it.

Run the paper task through this. Three sessions ago, you told ChatGPT “section 4’s proof sketch is suspicious, skip it when you summarize.” You did not save anything explicitly. Next Tuesday, you open a new chat, ask for a fresh summary of the same paper, and get back something that carefully works around section 4. From your seat it feels like the assistant knew. What happened under the hood: your new message’s embedding landed near your old message about section 4, the retrieval layer pulled the old turn in, the system prompt quietly grew a paragraph you could not see, and the model conditioned on that paragraph. If the retrieval misses, you will never know why.

OpenAI’s bet is on frictionlessness. The default user experience is that the assistant quietly gets better at knowing you over time. The cost of that default is that the user cannot cleanly see what the assistant knows, and OpenAI, not the user, makes editorial decisions about what is worth remembering.

14.4.2 Claude memory

Anthropic took a different approach. Claude memory (Anthropic 2025a) stores persistent context as human-readable markdown [a plain-text format with light punctuation marks that render as headings, lists, and emphasis] files, broadly in the shape of the CLAUDE.md files Claude Code has used as project context since 2024. A user’s memory is a set of text files the user can read, edit, delete, and version. The model reads them at the start of a relevant session and can write to them when asked.

The rollout was staged. Team and Enterprise plans got memory in September 2025. Pro and Max plans followed in October 2025. Free-tier access landed in March 2026. At each stage, the memory store kept the same shape: markdown files, owned by the user, organized by the user or the model.

There is also a separate API-level capability, the memory tool (Anthropic 2025b), which exposes read and write operations as tools the model can invoke explicitly. The store in that case is caller-managed: the operator decides where the files live, what the retention policy is, and which tool calls the model is permitted. Architecturally this sits closer to MemGPT’s scratchpad pattern than to ChatGPT’s vector store.

Run the paper task through this. You tell Claude “remember that section 4 of this paper is suspicious.” Claude writes a line into a file called something like paper-notes.md: “Section 4’s proof sketch skips a case. Skip when summarizing.” You can open that file, read that line, edit it, delete it, share it with a colleague. Next Tuesday, Claude reads the file at session start, sees the line, and avoids section 4. If it misses, you know where to look. If it gets the line wrong, you can fix the line.

Anthropic’s bet is on inspectability and portability. Memory lives in plain text the user owns. The downside is friction: users who do not curate end up with sparser memory than the opaque-vector approach would give them automatically.

14.4.3 Gemini context caching

Google’s entry is a different beast. Context caching (Google 2026) is not a user-memory feature in the ChatGPT or Claude sense at all. It is a cost-latency optimization for repeated context within a short window. The operator (not the user) uploads a block of context, receives a handle, and references that handle on later requests. Cached tokens are billed at roughly 10 percent of the normal input price. As of April 2026, Gemini 2.5 Pro supports caching up to two million tokens, enough to hold the collected works of Shakespeare with room for a small library of supporting material around it.

There is also an implicit caching layer that picks up common prefixes automatically, without the operator managing the handle. Functionally, this is a serving-system KV cache exposed as a user-visible API, with price and expiration as first-class concepts.

Run the paper task through this. You upload your forty-page paper, get back a cache handle, and then ask twenty questions about it over the next hour. Each question pays the 10 percent cached rate on the paper’s tokens plus full rate on the new question. Pennies per query, fast. But stop for four hours and come back: the cache has expired, you re-upload, you pay full rate on the re-upload. This is not memory. It is a receipt for last hour’s upload. It does not know you flagged section 4. It does not know your name. It knows it saw these tokens recently and can charge you less to re-attend to them.

Google is placing a different bet: long context plus cheap caching makes per-user memory features unnecessary for a large class of workloads. If an application can load an entire user’s document corpus into a two-million-token context and ask questions of it for pennies per query, the case for a separate memory system gets weaker. Whether the case holds up depends on effective-context measurements we revisit in Section 14.6.

Calling Gemini context caching “memory” is a stretch in the architectural sense. It is memory in the marketing sense because the user-visible effect (cheap repeat access to prior context) feels similar. Pinning this to layer 5 of the taxonomy is generous. We mention it because the vendor and press both treat it as part of the memory landscape, and pretending otherwise would be dishonest.

Three vendors, three bets. OpenAI is betting that users want memory without work. Anthropic is betting that users want memory they can see and edit. Google is betting that if context is cheap enough, memory becomes a non-problem. These are not small differences of UI. They are different answers to who owns your state and how inspectable it should be. Treat the divergence as the pedagogical point, not noise.

14.4.4 The pluralism is the lesson

A tempting move, writing this chapter, would be to rank the three approaches. We are not going to. Each bet is coherent, each has failure modes the others do not, and the literature has not produced a rigorous head-to-head evaluation of ChatGPT memory, Claude memory, and Gemini caching on a shared benchmark. Claims that one vendor is “better” at memory than another, as of April 2026, are arguments about user experience or about per-vendor self-reported numbers.

This is a genuine open problem in the field. No peer-reviewed paper compares vendor memory features on the same workload with independent evaluation. The Mem0 paper comes closest (Chhikara and Others 2025) and pits their own system against an OpenAI baseline, which is not the cross-vendor study we want. A first-party Synoros benchmark that runs the same paper-summary workload against all three vendors, plus a context-stuffing baseline and a scratchpad baseline, on the same local model’s back end where possible, would be a meaningful contribution. We flag that as open work.

14.5 MCP memory and memory as a tool

MCP (Chapter 12) gives operators another place to put memory. A “memory server” exposed over MCP looks, to the model, like any other tool. Calls take the shape memory.save(key, value) or memory.recall(query). The store behind those calls can be anything: a vector index, a SQLite database, a flat file. The tools are just the interface.

Several reference memory servers exist in the MCP ecosystem, typically implementing an opaque vector store with save and search endpoints. The Anthropic memory tool (Anthropic 2025b) is the same pattern, surfaced as a first-party API capability rather than via MCP. In both cases, the design decision is to let the model explicitly decide what to save and when to recall.

This is a different shape from ChatGPT’s or Claude’s user-memory features. Those are automatic: the system decides what to remember. MCP memory tools are manual: the model decides what to remember, constrained by whatever the harness and the prompt push it to do. Both are legitimate; they fit different workloads. Automatic memory fits consumer chat where the user should not have to think about curation. Manual memory fits agent workloads where the model is managing long-running state deliberately.

Take the paper workflow again. If a research-assistant agent with an MCP memory server is reading fifty papers over a literature review, it decides, paper by paper, what to save: a methodology note here, a results summary there, a flag when something contradicts an earlier paper. When the model later asks itself “has anyone else in this corpus run a bootstrap on a non-stationary series,” it issues a recall call, gets the relevant prior notes back, and conditions its next answer on them. The same pattern as the scratchpad section, but standardized over a transport that any model, any tool, any harness can speak.

The one thing operators should not do is turn on both simultaneously without knowing what each is writing. We have seen production systems end up with the ChatGPT-style opaque store and an MCP memory server, both trying to remember the same things, producing double-injection and quiet contradictions when the two stores disagreed. The fix is always the same: name the layer, decide who owns what, delete the second one.

14.6 RAG against long context, for memory workloads

The live architectural debate is whether memory should be a retrieval problem or a context-window problem. One school says: stuff the user’s full history into every call using a long context window and let attention sort it out. The other school says: retrieve the relevant bits and only stuff those. The first school points at Gemini 2.5 Pro with its two-million-token window, roughly the combined text of the Harry Potter and Lord of the Rings series read end to end. The second school points at lost-in-the-middle [a well-documented failure mode in which transformers attend most reliably to the start and end of a long context and least reliably to the middle] (Liu et al. 2024) and at the cost of two million input tokens per request.

Both schools have a point.

The long-context case has four strong arguments. It sidesteps retrieval-quality failures entirely, because there is no retriever. It keeps all context visible to attention, which matters for queries that cross-reference many parts of the history. With explicit caching, the per-call cost drops to roughly 10 percent of naive long-context pricing, which narrows the cost gap sharply. And from an engineering-simplicity standpoint, removing the retrieval pipeline removes an entire category of bugs.

The retrieval case has four strong arguments in return. Advertised context length is not usable context length; lost-in-the-middle (Liu et al. 2024) remains a measurable effect in 2026 evaluations, and the middle of a long context is where dated information tends to live. Even at 10 percent of input price, two million tokens per call multiplies up for heavy users. Retrieval lets you do hybrid scoring and reranking in ways attention alone cannot. And for privacy and auditability, retrieving a few chunks tells a far cleaner story than “we loaded your entire history into every call.”

The honest synthesis, as of April 2026, is that the right answer depends on workload. For a single-document agent reading a long contract and answering many questions about it, long-context with caching is the better default. Retrieval adds complexity for little benefit; the whole document is the memory, it fits, and caching makes the cost tolerable. For a consumer chat assistant working across years of a user’s conversation history, retrieval is the better default. The history is too large, lost-in-the-middle is real, and retrieval lets you rank by relevance in ways attention cannot cheaply replicate. Mixed systems are common: cache the document, retrieve over the history. Your forty-page paper summary sits squarely in the first camp. The paper fits in the window, questions about it benefit from attention seeing the whole thing, and caching makes repeat queries cheap. Build a custom retriever over it only when you outgrow one paper.

“RAG vs long context” is a workload question, not a religion. For bounded corpora where the whole thing fits, long context plus caching is simpler and often cheaper. For unbounded histories where relevance varies sharply by query, retrieval is the only pattern that scales. The right answer is whichever pattern matches your workload’s cardinality and access shape.

One more systems note. The lost-in-the-middle effect (Liu et al. 2024) is not a bug the vendors forgot to fix. It is a property of how attention distributes weight across positions, documented repeatedly in 2023, 2024, and 2025 evaluations across model families. It has eased somewhat in newer models with better long-context training data and position-encoding refinements, but it has not disappeared. We do not fully know why the middle suffers more than the edges even in newer architectures. The research consensus is that it comes from attention and positional encodings interacting, not a bug any one vendor can patch. The 2026 rule of thumb is that a model with a two-million-token window usefully retrieves, for a single question, from the range where attention still concentrates, which tends to be much smaller than the advertised window. Measure your workload.

14.7 An opinionated operator heuristic

Having laid out the space, the chapter earns a recommendation. Here is ours.

If you are building an LLM-backed application in 2026 and you are asking yourself “do I need a memory system,” the right first question is which workload you are building for. The following heuristic covers the common cases without pretending to cover all of them.

Session memory only. If your workload is one conversation at a time, no cross-session continuity, use layer 2. Resubmit the transcript. Trim oldest turns when you hit the context budget. Do not build anything fancier until you have evidence it would help.

Cross-session user memory. If your users expect the assistant to remember things about them across sessions, pick between two patterns. Opaque-vector (ChatGPT-style) if your users are consumer-shaped and will not curate; markdown-file (Claude-style) if your users are developer-shaped and want to see and edit what is remembered. If you are building against a vendor API, use the vendor’s feature. If you are building against a local model, a markdown-file pattern is easier to debug and cheaper to run than a vector store.

Bounded reference corpus. If the memory is a fixed document or set of documents (a legal contract, a product manual, a user’s project), use long context with explicit caching when the provider supports it. For providers without caching, chunk and retrieve. Do not build a custom “memory system” for this; it is RAG over a document, not memory.

Autonomous agent with long-running state. Use the scratchpad pattern (MemGPT / Letta / A-MEM / Mem0). Automatic user-memory features do not fit here because you want the model to be deliberate about what it remembers, and you want the memory schema to be inspectable for debugging.

Mixed. Most real production systems are mixed. That is fine. The rule is that each memory source has a named layer, a named owner, and a named injection point in the prompt. If you cannot sketch that on a napkin, you have not designed the memory system yet; you have accumulated one.

Run the paper task through this heuristic once, as practice. One session reading one paper: layer 2, resubmit. Multiple sessions on the same paper over a week, where you want the assistant to remember which sections you flagged: long context plus caching beats a custom memory system. An agent reading fifty papers for a literature review and comparing methodologies across them: the scratchpad pattern earns its keep. Same verb (“summarize the paper”), three different memory architectures, picked by cardinality and access shape.

The memory system is an architectural decision, not a feature toggle. The vendor UI makes memory look like a switch you flip. Architecturally it is closer to a schema choice: what goes in, what comes out, who owns it, when it gets staged. Treat it like a schema. Design it before you build against it.

14.8 Why this chapter closes Part III

Part III has walked the operator’s augmentation stack: retrieval in Chapter 10, chunking in Chapter 11, tool use and MCP in Chapter 12, orchestration in Chapter 13, and now memory. The chapters cover different surfaces of the same underlying pattern. The model is a stateless function. Everything useful we do on top of it is a way of deciding which tokens land in the next forward pass. Retrieval chooses which documents. Chunking chooses how documents are sliced. Tool use extends the context with function outputs. Orchestration decides when and how to call the model at all. Memory decides which pieces of the user’s history come back on a later call.

Seen this way, memory is not a separate problem from retrieval, tool use, and orchestration. It is the same family of operation applied to a different substance: the user’s own past. ChatGPT memory is a retrieval pipeline over a user-specific store. Claude memory is a file-reading tool the model invokes at session start. Gemini caching is a serving-system optimization dressed as a product. Scratchpad memory is tool use with a conventional schema. MCP memory is tool use over a standardized transport. In every case the mechanism underneath is one of the four we have already built. Different substances, one family of operation.

That is the payoff for walking Part III in this order. By the time you reach memory, you have all the parts. The job here was to make visible that memory in the 2026 product landscape is not a new thing. It is a specific arrangement of existing things, and the vendor divergence reflects different bets about how to arrange them.

Anthropic. 2025a. Memory for Claude (Team, Enterprise, Pro, Max). Anthropic documentation + announcements. https://www.anthropic.com/news/memory.

Anthropic. 2025b. Memory Tool (Claude API). Claude API documentation. https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool.

Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165.

Chhikara, Prateek, and Others. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. https://arxiv.org/abs/2504.19413.

Google. 2026. Context Caching (Gemini API). Google AI for Developers documentation. https://ai.google.dev/gemini-api/docs/caching.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.

Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.

OpenAI. 2024. Memory and New Controls for ChatGPT. OpenAI News. https://openai.com/index/memory-and-new-controls-for-chatgpt/.

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155.

Packer, Charles, Sarah Wooders, Kevin Lin, et al. 2023. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. https://arxiv.org/abs/2310.08560.

Xu, Wujiang, Zujie Liang, Kai Mei, and Others. 2025. “A-MEM: Agentic Memory for LLM Agents.” Advances in Neural Information Processing Systems 38 (NeurIPS 2025). https://arxiv.org/abs/2502.12110.