Part III — Agency & Augmentation

RAG Without the Buzzwords

If the model cannot be trusted to remember a fact, paste the fact into the prompt every time you ask.

Here is the whole chapter in one sentence you already know how to use. Section 1.4 told you the model is not a database of facts. Section 1.1 told you it is a stateless function over a context window. String those two together and the consequence writes itself: if a fact matters to your answer, and you cannot trust the weights to produce it reliably, the only move left is to fetch the fact at request time and paste it into the prompt before the model runs. That is retrieval-augmented generation. Everything else in this chapter is plumbing.

The acronym “RAG” has picked up a lot of product-marketing barnacles since Lewis and colleagues introduced it in 2020 (Lewis et al. 2020). Vendors sell “RAG platforms.” Consultants argue about “semantic search magic.” People compare vector databases the way they used to compare NoSQL engines. Every time a model’s context window grows, somebody declares RAG obsolete. Ignore all of it. The mechanism is boring. The engineering tradeoffs are concrete. There is nothing mystical about cosine similarity.

Stay with the running example. You have a forty-page research paper open on your screen and you want a five-bullet summary. But the paper leans on dozens of earlier papers for its argument. Some of those citations are load-bearing. A summary that gets them wrong is worse than no summary at all, because it invents a story that sounds right. The only way to produce a faithful summary is to pull the right passage from the right cited paper at the right moment, and drop it into the model’s view before the model writes the next word. That is a retrieval problem. It is this chapter’s problem.

10.1 Why retrieval at all

Picture the model the way Chapter 2 drew it. A sequence of tokens goes in. A probability distribution over the next token comes out (Section 3.1). The weights [the trained numbers that make the model work] froze months ago, at training time. They came out lossy and tangled (Templeton et al. 2024), and they were tuned against a loss function that rewards confident guesses over saying “I don’t know” (Kalai et al. 2025).

Now point that machine at your forty-page paper. Ask it: “What does reference 17 actually claim?” Reference 17 is a 2019 paper on a specific benchmark. The model was trained, at best, on an old copy of the internet that may or may not have included that paper. The weights do not hold that paper cleanly. They hold a lossy, blurred impression of millions of papers that look like it, and a training incentive that pushes them to guess rather than abstain. You are asking an oracle that is not allowed to admit ignorance to recall a specific fact it does not reliably know.

There are only two ways out. One is to change the weights: fine-tune the model on the paper and its references, or keep pre-training it on your own corpus. That bakes the facts back in, at the same lossy resolution as the rest of the model’s knowledge, at real cost, and it has to be redone every time the corpus changes. The other is to change the input: find the relevant piece at request time, and paste it into the context window [the block of prior text the model can see at once] before the model starts guessing the next word. RAG is the second path. That is the entire idea.

The distinction matters because it tells you which problems RAG can fix and which it cannot. HalluLens (Ravichander et al. 2025) splits hallucinations into two kinds. Extrinsic hallucinations are answers you cannot ground in any real source. The model invents a paper, a date, a quote. Intrinsic hallucinations contradict the context the model was given. You hand it a passage saying X, and its summary says not-X. RAG fixes the extrinsic kind by design. If the ground truth sits in the prompt, the model does not have to invent anything. RAG does nothing for the intrinsic kind. Once the passage is in front of the model, the model can still ignore it or garble it. We come back to that failure at the end of the chapter.

The shape of the problem is narrow and specific. You need to ship the right one to five thousand tokens into the context window, on demand, from a corpus too large to fit whole, for every request. One to five thousand tokens is roughly one to ten pages of prose, the size of a short chapter or a couple of academic pages, out of a corpus that might be ten thousand pages. You need the right chapter, not any chapter. That is the whole job. Vector databases, embedding models, rerankers, query rewriters, and the rest of the zoo are techniques for doing that job well.

RAG is not a feature, it is a shape of prompt assembly. The model does not know it was “RAG’d.” It sees a context window with some retrieved passages in it, the same way it sees a system prompt or a conversation history. Everything upstream of the forward pass is your engineering. This reframe is what Section 1.4 was setting up. When a fact matters, fetch it at runtime, then stage it into the context the model sees.

Gao and colleagues’ survey (Gao et al. 2024) sorts the field into three buckets. Naive RAG embeds the query, retrieves the top K passages, stuffs them into the prompt. Advanced RAG adds pre- and post-retrieval cleanups like query rewriting and reranking. Modular RAG builds routers, runs retrieval in multiple stages, lets the model retrieve again based on what it just read. The buckets matter less than the observation that every system in all three is still doing the same thing: assembling a prompt before the forward pass. The paper-summary task sits squarely in this frame. You need to turn “summarize this paper, faithfully, including what it says about reference 17” into “here is the paper, here is the relevant passage from reference 17, now write the summary.” The buckets differ in how much work the pipeline does before the model sees anything.

10.2 Embeddings and cosine similarity

Before we can retrieve the right passage, we have to tell a computer what “right” means. Keyword matching gets you partway. Maria opens your forty-page paper and asks, “what is the authors’ refund policy for the dataset access?” One of the paper’s sections contains the word “refund.” A plain string search surfaces it. Done.

But the section might say “returns policy” instead, or “compensation scheme,” or “we refunded participants who withdrew.” Keyword search misses all three. A human reader sees those as the same idea. A string-matching computer does not. We need matches based on meaning, not spelling.

The fix, which has been kicking around since word2vec (Section 2.5) and grew up during the BERT era, is to represent each piece of text as a list of numbers, arranged so that texts with similar meanings get lists that point in similar directions. Meaning gets coordinates. Similar coordinates, similar meaning. That is the whole trick.

10.2.1 What an embedding is

An embedding is a vector representation of meaning. A vector, from Section 3.2, is a list of numbers. A typical modern text embedding is a list of between 384 and 4,096 real numbers, produced by running the text through a neural network that was trained to make that list useful for comparing similarity.

“cat” might come out as [0.021, -0.334, 0.187, ..., 0.049]. “kitten” comes out as some other list, different in detail but pointing roughly the same way in the 4,096-dimensional space. “refrigerator” comes out pointing somewhere else entirely. You cannot picture a 4,096-dimensional space. You do not need to. The model cares about one property, and the property is directional: do these two lists of numbers point in similar directions. If yes, similar meaning. If no, different meaning.

Two vectors pointing at right angles [orthogonal] encode unrelated meanings. Two pointing in opposite directions would encode opposite meanings, but in practice that case rarely shows up for real text. Most real passages land in a narrow cone of the embedding space, which is why the interesting question is not “similar or opposite” but “how similar.”

Back to the paper-summary task. Chop your forty-page paper into, say, eighty chunks of roughly half a page each. Run each chunk through an embedding model. You now have eighty vectors, each a list of a few thousand numbers, sitting in a file somewhere. The chunk about dataset collection sits near the chunk about ethics review, because the embedding model has learned that dataset ethics and dataset collection live near each other in meaning-space. The chunk about the 2019 benchmark sits near the chunks of related work from other papers that discuss the same benchmark. You have turned a forty-page paper into a geometry problem.
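A minimal sketch of the chopping step, under loud assumptions: it splits on whitespace and cuts every 250 words, which ignores paragraph and section boundaries, and it assumes a page is about 500 words. The function name chunk_words and the chunk size are illustrative, not a recommendation.

```python
def chunk_words(text, words_per_chunk=250):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]

# Stand-in for a forty-page paper: 40 pages x 500 words per page (assumed).
paper_text = " ".join(f"w{i}" for i in range(40 * 500))
chunks = chunk_words(paper_text)
chunk_count = len(chunks)
```

Each string in chunks is what you would hand to the embedding model. Real pipelines usually respect paragraph or section breaks instead of cutting mid-sentence.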

Reimers and Gurevych’s Sentence-BERT (Reimers and Gurevych 2019) made this practical. Before Sentence-BERT, comparing two sentences with BERT meant feeding both through the model together, one forward pass per pair. Comparing one query against ten thousand documents meant running BERT ten thousand times, and finding the best-matching pair inside a collection that size took tens of hours on good hardware. Sentence-BERT trained a siamese network, two twin encoders that each produce one embedding per sentence independently. You embed the ten thousand documents once, store the vectors, then compare any new query against all of them with ten thousand dot products [a cheap arithmetic step between two lists of numbers] against a precomputed index. Hours became seconds for the all-pairs job, and a single query became milliseconds. That shift, from pairwise comparison at query time to bi-encoders [encoding query and documents separately, comparing afterward] with precomputed indexes, is why semantic search at scale exists.

10.2.2 Cosine similarity, without the algebra

Cosine similarity measures how closely two vectors point in the same direction, ignoring their lengths. Picture two arrows drawn from the same point. If they overlap perfectly, cosine similarity is 1. If they sit at right angles, it is 0. If they point in opposite directions, it is -1. Length does not count. Only direction.

In practice, text embeddings live in a regime where almost all cosine similarities come out positive. The interesting distinction is between 0.2 (loosely related) and 0.8 (strongly related). The number has no absolute meaning. A cosine similarity of 0.4 is small for some embedding models and respectable for others. What matters is the ranking: which of these eighty chunks has the highest similarity to the query.

Mechanically, cosine similarity divides the dot product of two vectors by the product of their lengths. The division cancels out magnitude. You are left with a pure measure of alignment. Most modern embedding models already output unit-length vectors, which means cosine similarity collapses to a plain dot product and runs on the same hardware that runs the model itself. The whole retrieval step, in its simplest form, comes down to a matrix multiply.
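The arithmetic fits in a few lines. This is a sketch in plain Python with toy two- and three-dimensional vectors invented for illustration; a real system would use an array library and vectors with hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    return dot(a, b) / (length(a) * length(b))

def normalize(v):
    """Rescale a vector to unit length without changing its direction."""
    n = length(v)
    return [x / n for x in v]

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
right_angle = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

# For unit-length vectors the division is a no-op: cosine equals the dot product.
u, w = normalize([3.0, 4.0]), normalize([4.0, 3.0])
unit_dot = dot(u, w)
unit_cosine = cosine_similarity([3.0, 4.0], [4.0, 3.0])
```

The last two values agree up to floating-point rounding, which is the reason pre-normalized embeddings let you skip the division entirely.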

The operator-facing consequences:

  • Rank your chunks by cosine similarity to the query’s embedding, keep the top K. That is the entire retrieval step in the simplest version of the pipeline. For the paper-summary task, you embed the query “what does reference 17 claim,” rank the eighty chunks by similarity, keep the top five.
  • Retrieval quality is bounded by embedding quality. If your domain contains jargon the embedding model has never seen, the vectors will fail to separate meanings that actually differ. Your top-K will be noise. A paper on a 2026 biology technique may contain terms the embedding model was last updated before.
  • Karpukhin and colleagues’ Dense Passage Retrieval (Karpukhin et al. 2020) showed that a dual-encoder trained on question-answer pairs beats a strong BM25 keyword baseline by 9 to 19 percentage points in top-20 passage retrieval accuracy on open-domain question-answering benchmarks. That result seeded years of dense-retrieval research. It is also, as the next section shows, a partial result: it held on the distribution DPR was trained against. It did not hold everywhere.
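The first bullet, rank by similarity and keep the top K, can be sketched as follows. The chunk names and two-dimensional unit vectors are made up; the sketch assumes every vector is already unit length, so the dot product is the cosine similarity.

```python
def top_k(query_vector, chunk_vectors, k):
    """Return the k chunk ids whose vectors score highest against the query.

    chunk_vectors maps chunk id -> unit-length vector (assumed pre-normalized).
    """
    scores = {
        chunk_id: sum(q * c for q, c in zip(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy index of three chunks; a real one would hold eighty vectors or more.
chunk_vectors = {
    "collection": [1.0, 0.0],
    "reference_17": [0.6, 0.8],
    "ethics": [0.0, 1.0],
}
query_vector = [0.8, 0.6]  # pretend embedding of "what does reference 17 claim"

best_two = top_k(query_vector, chunk_vectors, k=2)
```

At corpus scale the dictionary comprehension becomes a single matrix-vector product, and past a few million vectors an approximate nearest-neighbor index replaces the full sort.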

10.2.3 Choosing an embedding model

As of April 2026, the frontier closed-source embedding models (Voyage-3, Gemini Embedding, OpenAI’s text-embedding-3-large) rank near the top on public benchmarks but do not publish weights. The strongest openly available families are BGE and E5, both from research labs, with smaller dimensionality and comparable quality on MTEB [Massive Text Embedding Benchmark, the community leaderboard for embedding evaluation].

Three practical points will save you weeks:

  1. Use MTEB as a starting point, not an oracle. MTEB measures performance across a mix of tasks and domains that is probably not your mix. Pick the top three candidates from the leaderboard. Then test them on a fifty-example labeled sample from your own corpus before committing. If your corpus is forty-page academic papers, test on fifty real queries against real papers. If it is customer-service logs, test on fifty real customer turns. The leaderboard tells you who is in the running. Your own data tells you who wins.
  2. Embedding dimensionality trades storage for quality, weakly. A 1,024-dimensional embedding does not carry meaningfully more useful information than a 768-dimensional embedding for most corpora, but it costs about a third more to store. Default to the smaller one unless your own measurement tells you otherwise.
  3. Do not mix embedding families in the same index. A cosine similarity computed between a Voyage-3 vector and a BGE vector is noise. They were trained against different objectives, in different coordinate systems. If you change embedding models, you re-embed the whole corpus. That is one of the larger practical costs of changing your mind later.
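The test in point 1 can be as small as a recall-at-K count. The sketch below assumes each candidate model is wrapped in a hypothetical search function that takes a query and K and returns chunk ids; the canned results stand in for real retrieval, and four examples stand in for the fifty you would actually label.

```python
def recall_at_k(search_fn, labeled_examples, k=5):
    """Fraction of queries whose known-relevant chunk appears in the top k."""
    hits = sum(
        1 for query, relevant_id in labeled_examples
        if relevant_id in search_fn(query, k)
    )
    return hits / len(labeled_examples)

def canned_search(results):
    """Build a fake search function from a query -> ranked ids table."""
    return lambda query, k: results[query][:k]

# Hypothetical labeled sample: (query, id of the chunk that answers it).
labeled = [
    ("what does reference 17 claim", "c17"),
    ("which dataset was used", "c04"),
    ("what metric is reported", "c31"),
    ("who funded the work", "c02"),
]
model_a = canned_search({
    "what does reference 17 claim": ["c17", "c09"],
    "which dataset was used": ["c11", "c04"],
    "what metric is reported": ["c31", "c30"],
    "who funded the work": ["c40", "c41"],
})
model_b = canned_search({
    "what does reference 17 claim": ["c09", "c17"],
    "which dataset was used": ["c11", "c12"],
    "what metric is reported": ["c30", "c29"],
    "who funded the work": ["c02", "c41"],
})

recall_a = recall_at_k(model_a, labeled, k=2)
recall_b = recall_at_k(model_b, labeled, k=2)
```

Recall at K only asks whether the right chunk made the shortlist. If position inside the shortlist matters to you, a rank-aware metric such as MRR is the usual next step.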

We are deliberately not recommending a specific vector database. The comparison between Pinecone, Weaviate, Milvus, Qdrant, pgvector, and the rest is a real engineering decision. It depends on your scale, your operational preferences, and your existing stack. It ages poorly in print, and a book recommending a specific product today will embarrass itself in eighteen months. Treat vector storage the way you treat any other indexed key-value store. Pick the one you can operate. Move on.

10.3 Hybrid retrieval: BM25 still matters

Here is the unfashionable truth about dense retrieval. On a large fraction of real-world corpora, a 1990s-era keyword algorithm still beats it.

BM25 (Best Match 25) has been around for roughly thirty years and is the default ranking function for keyword search in Lucene and in the engines built on it, Elasticsearch and Solr. The algorithm is older than many people working in machine learning today. Robertson and Zaragoza’s monograph (Robertson and Zaragoza 2009) is the reference. The details involve term frequencies, document lengths, and inverse document frequency [IDF, a weight that punishes common words and rewards rare ones]. The intuition is simpler. Score a document by how many of the query’s rare words appear in it, weighted by how much each word matters, and discounted by how long the document is. A document that mentions a rare query word once, in ten lines, scores higher than one that mentions it twice in two thousand lines.
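For readers who want the details behind the intuition, here is a compact sketch of one common BM25 variant. Assumptions: whitespace tokenization on lowercased text, the widely used defaults k1 = 1.2 and b = 0.75, and the IDF form with 1 added inside the logarithm so weights never go negative. The three toy documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        # Longer-than-average documents get a larger penalty in the denominator.
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "we evaluate on dataset sku-7734b".split(),
    "we evaluate on the standard benchmark and report results on the dev split".split(),
    "the dataset is described in the appendix".split(),
]
scores = bm25_scores(["dataset", "sku-7734b"], docs)
```

The first document wins because it contains the rare identifier, which appears in one document of three and so carries a higher IDF weight than the more common word “dataset.”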

BM25 excels exactly where dense retrieval struggles. Exact-term matching. Rare technical tokens. Product SKUs, legal citation numbers, proper nouns that did not appear often in the embedding model’s training. Picture what happens with an unusual identifier: say your paper cites “SKU-7734B” as a dataset version. An embedding model that has never seen that token will produce a generic, vague vector for it, a vector that looks a lot like every other short code it has half-seen. Cosine similarity will surface every chunk that contains any short code, useful or not. BM25 treats “SKU-7734B” as a rare string with high IDF weight and finds the single chunk that contains it instantly. The paper-summary task hits this case often. A citation like “[Smith 2019]” is a rare token. A DOI is a rare token. A specific proof name, a specific dataset version, a specific parameter value like “N=1024,” are all rare tokens where BM25 wins.

Thakur and colleagues’ BEIR benchmark (Thakur et al. 2021) evaluated dense retrievers and BM25 across eighteen diverse datasets. The headline result was inconvenient for the dense-retrieval story. BM25 is a surprisingly strong baseline, and dense retrievers often lose to it outside their training distribution. A retriever fine-tuned on MS MARCO passages generalizes unevenly to biomedical text, legal text, or forum posts. BM25, which is not learned and therefore cannot overfit, holds steady. This is the first place in the chapter where the Sagan move helps: a vendor demo showing “dense retrieval beats BM25 by 15 percent on MS MARCO” is reproducible only on MS MARCO; on your own corpus, the number could go either way, and BEIR is the systematic evidence that it often goes the other way.

The mature operator move is not to pick one and be done. Run both and combine the results. This is hybrid retrieval. Dense retrieval catches the semantic match. BM25 catches the exact-term match. The union of their top candidates catches both cases.

The canonical way to combine them is reciprocal rank fusion, or RRF. For each candidate document, sum 1/(k + rank_in_system_1) and 1/(k + rank_in_system_2), where k is a small constant (60 is the common default). Sort by the sum. That is the whole algorithm. RRF is parameter-light and indifferent to the two systems using different similarity scales; you never have to align a cosine similarity of 0.73 with a BM25 score of 14.2. It is not the last word. Bruch and colleagues (Bruch et al. 2023) analyzed RRF against convex-combination alternatives and found that RRF is more sensitive to k than its reputation suggests, and that a convex combination tuned on a small labeled sample can beat it. Treat RRF as a strong default when you have no labels to tune against, and revisit the choice once you do.

Here is what that looks like in practice for the paper-summary task. Your query is “what dataset did the authors use for the 2019 benchmark?” Dense retrieval returns chunks ranked by semantic similarity: a chunk describing dataset collection (rank 1), a chunk on training procedures (rank 2), a chunk from the related work section that mentions “benchmark” (rank 3), and so on. BM25 returns chunks ranked by rare-word overlap: a chunk that contains the literal word “2019” and “benchmark” and a specific dataset name (rank 1), a chunk with the word “benchmark” (rank 2), a chunk with “2019” in a bibliography format (rank 3). RRF combines the two. The chunk that appears in both top-threes gets a high combined score. The chunks that appear only on one side still get considered but ranked lower. The reranker in the next section sees a small, high-quality candidate pool instead of a large, noisy one.
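The fusion step itself can be sketched in a dozen lines. The chunk ids below are invented labels for the kinds of chunks in the example, and k = 60 is the common default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id earns 1 / (k + rank) from every list it appears in, ranks start at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-three lists from the two channels.
dense_top = ["dataset_collection", "training_procedure", "benchmark_2019"]
bm25_top = ["benchmark_2019", "dataset_collection", "bibliography_entry"]

fused = reciprocal_rank_fusion([dense_top, bm25_top])
```

The two chunks that both channels returned rise to the top, and the single-channel chunks follow. Note that the function never looks at a similarity score, only at positions.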

Elastic’s production writeup on hybrid retrieval (Elastic, n.d.) reports measurable gains across their benchmark set. That is a vendor report on a vendor benchmark, which earns the Sagan question: who benefits from the number being high? Elastic sells a search product that does hybrid well. Take the numbers as an existence proof that it helps somewhere, not as a promise it helps by that exact amount on your data. What makes the story credible is not Elastic’s number; it is that most production teams independently find the same pattern. Hybrid beats either channel alone. The infrastructure cost is modest because you are already running a search index and a vector index.

If you are building a RAG system and have not tried BM25 yet, you are leaving recall on the table. The “dense retrieval always wins” story was never true. BEIR killed it in 2021. The 2020s literature has been slowly catching up to the operator folk wisdom. Use both. Fuse with RRF. Move on to the harder problems downstream.

10.4 Reranking with cross-encoders

The retrieval stage, whether dense or hybrid, runs fast because it runs dumb. Bi-encoders encode queries and documents independently and compare them with a dot product. That independence is what lets you precompute the document embeddings and scale to millions of items. It is also the cost. The query never gets to “look at” the document during encoding. Any interaction between query and document has to squeeze itself into the angular distance between two final vectors.

This is a coarse comparison. Two passages can land at similar cosine distance to a query for entirely different reasons. One might answer the question directly. The other might share vocabulary with the question without answering it. The first-stage retriever cannot tell them apart. The top twenty from a bi-encoder retriever will usually include the correct answer somewhere, but the correct answer is often not ranked first, and the top five might be dominated by passages that sound related without answering anything.

Here is what this looks like in practice. Your query is “what evaluation metric did the authors use?” The retriever returns, in order:

  1. A chunk saying “We evaluate using standard metrics and compare against baselines.”
  2. A chunk saying “Previous work evaluated with BLEU and METEOR.”
  3. A chunk, three pages later, saying “We report MRR@10 on the dev split.”

The third chunk is the right answer. The first is a near-match in vocabulary: it contains “evaluate” and sits near other evaluation prose. The second is about related work. A bi-encoder pulling vectors from a precomputed index cannot easily prefer chunk 3 over chunk 1, because from the outside their embeddings both look like “evaluation-adjacent prose.” What it needs is a model that looks at the query and each chunk together and asks, “does this chunk actually answer this question?”

That is a cross-encoder. Instead of encoding query and document separately, it runs them through the model together, as a concatenated pair, and predicts a single relevance score directly. Every token of the query can attend to every token of the document (Section 2.7). The model sees the interaction explicitly. The output is a number, not an embedding, which means you cannot precompute anything. You have to run the cross-encoder fresh on every query-document pair at request time.

That is why you do not retrieve with a cross-encoder. Running it against a million-document corpus would take minutes per query. You use it to rerank: retrieve a modest candidate pool first with your cheap bi-encoder plus BM25 hybrid (say, top 50 or top 100), then rerank those candidates with the cross-encoder to pick the real top 5 to 10 for the model.
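The control flow of retrieve-then-rerank is short enough to sketch. Both callables are hypothetical stand-ins: first_stage plays the hybrid retriever and returns (chunk id, text) pairs, and pair_score plays the cross-encoder. Here the scores come from a lookup table, where a real system would run a trained model on each query-chunk pair.

```python
def retrieve_then_rerank(query, first_stage, pair_score, pool_size=50, final_k=5):
    """Cheap retrieval over the corpus, expensive scoring over the shortlist."""
    candidates = first_stage(query, pool_size)  # list of (chunk_id, text)
    reranked = sorted(
        candidates,
        key=lambda item: pair_score(query, item[1]),  # one model call per pair
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in reranked[:final_k]]

# Toy shortlist, in first-stage order.
shortlist = [
    (1, "We evaluate using standard metrics and compare against baselines."),
    (2, "Previous work evaluated with BLEU and METEOR."),
    (3, "We report MRR@10 on the dev split."),
]
# Pretend cross-encoder outputs, invented for illustration.
pretend_scores = {shortlist[0][1]: 0.31, shortlist[1][1]: 0.12, shortlist[2][1]: 0.94}

def toy_first_stage(query, pool_size):
    return shortlist[:pool_size]

def toy_pair_score(query, text):
    return pretend_scores[text]

final = retrieve_then_rerank(
    "what evaluation metric did the authors use?",
    toy_first_stage, toy_pair_score, pool_size=50, final_k=2,
)
```

The cost lives in the key function: pair_score runs once per candidate, so pool_size sets your per-query reranking bill.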

Notice the shape of this. The same family of operation, “score how well this query aligns with this document,” gets applied at two different stages of the pipeline. The cheap, scalable version runs over the whole corpus. The expensive, accurate version runs over the shortlist. Wigner’s old observation applies: the same mathematical move works at different scales, and the engineering is in choosing which scale gets which variant.

The two-stage retrieve-then-rerank pattern is the production default for good reason. Reimers and Gurevych (Reimers and Gurevych 2019) established the vocabulary. Khattab and Zaharia’s ColBERT (Khattab and Zaharia 2020) and Santhanam and colleagues’ ColBERTv2 (Santhanam et al. 2022) explored a middle ground called late interaction [a scheme that keeps per-token embeddings and runs a lightweight comparison at query time], cheaper than a full cross-encoder but richer than a single-vector bi-encoder. As of 2026, cross-encoder rerankers dominate production usage, with late-interaction methods a respected minority.

On the open-source side, BAAI’s BGE-Reranker family (Beijing Academy of Artificial Intelligence, n.d.) (base, large, and v2-m3) is among the strongest freely available options. Clavié’s rerankers library (Clavié 2024) gives you a unified interface to swap between BGE, Cohere, Voyage, and Jina without rewriting your retrieval code. Closed-source options (Cohere Rerank, Voyage Rerank) lead the field on some benchmarks. Whether the quality delta justifies the per-call cost is a measurement question for your own corpus. Ask Sagan’s question here too: the vendors reporting those benchmark numbers are selling the rerankers. Run the comparison on your own evaluation set before you believe the vendor’s number.

Reranking is not free and not always helpful. Jacob and colleagues’ Drowning in Documents (Jacob et al. 2024) analyzed reranker behavior as the candidate pool grows and found that, past a point, rerankers start surfacing worse documents, not better ones. The intuition: rerankers were trained on query-document pairs that were plausibly related. They saw few adversarial distractors. When you feed a cross-encoder a pool of a thousand candidates, most of them noise, the model starts assigning spurious high scores to fluent-sounding but irrelevant passages. The practical consequence is that 25 to 100 candidates is a reasonable default for the pool size. Stretching to 500 or 1,000 “for safety” can degrade your top-K rather than improve it. More is not always better. Bigger pools carry more opportunity for the reranker to be fooled.

Back to the paper-summary task. You retrieve hybrid top-50 for the query “what does reference 17 claim?” The reranker runs on 50 query-chunk pairs and returns the top 5. Those 5 go to the generator. In a well-tuned pipeline the right passage is usually in the top 5. When it is not, the failure is almost always in the first-stage retrieval missing the chunk entirely, not the reranker mis-ranking it. That distinction matters for debugging. If the reranker is choosing poorly from its top 50, retrain or swap the reranker. If the right chunk never made it into the top 50, the failure is upstream in chunking or embedding.

Two-stage retrieve-then-rerank is the strongest shape for most pipelines. Hybrid retrieval (BM25 plus dense) for a 50-to-100 candidate pool, cross-encoder reranker for the final 5 to 10, generator consumes the reranked top-K. Everything fancier should be justified by a measured win on your own eval set, not by a blog post.

10.5 Query rewriting and HyDE

Retrieval quality depends on the query, not just the corpus. Users ask underspecified questions. They ask questions phrased nothing like the source documents. They ask compound questions that require multiple retrievals. If you point a user’s raw query at your retrieval system, you are leaving the rest of the pipeline to compensate for query-document mismatch, which it will do imperfectly.

Query rewriting is the family of techniques that transform the query, or generate auxiliary queries, before retrieval.

The simplest variants use the LLM itself. Query expansion asks the model to produce synonyms and related phrasings, then retrieves against the union. Query decomposition breaks a multi-hop question (“which of our APAC customers signed after the pricing change?”) into the sub-queries a database would need. Multi-query retrieval runs several rephrasings in parallel and fuses the results, typically with RRF. Each adds an LLM call to the critical path. You pay in latency. You gain in recall.

The most interesting single technique in this family is HyDE.

10.5.1 HyDE: embed the answer, not the question

Gao and colleagues’ Hypothetical Document Embeddings (Gao et al. 2023), HyDE for short, is one of those tricks that sounds silly until you realize why it works.

First, name the strangeness. User queries and source documents live in different distributions. Queries are short, interrogative, often fragmentary (“refund eligibility?”). Source documents are long, declarative, and contain the answer (“Customers may request a refund within 30 days of purchase, provided the following conditions are met…”). An embedding model trained on symmetric pair-similarity can still struggle to pull source passages to the top for queries that look nothing like them. A short question and a long answer end up in different cones of the embedding space, not because they disagree in meaning, but because their shape differs: one is interrogative and terse, the other declarative and dense. You built a geometry for measuring similarity between similar-shaped texts. You are now asking it to measure similarity between different-shaped texts. It does its best and falls short.

HyDE sidesteps the gap. Instead of embedding the query, you first ask a cheap LLM to write a hypothetical answer to the query: a plausible document passage that would answer the question if it existed. You embed the hypothetical passage and retrieve with that embedding. The hypothetical answer does not have to be factually correct, only linguistically close to what a real source passage would look like. A wrong hypothetical answer and a right real passage still land close in embedding space, because both are structured like answers, share the same vocabulary, and share the same register.

Here is what that actually looks like in practice for the paper-summary task. Your forty-page paper cites a 2019 result on a specific benchmark, and you want to pull the right passage from the cited paper to check the citation. The raw query is “the 2019 benchmark result,” four words, fragmentary, nothing like the prose a real paper passage would use. Most of the cited paper’s pages look nothing like those four words. You ask a cheap LLM to write a plausible one-paragraph passage that might appear in such a paper:

“On MS MARCO dev-small, the proposed model achieves 0.41 MRR@10, an improvement of 3.2 points over the prior state of the art [Nguyen 2018]. Gains are most pronounced on short queries, where…”

The passage is made up. It is not necessarily true of the actual paper. It is roughly the right shape: declarative, technical, using the vocabulary a real benchmark-result paragraph would use. You embed that hypothetical passage and run the retrieval. Real passages in the cited paper that describe real benchmark results sit much closer to this hypothetical passage than they sat to the four-word query. The hypothetical passage pulls the right real passages into view. There is no separate “semantic bridge module” anywhere in this. The trick is just: queries and answers live in different cones of embedding space, so convert your query into something answer-shaped before you embed.
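In code, the whole trick is one extra call before the embedding step. The three callables below are hypothetical stand-ins for a cheap LLM, an embedding model, and a vector index; none of them is a real library API, and the fake versions exist only to make the flow visible.

```python
def hyde_retrieve(query, write_hypothetical, embed, search, k=5):
    """Retrieve with the embedding of a made-up answer instead of the query."""
    hypothetical = write_hypothetical(query)  # may be factually wrong; shape is what counts
    return search(embed(hypothetical), k)

embedded_texts = []  # records what actually got embedded

def fake_llm(query):
    return "On the dev split, the proposed model achieves 0.41 MRR@10."

def fake_embed(text):
    embedded_texts.append(text)
    return [float(len(text))]  # placeholder vector, not a real embedding

def fake_search(vector, k):
    return ["cited_paper_chunk_12", "cited_paper_chunk_3"][:k]

results = hyde_retrieve(
    "the 2019 benchmark result", fake_llm, fake_embed, fake_search, k=2
)
```

The raw query never reaches the embedding model. The hypothetical passage is thrown away after retrieval and should not be shown to the generator, since it may contain invented numbers.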

The paper shows strong gains on zero-shot retrieval tasks where no fine-tuned retriever exists. In production, HyDE is a good option when:

  • Your embedding model was trained on symmetric similarity and struggles with the query-document distribution gap.
  • You do not have enough labeled data to fine-tune a retriever for your domain.
  • Your latency budget can absorb one extra LLM call per query.

HyDE is less useful when your queries are already document-shaped (RAG over a conversation corpus, or search inside a wiki where queries are often copy-pasted titles). It shares the failure mode of every LLM-based rewrite: if the hypothetical answer drifts too far from the true topic, the retrieval comes out worse than the raw query. Measure.

Related techniques (step-back prompting, self-ask, sub-query decomposition) show up widely in operator-facing frameworks. Many of them overlap with the ReAct-style reasoning loops we pick up in Chapter 12, where the LLM iteratively decides what to retrieve next. The common thread across all of them: the retrieval stage is no longer a single query. It is a small, LLM-driven program that issues multiple queries and fuses their results.

10.6 Contextual retrieval and the lost-context problem

One problem plagues every naive retrieval pipeline. Chunks strip context from the text they came from.

Consider a passage from an annual report: “Its 3.85 million inhabitants represented a 2.1 percent increase over the prior year.” Embedded on its own, this chunk matches queries about population and growth rates. It fails for a query like “population of Berlin in 2024,” because the word “Berlin” lives in a different chunk, probably the one several paragraphs earlier that said “The following chapter reviews Germany’s five largest cities. Berlin remains the largest.” Retrieval misses. The generator never sees the relevant passage. The answer is wrong or fabricated.

The same failure bites the paper-summary task. Chunk the forty-page paper into paragraphs. The query “what dataset did the authors use” returns a chunk that reads “We used the standard train/test split and trained for 40 epochs with Adam.” Useful prose. But the dataset name lives three paragraphs earlier, in a chunk the query did not pull. The summary misses the dataset. The chunk lost its context on the way into the index. The model sees a passage about training procedure and confidently writes “the authors used the standard Adam split” because that is the shape of sentence a training-procedure chunk invites. There is no dataset name anywhere in the model’s view at that moment.

Anthropic’s Contextual Retrieval (Anthropic 2024), published in September 2024, is the cleanest articulation of the fix. Before embedding each chunk, use a small LLM to generate a 50- to 100-token context sentence that situates the chunk within its source document (“This chunk is from Section 4 of Acme Corp’s 2024 annual report, discussing Berlin’s population growth”). Prepend the context to the chunk. Embed the combined text. Store both in the index. At retrieval time, you are matching against embeddings that already know what each chunk is about within the full document. For the paper-summary task, the training-procedure chunk becomes “This chunk is from Section 4 of the paper, describing training procedure on the Common Crawl Filtered v3 dataset.” Now the query “what dataset did the authors use” pulls the right chunk because the dataset name traveled with the chunk into the index.
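A sketch of the ingest step may make the mechanics concrete. The names here are illustrative assumptions, not Anthropic's API: `situate_chunk` is a hypothetical stub for the small-LLM call, and `IndexedChunk` is an invented container that keeps the raw chunk alongside the contextualized text that would be embedded.

```python
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    raw_text: str      # the chunk exactly as it appeared in the source
    indexed_text: str  # generated context + chunk; this is what gets embedded


def situate_chunk(document, chunk):
    """Hypothetical stub for the LLM call that situates a chunk in its document.

    A real implementation would send the full document and the chunk to a
    model and ask for a short context sentence; the canned string below only
    keeps the sketch runnable.
    """
    return ("This chunk is from Section 4 of the paper, describing training "
            "procedure on the Common Crawl Filtered v3 dataset.")


def contextualize(document, chunks, situate=situate_chunk):
    """Prepend a generated context sentence to every chunk before indexing."""
    indexed = []
    for chunk in chunks:
        context = situate(document, chunk)
        indexed.append(
            IndexedChunk(raw_text=chunk, indexed_text=f"{context}\n\n{chunk}")
        )
    return indexed


paper_text = "(full text of the paper would go here)"
training_chunk = ("We used the standard train/test split and trained for "
                  "40 epochs with Adam.")
entries = contextualize(paper_text, [training_chunk])
```

The only change from a naive pipeline is that the embedder and the keyword index see `indexed_text` rather than `raw_text`, so the dataset name is searchable even though the original paragraph never mentions it.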

Anthropic reports that contextual retrieval reduces the top-20 retrieval failure rate by 35 percent with contextual embeddings alone, by 49 percent when contextual BM25 is added, and by 67 percent when a reranker is added on top, across their evaluation corpus. Those are vendor-reported numbers on a vendor-chosen benchmark. Ask Sagan’s questions before believing the promise. Is the result reproducible on independent benchmarks? Who benefits from it being true? Does a simpler explanation (better chunking without the contextualization step) fit the data? An independent 2025 evaluation (Ji et al. 2025) compared contextual retrieval against late chunking (an alternative technique we cover in Chapter 11) and found contextual retrieval stronger on relevance quality but considerably more expensive to build. You pay one LLM call per chunk, upfront, and you pay it again every time you re-index a new version of the corpus. For a corpus of a million chunks, at a few cents per call, you are out tens of thousands of dollars before the first query lands.
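The arithmetic behind that last figure is worth writing down once. The sketch below assumes a flat three cents per contextualization call, which is one illustrative reading of “a few cents,” not a quoted price; `reindex_count` is an added parameter for how many times the corpus is rebuilt.

```python
def contextualization_cost_usd(num_chunks, cents_per_call, reindex_count=1):
    """Upfront cost of one LLM call per chunk, repeated for each re-index.

    Costs are tracked in whole cents to avoid floating-point drift.
    """
    total_cents = num_chunks * cents_per_call * reindex_count
    return total_cents / 100


# Assumed figures: one million chunks at 3 cents per call.
one_pass = contextualization_cost_usd(1_000_000, 3)
# The same corpus rebuilt four times, for example after chunking changes.
four_passes = contextualization_cost_usd(1_000_000, 3, reindex_count=4)
```

Under these assumptions a single pass costs 30,000 dollars and four passes cost 120,000 dollars. The cost grows linearly in both chunk count and the number of re-index runs.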

Prompt caching on the source document changes the math somewhat. If you run the same contextualization prompt across many chunks of the same document, caching the document prefix on the LLM provider’s side reduces the per-chunk cost to the incremental contextualization tokens. For corpora that change weekly, this works. For corpora that change hourly, it does not.

The operator decision is not “which chunking technique wins.” It is “what do I pay up-front, what do I pay at query time, and can I afford both?” We work that tradeoff at more length in Chapter 11.

10.7 Where RAG breaks

No amount of retrieval quality saves you from a generator that ignores the retrieved passages, distorts them, or confidently synthesizes an answer that contradicts them. RAG is an input-staging fix. The output-stage failure modes remain.

Retriever-generator mismatch: intrinsic hallucination. You retrieve the correct passage. The model ignores it. HalluLens’s intrinsic-hallucination category (Ravichander et al. 2025) captures exactly this: output that contradicts provided context. Hand the model the passage that says “we trained for 40 epochs,” and the summary that comes back says “the authors trained for 50 epochs.” The context was right. The model guessed anyway. The cure is not more retrieval. It is prompting (Chapter 6), output constraints (Chapter 8), and evaluation (Chapter 16) that specifically measures grounding. If your RAG system returns answers not supported by the retrieved chunks, more chunks will not help.

Lost in the middle. Liu and colleagues (Liu et al. 2024) measured retrieval accuracy as a function of where the relevant information sits inside a long context and found a predictable U-shape. Models reliably use the beginning of the context. They reliably use the end. They unreliably use the middle. For a RAG pipeline that retrieves ten chunks and hands them all to the model, chunks at positions four through seven are measurably less likely to influence the answer than chunks at positions one or ten. The mitigation is brutal. Place your strongest-ranked chunks at the top and the bottom of the retrieved block. Fill the middle with progressively lower-rank chunks. Prune aggressively. Five well-ranked chunks in a compact context beat twenty chunks smeared across a 30,000-token prompt. For the paper-summary task, this means you do not dump fifty reranked chunks into the prompt and hope. You dump five, place the strongest first and last, and let the model see all five clearly.
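One way to implement that placement rule is to deal the ranked chunks alternately to the front and the back of the block. This is a sketch of one reasonable ordering, not a prescribed algorithm; the function name and the default of five chunks are assumptions for illustration.

```python
def order_for_context(ranked_chunks, keep=5):
    """Prune to the top `keep` chunks, then put the strongest at both ends.

    `ranked_chunks` must be sorted best-first. Even-indexed chunks fill the
    block from the top, odd-indexed chunks fill it from the bottom, so the
    weakest surviving chunks end up in the middle.
    """
    kept = ranked_chunks[:keep]
    front, back = [], []
    for i, chunk in enumerate(kept):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


# Labels stand in for chunks; "r1" is the top-ranked result.
ordered = order_for_context(["r1", "r2", "r3", "r4", "r5", "r6"], keep=5)
# ordered == ["r1", "r3", "r5", "r4", "r2"]
```

The best chunk opens the block, the second best closes it, and the sixth-ranked chunk is dropped entirely.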

Distractor chunks degrade accuracy. Shi and colleagues (Shi et al. 2023) showed that irrelevant information added to a prompt reduces model accuracy, even when everything needed for the correct answer is also present. The model reads everything you give it and tries to reconcile. If you retrieve ten chunks and only two are relevant, the other eight are not neutral. They actively work against the answer. This is a direct argument for aggressive reranking. The goal is not to retrieve as many potentially-relevant chunks as possible. It is to surface a small number of high-confidence chunks and stop.

No retrieval vs. bad retrieval. Related to distractor chunks but worth naming separately: empty context often beats confidently-wrong context. If the retriever returns nothing relevant, handing the model five chunks that look like they might be related (but are not) is worse than returning no context and letting the model say “I don’t know” (or fall back to a vendor-provided abstention behavior). Measure the floor: what does your system do when retrieval fails? If it fabricates with a confident tone, that is a deployment-blocker. The fix is in the generator prompt and the guardrails, not in the retriever.
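A minimal sketch of that floor follows: drop low-scoring chunks, and if nothing survives, build a prompt that asks for abstention instead of padding the context. The threshold of 0.5 and the prompt wording are assumptions for illustration. A usable threshold has to be calibrated against your own reranker’s score distribution.

```python
def select_context(scored_chunks, min_score=0.5, max_chunks=5):
    """Keep only chunks whose reranker score clears `min_score`.

    `scored_chunks` is a list of (chunk_text, score) pairs. Returns an empty
    list when nothing is confidently relevant.
    """
    strong = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    strong.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in strong[:max_chunks]]


def build_prompt(question, chunks):
    """Assemble the generator prompt, with an explicit abstention path."""
    if not chunks:
        return ("No supporting passages were retrieved. "
                "Reply that the answer is not available.\n\n"
                f"Question: {question}")
    passages = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return ("Answer only from the provided passages; say so if the answer "
            "is not present.\n\n"
            f"{passages}\n\nQuestion: {question}")


good = select_context([("chunk a", 0.9), ("chunk b", 0.2), ("chunk c", 0.7)])
empty = select_context([("chunk b", 0.2)])
fallback_prompt = build_prompt("What dataset did the authors use?", empty)
```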

Evaluation is load-bearing and usually absent. Most RAG failure modes stay invisible to the team building the system until a customer reports a specific bad answer. Building a real eval set, a few hundred questions with labeled correct passages and correct answers, is the single highest-value investment in RAG quality. We spend a full chapter on this in Chapter 16. Until that eval exists, every tuning decision you make is based on vibes.
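The retrieval half of such an eval set can be scored with very little code. The sketch below computes recall at k over labeled questions; the record layout (`question`, `relevant_ids`) and the `retrieve` callable are assumptions for illustration, and the toy retriever is a lookup table rather than a real index.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant chunks found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(eval_set, retrieve, k=5):
    """Average recall@k over an eval set of labeled questions.

    `retrieve` is any callable mapping a question to a ranked list of chunk ids.
    """
    scores = [
        recall_at_k(retrieve(item["question"]), item["relevant_ids"], k)
        for item in eval_set
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy eval set and a canned retriever, both invented for the example.
eval_set = [
    {"question": "q1", "relevant_ids": ["c1", "c2"]},
    {"question": "q2", "relevant_ids": ["c9"]},
]
canned = {"q1": ["c1", "c5", "c2"], "q2": ["c3", "c4"]}
score_at_2 = mean_recall_at_k(eval_set, canned.get, k=2)
score_at_3 = mean_recall_at_k(eval_set, canned.get, k=3)
```

In this toy run the retriever never surfaces `c9`, so no choice of k rescues the second question. A persistent zero like that is the signal that sends you back to chunking or the embedding model instead of the prompt.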

We do not fully understand why rerankers degrade as candidate pools grow. We do not fully understand why “lost in the middle” shows that particular U-shape. We do not understand why some embedding models transfer to new domains while others collapse. These are open research questions, not solved engineering problems with rough edges. Treat the recommendations in this chapter as the current state of operator folk wisdom, informed by the best public evals we have, not as settled mechanism. Where the literature admits it does not know, the book admits it too.

RAG does not replace evaluation; it moves where the bugs live. Without retrieval, bugs are in the weights and you cannot fix them. With retrieval, bugs live in your chunking, your embedding model, your reranker, your prompt assembly, and your guardrails. Every one of those is in your code, in your repo, fixable by a team that can read an eval report. That is the real win.

10.8 Closing the loop

The pipeline, start to finish, looks like this:

  1. Ingest. Chunk your corpus (the how matters, see Chapter 11), optionally prepend contextual summaries, embed each chunk, store in a vector index, store the raw text in a keyword index (Lucene, Postgres full-text, Elastic, whatever).
  2. Query. Optionally rewrite the query (HyDE, expansion, decomposition). Retrieve top 50 to 100 candidates via hybrid retrieval (dense top-K plus BM25 top-K, fused with RRF).
  3. Rerank. Run a cross-encoder reranker over the candidate pool to produce a final top 5 to 10.
  4. Assemble. Stage the reranked chunks into the prompt in a sensible order (strongest at top and bottom). Include instructions about grounding (“answer only from the provided passages; say so if the answer is not present”). Include the user query.
  5. Generate. Model does its forward pass on the assembled context.
  6. Evaluate, continuously. Ground-truth labeled eval set. Measure retrieval recall, reranker precision, answer faithfulness, abstention rate.

Every box in that pipeline is a lever. Which lever moves your quality the most depends on where your failures currently live. Evaluation tells you where your failures live. The eval infrastructure gets load-bearing fast.
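The fusion step in the Query stage is the least familiar piece for most teams, so here is a sketch of reciprocal rank fusion over a dense ranking and a BM25 ranking. The constant `k=60` is a commonly used default and an assumption here, and the chunk ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Each id scores the sum of 1 / (k + rank) over every list it appears in,
    with ranks starting at 1. Ties are broken by id so the output is stable.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_top = ["c3", "c1", "c7"]  # ranked by embedding similarity
bm25_top = ["c1", "c9", "c3"]   # ranked by keyword match
fused = reciprocal_rank_fusion([dense_top, bm25_top])
# fused == ["c1", "c3", "c9", "c7"]
```

Chunks that appear in both lists rise to the top, and no score normalization between the two retrievers is needed because only ranks are used.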

Back to the paper-summary task one last time. Ingest the paper and every paper it cites. Chunk each one, embed each chunk, store in your indexes. At summary time, generate sub-queries for each bullet point you want (dataset, method, main result, limitations, future work). For each sub-query, run hybrid retrieval, rerank the top 50 down to the top 5, assemble the top 5 into the prompt with clear instructions, generate the bullet. Evaluate the summary against a held-out set where you already know the right answers. That is the whole pipeline, every piece of this chapter in sequence.

The sharp-eyed reader may notice we have spent a chapter on retrieval without answering the question that gets asked in every 2026 LLM room: does RAG still matter now that context windows are million-token? Gemini’s 1- to 2-million-token context windows with caching, plus Claude’s long-context behavior and OpenAI’s extended context models, have made “just stuff the whole corpus in the prompt” a viable option for some workloads. A 2-million-token window is roughly the full text of War and Peace two to three times over, fed into the model on every request. The tradeoff is real. Retrieval adds engineering complexity and introduces retrieval-miss failure modes that long-context does not have. Long-context adds cost, latency, and the lost-in-the-middle attention dilution we saw in Section 3.4. Neither approach dominates. The right call depends on corpus size, query volume, latency budget, and the predictability of which parts of the corpus a given query needs.

We defer that debate properly to Section 14.6, where we can compare RAG against long-context caching in the broader frame of how state gets staged into a stateless function. For now, the operator default stands. If your corpus exceeds your context window comfortably, or your latency budget cannot tolerate million-token prefill, or you care about which chunks influenced which answer for audit or grounding, RAG is the right tool and the pipeline above is the right shape.

Anthropic. 2024. Contextual Retrieval. Anthropic News. https://www.anthropic.com/news/contextual-retrieval.
Beijing Academy of Artificial Intelligence. n.d. BGE-Reranker Model Cards (Base, Large, V2-M3). Hugging Face. https://huggingface.co/BAAI/bge-reranker-v2-m3.
Bruch, Sebastian, Siyu Gai, and Amir Ingber. 2023. “An Analysis of Fusion Functions for Hybrid Retrieval.” ACM Transactions on Information Systems. https://dl.acm.org/doi/10.1145/3596512.
Clavié, Benjamin. 2024. Rerankers: A Lightweight Python Library to Unify Ranking Methods. arXiv:2408.17344. https://arxiv.org/abs/2408.17344.
Elastic. n.d. Improving Information Retrieval in the Elastic Stack: Hybrid Retrieval. Elastic Search Labs blog. https://www.elastic.co/search-labs/blog/improving-information-retrieval-elastic-stack-hybrid.
Gao, Luyu, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. “Precise Zero-Shot Dense Retrieval Without Relevance Labels.” Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). https://arxiv.org/abs/2212.10496.
Gao, Yunfan, Yun Xiong, Xinyu Gao, et al. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997. https://arxiv.org/abs/2312.10997.
Jacob, Mathew, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, and Andrew Drozdov. 2024. Drowning in Documents: Consequences of Scaling Reranker Inference. arXiv:2411.11767. https://arxiv.org/abs/2411.11767.
Ji, Chengyuan, et al. 2025. Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation. arXiv:2504.19754. https://arxiv.org/abs/2504.19754.
Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.
Karpukhin, Vladimir, Barlas Oğuz, Sewon Min, et al. 2020. “Dense Passage Retrieval for Open-Domain Question Answering.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), 6769–81. https://arxiv.org/abs/2004.04906.
Khattab, Omar, and Matei Zaharia. 2020. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 39–48. https://arxiv.org/abs/2004.12832.
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.11401.
Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.
Ravichander, Abhilasha, Ellen Vitercik, Aditya Gupta, et al. 2025. “HalluLens: LLM Hallucination Benchmark.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. https://arxiv.org/abs/2504.17550.
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019), 3982–92. https://arxiv.org/abs/1908.10084.
Robertson, Stephen, and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, vol. 3. Now Publishers. https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf.
Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022). https://aclanthology.org/2022.naacl-main.272/.
Shi, Freda, Xinyun Chen, Kanishka Misra, et al. 2023. “Large Language Models Can Be Easily Distracted by Irrelevant Context.” Proceedings of the 40th International Conference on Machine Learning (ICML 2023). https://arxiv.org/abs/2302.00093.
Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/.
Thakur, Nandan, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. “BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models.” Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Datasets and Benchmarks Track. https://arxiv.org/abs/2104.08663.