Chunking Strategy | Working with Large Language Models | Synoros

Before a retrieval system can find the right passage of a forty-page paper, somebody has to decide where that paper gets cut into pieces. That single decision quietly determines everything that comes next.

The chunk is the slice of source text that gets embedded, stored, retrieved, and shoved into the context window. Chapter 10 built the retrieval pipeline around it: embed, index, search, rerank, hand the winners to the model. Every stage operates on chunks. Get the chunk wrong and nothing downstream rescues you. The reranker [a smaller model that re-scores retrieved passages for relevance, covered in 10.4] cannot re-interpret a mangled boundary. The generator cannot hallucinate back missing context. The embedding model [the network that converts a chunk of text into a list of numbers representing its meaning] cannot care about a topic it never saw in one piece.

This is the chapter where RAG [retrieval-augmented generation, the pattern of looking up relevant text and slipping it into the prompt, covered in Chapter 10] keeps breaking in production. Not because the embedding model is bad. Not because the vector database is slow. Because the input to everything is a set of text windows that someone, at some point, decided where to cut.

The paper-summary task makes this concrete. That forty-page research paper is open again. This time it is too long to paste in as one block. The application behind your chat interface has to split the paper into pieces, turn each piece into a vector, store them, and later fetch the pieces most relevant to whatever you ask. Where the application draws those cuts is the entire subject of this chapter. Picture the paper’s Section 4. If the chunker put the first half of Section 4 in one piece and the second half in another, with a third chunk straddling the boundary, your summary of Section 4 will be slightly wrong. You will never see the cut. You will only see the wrongness, and you will not know where it came from.

The 2024 to 2026 literature on chunking is unusually active. Two architectural moves, contextual retrieval from Anthropic and late chunking from Jina, frame themselves as solutions to the same problem by opposite means. A 2025 peer-reviewed evaluation (Ji et al. 2025) gives us our best vendor-neutral data point on what works.

11.1 Why chunking exists at all

Two constraints from earlier in the book meet here and squeeze your paper from opposite sides.

The first is the stateless function [a function that keeps no memory of its past use] from Section 1.1. Every request you send is a fresh request. Whatever you want the model to consider on this turn has to fit in the context window [how much prior text the model can see at once on a single request], because the server throws everything away once the response is sent. The second constraint comes from the retrieval pipeline in Chapter 10. The embedding model [the network that turns text into a list of numbers] was trained on passages of specific lengths, typically 128 to 512 tokens. If you hand it a longer passage, most encoders either silently truncate the input or quietly degrade in quality (Reimers and Gurevych 2019).

Both sides have a size budget. Neither budget is the size of a typical source document.

Think about what your corpus contains. A PDF of a software manual. A forty-page legal contract. Your forty-page research paper. A customer-support log stitched together out of thirty back-and-forth messages. None of these are the right shape to embed as a single object. None are the right shape to dump into the context window either. The paper runs to tens of thousands of tokens in one piece; the encoder wants something closer to four hundred.

There is a second, subtler problem with feeding the generator one long document anyway. Even when the whole paper fits, Lost in the Middle (Liu et al. 2024) showed that models use material in the middle of the window markedly less reliably than material near the start or the end. You can hand a modern model a two-hundred-thousand-token context and it will still tend to read the first few pages and the last few pages more carefully than the middle. So even if the paper fit, putting it in as one block would leave the middle sections underweighted in whatever answer the model produced.

The chunk’s job is to produce a unit of retrieval that is short enough to embed faithfully, large enough to mean something on its own, and topically coherent enough that a vector pointing at it points at one idea rather than an average of ten.

The original RAG paper (Lewis et al. 2020) chunked Wikipedia into 100-word passages. That convention haunts the field because nobody has a clean principled reason the number is 100 rather than 200 or 500, and everyone picked a number and shipped it. There is no universal correct chunk size. The correct chunk size is a property of your documents, your queries, your embedding model, and the downstream task, in that order of importance. A book about chunking that gives you a single number is lying to you.

The chunk is where retrieval quality is decided. Embedding models, rerankers, and generators are all downstream of a decision someone made about where to cut. If chunking is wrong, nothing downstream fixes it. This is the single most under-discussed stage of the RAG pipeline, and the one where most production systems silently lose accuracy.

Two consequences follow. First, chunking is not a preprocessing step you do once and forget. It is a hyperparameter [a tunable setting that controls how the system behaves] with a downstream quality signal, and you should measure it the way you measure any other hyperparameter. Second, the best chunk is not always the biggest chunk that fits. Semantic purity matters more than length. A 300-token chunk that covers one topic cleanly will outretrieve a 512-token chunk that covers one-and-a-half topics. The embedding of the longer chunk averages two concepts into one vector, and the similarity score between your query and the chunk ends up measuring that average rather than either concept on its own.

Back to the paper. Your forty-page paper has an abstract, a methods section, a results section, a discussion, and a reference list. If one chunk straddles the end of methods and the start of results, its embedding averages “what the authors did” with “what they found.” A query asking what the authors did will match it weakly. A query asking what they found will match it weakly. Two clean chunks, one ending at the methods-to-results boundary and one starting there, would each have matched its own query strongly.
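The averaging effect can be seen in a toy calculation. The sketch below assumes two-dimensional stand-in vectors (one axis per topic) and models the straddling chunk as the plain average of the two topic vectors. Real encoders do not combine topics this simply, so treat the numbers as an illustration of the direction of the effect, not its size.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 2-D "embeddings" (an assumption for illustration):
# axis 0 = "what the authors did", axis 1 = "what they found".
methods_chunk = [1.0, 0.0]
results_chunk = [0.0, 1.0]
straddling_chunk = [(m + r) / 2 for m, r in zip(methods_chunk, results_chunk)]

methods_query = [1.0, 0.0]
results_query = [0.0, 1.0]

print(cosine(methods_query, methods_chunk))     # clean chunk: 1.0
print(cosine(methods_query, straddling_chunk))  # straddling chunk: about 0.707
print(cosine(results_query, straddling_chunk))  # straddling chunk: about 0.707
```

In this toy setup the straddling chunk scores the same middling value against both queries, while each clean chunk scores at the maximum against its own query.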

That is the whole shape of the problem. The rest of this chapter is what people have tried to do about it.

11.2 Fixed-size chunking and the folk default

The simplest possible method works surprisingly well. Take the document. Slide a window of some fixed token count across it. Emit each window as a chunk. Overlap the windows so that a passage sitting on a boundary gets caught by both neighbors instead of sliced in half. That is it. No NLP, no parsing, no embedding of anything until retrieval time.

The typical operator settings, popularized by LangChain and LlamaIndex tutorials around 2023, have propagated everywhere since:

chunk_size = 512      # tokens
chunk_overlap = 50    # tokens of overlap between adjacent chunks

A 512-token window with 50 tokens of overlap means each chunk shares its first 50 tokens with the previous chunk’s last 50. If a sentence straddles the 512-token boundary, it usually appears in full in one of the two chunks, provided the part of it before the boundary fits inside the 50-token overlap. The cost is roughly 10% more storage and compute on the embedding side. Usually fine.
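A minimal sketch of the sliding window, assuming the document has already been tokenized into a list. The function name `fixed_size_chunks` is hypothetical, and plain integers stand in for real tokenizer output so the overlap is easy to inspect.

```python
def fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50):
    """Slide a fixed window over a token list, overlapping adjacent windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


# Stand-in "tokens": integers instead of real tokenizer ids.
tokens = list(range(1200))
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])           # window sizes; the last one is shorter
print(chunks[0][-50:] == chunks[1][:50])  # adjacent windows share 50 tokens
```

With a stride of 462 tokens (512 minus 50), each token costs about 512 / 462 ≈ 1.11 embedded tokens, which is where the roughly 10% overhead comes from.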

Applied to your paper, this means cutting the paper into eighty or so 512-token chunks, each one about two paragraphs’ worth of text, each one sharing a short strip with its neighbors. No respect for the section breaks. No respect for where the methods end and the results begin. Just a rolling window dropping chunks out the back.

Two reasons this baseline keeps winning more often than it should. First, at 256 to 512 tokens most modern encoders sit inside their training distribution and produce clean embeddings. Second, fixed-size chunking is trivially reproducible. You can explain it, test it, and diff it across experiments. Anything more clever hides knobs that introduce variance across runs.

Where fixed-size chunking fails is obvious from the mechanism. A paragraph break, a section header, a code fence: none of these are respected. If the 512-token window lands mid-code-block in your paper’s methods section, the chunk contains half a function plus the closing prose of the preceding paragraph, and the embedding averages “what this code does” with “what the previous paragraph argued.” That is exactly the semantic smearing we just named as the enemy of retrieval quality.

The first fix is cheap. Respect the most obvious structural boundaries before you fall back to fixed-size. This is what LangChain calls RecursiveCharacterTextSplitter. It tries to cut at paragraph breaks first, then line breaks, then spaces between words, before resorting to raw character count when nothing larger fits. (By default it measures chunk size in characters, not tokens.) It is still fundamentally fixed-size, but with the dial turned slightly toward structural awareness. The folk wisdom that “recursive character splitting is the sensible default” is mostly right. It is cheap, it does not embarrass itself on well-formed text, and it is the baseline any more elaborate method has to beat.
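The recursive idea can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, length is measured in characters, there is no overlap, and the separator list is assumed to end with the empty string as the last-resort fallback.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", " ", "")):
    """Split at the coarsest separator that keeps every piece under max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: raw character windows.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)  # separator not present
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "First paragraph, short.\n\nSecond paragraph.\n\n" + "word " * 60
for chunk in recursive_split(sample, max_chars=80):
    print(len(chunk), repr(chunk[:30]))
```

The two short paragraphs are merged into one chunk because together they fit the budget, while the long run of words is only broken at spaces once no coarser separator applies.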

Kamradt’s 5 Levels of Text Splitting (Kamradt 2024) is the most widely cited practitioner taxonomy here, and it is a good read because it is honest about the tradeoffs. Fixed-size is level one. Recursive with structural preferences is level two. Everything above is where the research papers live.

11.3 Semantic and structure-aware chunking

Fixed-size chunking ignores meaning. Semantic chunking tries to respect it. The core idea: embed every sentence in the document, measure how similar each sentence is to the one next to it using cosine similarity [a number from -1 to 1 measuring how closely two embedding vectors point in the same direction, covered in 10.2], and cut the document wherever that similarity drops sharply. The assumption is that a topic shift shows up as a sudden drop in similarity between neighbors, and the chunks that come out are topically coherent blocks.

What does this mean for your paper? Suppose the last sentence of the methods section is “we trained for forty epochs with a batch size of sixty-four,” and the first sentence of the results section is “accuracy on the held-out test set reached 87.3 percent.” Those two sentences, despite standing next to each other on the page, talk about different things. An embedding-space similarity score between them will be low compared to the scores between consecutive sentences inside the methods section. A semantic chunker notices the dip and cuts there. You end up with a chunk that stops cleanly at the end of methods and a new chunk that begins with results. That is the dream.

In practice, this is expensive. Every sentence has to be embedded just to compute where the splits go, before you have any retrieval chunks at all. Your forty-page paper has maybe 1,500 sentences. Now multiply by a corpus of a thousand papers and you are doing 1.5 million sentence-embeddings at ingest, most of which get thrown away (the per-sentence embeddings used to find the cuts are not the same as the per-chunk embeddings you will store). If you are using a cloud embedding API, this doubles or triples your ingest cost.

There is a second, subtler problem. Sentence-to-sentence similarity is a noisy signal. Two adjacent sentences in the same paragraph might have a low similarity score because one introduces a new subtopic while staying within the broader topic. Worse, the specific threshold that defines a “drop” is a hyperparameter that nobody has a principled way to set. Operators typically pick the 90th or 95th percentile of similarity-drop magnitudes across the document. Reasonable. Not theoretically grounded.
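A minimal sketch of threshold-based semantic chunking, under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real sentence encoder, the function names are hypothetical, and the percentile is computed with a simple nearest-rank rule. The point is the control flow (embed neighbors, measure drops, cut at the largest ones), not the quality of the vectors.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (NOT a real encoder)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, percentile=90):
    """Cut wherever the distance between neighbouring sentences is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)]
    vectors = [embed(s) for s in sentences]
    # distances[i] is the drop between sentence i and sentence i + 1.
    distances = [1.0 - cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    ranked = sorted(distances)
    # Nearest-rank percentile: the threshold itself is a tunable setting.
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    threshold = ranked[rank - 1]
    chunks, current = [], [sentences[0]]
    for i, distance in enumerate(distances):
        if distance >= threshold and distance > 0:
            chunks.append(current)  # topic shift: close the current chunk
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks


sentences = [
    "we trained the model for forty epochs",
    "we trained the model with batch size sixty-four",
    "we trained the model on eight gpus",
    "accuracy reached 87.3 percent on the test set",
    "accuracy on the validation set reached 85 percent",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

On this toy input the largest drop falls between the third and fourth sentences, so the methods-like sentences and the results-like sentences land in separate chunks. Changing `percentile` changes how many cuts are made, which is exactly the hard-to-justify knob described above.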

The 2024 to 2025 research line has tried to do better. Zhao et al.’s Meta-Chunking (Zhao et al. 2024) uses an LLM itself to identify split points based on “logical perception.” In practice that means running a small model over sliding windows and asking, for each sentence, whether it continues the current topic or starts a new one. Liu et al.’s HiChunk (Liu et al. 2025) takes the hierarchical route. It clusters sentences at multiple granularities and builds a tree of chunks, so retrieval can pull a sentence, a paragraph-equivalent, or a whole section depending on how specific the query is. Both methods are real improvements over naive similarity-threshold segmentation on their reported benchmarks (NarrativeQA, QuALITY, QASPER). Both cost an order of magnitude more at ingest.

Structure-aware chunking is the other branch. Rather than learning the boundaries, use the ones the document already has. Markdown has headings and paragraphs. HTML has DOM [document object model, the tree structure a browser uses to represent a webpage] structure. PDFs have page breaks and often section boundaries from the table of contents. Code has function definitions and class scopes. Respect these and you get chunks a human would recognize as natural units, for free. The practical recipe: use the document’s inherent structure first, and fall back to learned or similarity-based segmentation only where structure runs out.

Your paper is a helpful case. Most research PDFs, if the PDF-to-text extractor did its job, carry explicit section headers: Introduction, Methods, Results, Discussion, References. A chunker that splits on those headers and then enforces a maximum token count inside each section is often the best chunker you can buy for technical documentation, and it costs almost nothing at ingest. A query asking about methodology retrieves the methods chunks and nothing else. A query asking about findings retrieves the results chunks. The work the semantic chunker was trying to do by computing 1,500 sentence similarities was already done for you by the author when they typed ## Methods at the top of section three.
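A sketch of the header-first recipe, assuming the extractor produced Markdown-style headings. The function name `split_by_headings` is hypothetical, whitespace-separated words stand in for tokens, and any line starting with `#` is treated as a heading, so a real version would need to skip code blocks that contain `#` comments.

```python
import re


def split_by_headings(markdown_text, max_words=400):
    """Split on Markdown headings, then cap each section at max_words."""
    sections, title, body = [], "(preamble)", []
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            sections.append((title, body))  # close the previous section
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    sections.append((title, body))

    chunks = []
    for section_title, lines in sections:
        words = " ".join(lines).split()
        # Oversized sections fall back to fixed-size windows inside the section,
        # so no chunk ever crosses a heading.
        for start in range(0, len(words), max_words):
            chunks.append({
                "section": section_title,
                "text": " ".join(words[start:start + max_words]),
            })
    return chunks


paper = "## Methods\nWe trained for forty epochs.\n## Results\nAccuracy reached 87.3 percent.\n"
for chunk in split_by_headings(paper):
    print(chunk["section"], "->", chunk["text"])
```

Keeping the section title alongside each chunk is a design choice in this sketch: it lets a later stage filter or label chunks by section without re-parsing the document.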

“Semantic chunking is better than fixed-size” is a folk claim ahead of the evidence. Ji et al.’s 2025 replication (Ji et al. 2025) found that well-tuned fixed-size chunking with overlap is competitive with most semantic-chunking methods on standard benchmarks, while costing 10 to 100 times less at ingest. Semantic chunking wins cleanly only when document structure is genuinely irregular (stitched customer-support transcripts, OCR [optical character recognition, the process of extracting text from images of pages] output, scraped HTML where the DOM has been flattened). Measure before paying for it.

This is a place to run the skeptical checklist in front of the reader. “Semantic chunking beats fixed-size” is a claim in wide circulation. Is it reproducible? Ji et al. say: not cleanly, outside specific regimes. Does a simpler explanation fit? Yes: people measured semantic chunkers on corpora where the document structure was already clean, and credited the chunker for work the document had already done. Who benefits from it being true? Vendors of semantic-chunking tooling, and operators who have invested engineering effort in building it. The claim survives under controlled conditions. It does not survive as a default.

The editorial take: if your source documents have structure (Markdown, clean HTML, well-formatted PDFs, structured databases), use it. If they do not, start with recursive fixed-size at 256 to 512 tokens with a 50-token overlap and escalate to semantic segmentation only when you have measured a specific failure that motivates it. The assumption that meaning-aware chunking is always better is one of the more expensive false defaults in RAG engineering.

11.4 Late chunking

Here the story gets interesting. The methods above all share a sequence: cut the document into chunks first, then embed each chunk independently. The chunk embedding comes from the chunk’s text alone. Neighboring chunks do not inform it. Distant context does not inform it.

Imagine page 14 of your paper says “Its GDP is 3.2 trillion dollars.” In a chunk-then-embed pipeline, the chunk containing that sentence has no idea what “it” refers to. The antecedent (“Germany,” named on page 6) lives in a different chunk, embedded at a different time, with no connection between them. The page-14 embedding sees a handful of tokens about GDP and nothing about Germany. A query about Germany’s economy is unlikely to retrieve page 14.

Günther et al.’s Late Chunking (Günther et al. 2024), published by Jina AI in September 2024 and updated through 2025, inverts the sequence. Encode the full document with a long-context embedding model first. Then, after the document has been processed as a whole, pool [combine many vectors into one by averaging them together] the token-level embeddings into chunks. The chunk embeddings now come from token representations that saw the entire document during encoding, which means each chunk’s embedding is already context-aware.

“Its GDP is 3.2 trillion dollars” comes out with a vector that points at Germany. The model saw “Germany is the largest economy in Europe” back on page 6, and that context leaked into the token-level representations through attention, the 2015 mechanism from Chapter 2 that lets the model look back at every input word while generating each output word. By the time the pooler averages the token embeddings for page 14’s chunk, those tokens already carry a trace of every other token in the document, including the ones on page 6 that named Germany.
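The pooling half of late chunking is small enough to sketch directly. The code below assumes the token-level embeddings already exist, produced by one full-document pass through a long-context encoder; that encoder call is not shown, and the tiny hand-written vectors are hypothetical stand-ins for its output. The function names are illustrative.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def late_chunk(token_embeddings, spans):
    """Pool document-level token embeddings into one vector per chunk.

    token_embeddings: one vector per token, taken from encoding the WHOLE
        document in a single pass (assumed input, not computed here).
    spans: (start, end) token offsets per chunk, end exclusive, each non-empty.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Hypothetical encoder output: 6 tokens, 2 dimensions each.
token_embeddings = [
    [1.0, 0.0], [0.8, 0.2], [0.6, 0.4],
    [0.4, 0.6], [0.2, 0.8], [0.0, 1.0],
]
spans = [(0, 3), (3, 6)]  # chunk boundaries are applied after encoding
chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors)
```

The only difference from chunk-then-embed is where the token vectors come from: here every token vector was computed with the whole document in view, and the chunk boundaries are applied afterwards.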

The prerequisite is a long-context embedding model. Jina’s v3 (8K context), Nomic’s v1.5 (8K context), and BGE-M3 (8K context) are the usual candidates. Shorter-context encoders cannot do late chunking, because “the full document” has to fit in one encoding pass. For documents larger than the encoder’s context, you can window-with-overlap at the encoding stage and pool per-window, but that mostly reintroduces the problem you were trying to solve.

The attractive property of late chunking is that it is retrofittable. If you have an existing RAG pipeline, you do not need to change how you store, query, or rerank. You just change the embedding step from “chunk, then embed each” to “embed full document, then pool per chunk.” The index schema stays the same. The retrieval code stays the same. Compared to contextual retrieval (next section), which rewrites the chunks themselves, late chunking is architecturally light-touch.

The drawbacks are worth naming. First, the encoder’s context window is a hard ceiling. A 200,000-token document (roughly the length of Crime and Punishment) will not fit in any current embedding model. Second, the pooling step is a lossy operation that does not fully preserve the context the encoder picked up. Mean-pooling over the token-level embeddings that fall within a chunk’s span throws information away. The chunk embedding is better than it would be without late chunking, but it is not as rich as if you gave the full document’s embedding a dedicated slot. Third, your evaluation has to cover the case where the document is near the encoder’s context limit, because that is where late chunking’s behavior becomes subtle.

Back to your paper. Page 14 says “we therefore adopt this approach.” This approach was defined on page 6. With independent chunk embeddings, the page-14 chunk has no idea what this refers to. Its vector captures “we therefore adopt something” and nothing more specific. With late chunking, the encoder’s attention already bridged page 6 to page 14 during the full-document pass, so page 14’s chunk embedding carries a trace of page 6’s proposal. A query like “what approach did the authors adopt” now retrieves page 14’s chunk reliably, because the chunk’s vector knows what “this” meant.

On reported benchmarks, late chunking gives modest but real improvements over naive chunk-then-embed on standard retrieval tasks. The gains are single-digit percentages on MTEB-like evaluations, larger on benchmarks specifically designed to stress coreference [when a pronoun or phrase refers back to something named earlier] and cross-chunk context. If your corpus has a lot of “it,” “they,” “the system,” or other references that resolve across paragraph boundaries, late chunking is the cheapest way to buy a quality uptick without rewriting your indexing pipeline.

11.5 Contextual retrieval

The other big 2024 move, Anthropic’s contextual retrieval (Anthropic 2024), attacks the same problem from the opposite direction. Instead of encoding the full document and pooling, generate a short per-chunk context string with an LLM, prepend it to the chunk, and embed the combined result.

The context is 50 to 100 tokens, produced by handing the LLM the full document plus the chunk to be contextualized and asking for a short summary of where this chunk fits. For your research paper, a contextualized chunk might look like:

This chunk is from the Methods section of Ravichander et al. 2025,
specifically describing the HalluLens benchmark's extrinsic evaluation
protocol. The chunk contains the specific prompt templates and the
associated scoring rubric.

The chunk stored in the index becomes the generated context plus the original chunk text. When a query comes in, the retrieval embedding gets computed over this augmented version. A query like “how does HalluLens score extrinsic hallucinations” now retrieves that chunk reliably, because the augmented chunk explicitly contains the section name, the paper’s authors, and the phrase “scoring rubric.” The raw chunk might have said only “we use the rubric defined in Table 3,” which an embedding cannot match to the query, because “Table 3” and “extrinsic hallucination scoring” do not look alike in embedding space unless something makes them look alike.

Anthropic reports a 35% reduction in top-20 retrieval failures from contextual embeddings alone on their evaluation corpus, 49% when contextual embeddings are combined with contextual BM25 [a keyword-matching scoring algorithm from classical information retrieval, covered in 10.3] in a hybrid search, and 67% when a reranker is added on top. These numbers come from the vendor, on the vendor’s corpus, which means we should hold them lightly. Run the skeptical checklist against them. Is the claim reproducible? Directionally, yes: Ji et al. (Ji et al. 2025) and the clinical replication (multiple authors 2025) both show gains in the same direction, though with smaller magnitudes. Does a simpler explanation fit? The mechanism is clean enough that “the chunks got more retrievable because they now contain the words people search for” is the whole story. Who benefits from it being true? Anthropic, whose API sells the LLM calls that generate the contexts. The claim survives, slightly hedged, but the specific numbers (49%, 67%) are vendor-on-vendor-corpus numbers and should not be taken as universal.

The cost lives at ingest time and is significant. Generating 50 to 100 tokens of context per chunk, with the full document in the prompt, costs roughly one prompt-cached read of the document plus one small generation per chunk. With Anthropic’s prompt caching, the document gets re-used across its many chunks, so the amortized cost per chunk stays close to the generation cost of a short string plus cache reads. For a 10,000-chunk corpus you pay for 10,000 LLM generations at ingest. At 2026 API rates this runs a few dollars per thousand chunks with prompt caching, so tens of dollars for a mid-sized corpus. Real money, not budget-breaking, and a one-time cost.

The implementation gotcha worth naming: the generated context has to be produced with the full document in scope, not just the chunk itself. That is the whole point. A context produced from the chunk alone would say “This chunk contains revenue figures,” which is useless. A context produced from the full document says “This chunk is Acme’s 2023 Enterprise segment revenue breakdown,” which is the retrieval-usable information.
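The shape of the ingest step can be sketched without committing to any particular LLM API. In the code below, `generate_context` is a hypothetical callable you would implement around your own model client, and `fake_generate_context` is a stub so the sketch runs offline. The one structural requirement it encodes is the one named above: the generator receives the full document, not just the chunk.

```python
def contextualize_chunks(document, chunks, generate_context):
    """Prepend an LLM-written context string to each chunk before embedding.

    generate_context(document, chunk) -> str is a hypothetical callable that
    wraps whatever LLM API you use. It must be given the full document.
    """
    augmented = []
    for chunk in chunks:
        context = generate_context(document, chunk).strip()
        augmented.append({
            "original": chunk,                    # what the generator reads later
            "to_embed": f"{context}\n\n{chunk}",  # what the encoder and index see
        })
    return augmented


def fake_generate_context(document, chunk):
    # Stub standing in for an LLM call; it only uses the document's first line.
    title = document.splitlines()[0]
    return f"This chunk is from the document titled '{title}'."


document = (
    "Acme 2023 Annual Report\n"
    "Enterprise segment revenue breakdown.\n"
    "Consumer segment revenue breakdown."
)
chunks = document.splitlines()[1:]
augmented = contextualize_chunks(document, chunks, fake_generate_context)
print(augmented[0]["to_embed"])
```

Storing the original chunk next to the augmented text is a choice made in this sketch, so that the generated context can be inspected or dropped at generation time; whether to show the context to the generator is left open here.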

Late chunking and contextual retrieval are opposite solutions to the same problem. Both attack the same weakness: chunk embeddings lose document-level context. Late chunking keeps the chunks intact and makes the embeddings smarter by encoding the full document first. Contextual retrieval keeps the embedding model unchanged and makes the chunks richer by prepending LLM-generated context. Late chunking is cheaper at ingest but requires a long-context encoder. Contextual retrieval costs more at ingest but works with any encoder and produces chunks a human can inspect.

The same family of operation applied to two different substances: both methods reach into document-level context and carry it down into chunk-level representations. Late chunking does it through attention, inside the encoder, in the language of vector representations. Contextual retrieval does it through generation, outside the encoder, in the language of plain English. The shape of the move is the same. The substance differs.

Which to pick? Ji et al.’s 2025 evaluation (Ji et al. 2025) compared them directly on RAG benchmarks and found contextual retrieval generally better on downstream answer quality, and late chunking generally cheaper and sufficient for many workloads. The peer-reviewed clinical-decision-support replication (multiple authors 2025) reached a similar split: contextual retrieval wins on relevance, late chunking wins on efficiency. Both are real options in 2026. A pragmatic default is to start with late chunking (cheaper, easier retrofit) and upgrade to contextual retrieval if your accuracy measurements motivate the ingest-cost premium.

11.6 Chunk size, lost-in-the-middle, and the budget tradeoff

Chunk size and retrieval count are linked tradeoffs, and the link gets set by the generator’s effective context window, not the advertised one.

Small chunks mean many chunks. If you chunk your paper at 256 tokens and retrieve the top 10, the generator sees 2,560 tokens of retrieved context plus the system prompt, conversation history, and query. The chunks are topically pure, but there are ten of them, and the generator has to weight each one. Lost in the Middle (Liu et al. 2024) tells us what happens here. The generator leans heavily on chunks 1 and 10 (first and last in the context) and underweights chunks 4 through 7. Ten small chunks maximize recall at the cost of middle-of-context attention.

Large chunks mean few chunks, with more total irrelevant text per chunk. If you chunk at 2,048 tokens and retrieve the top 3, the generator sees roughly six thousand tokens (3 × 2,048 = 6,144), more than twice the small-chunk budget above, organized as three long passages. Each passage contains the relevant answer surrounded by a lot of context that is not needed for this query. Shi et al.’s Large Language Models Can Be Easily Distracted by Irrelevant Context (Shi et al. 2023) demonstrated, under controlled conditions, that appending distractor content to a prompt measurably lowers answer accuracy even when the relevant content is still present and unchanged. Large chunks carry more distractor mass per retrieved passage.

Neither extreme wins. The operator sweet spot, consistent across multiple 2024 to 2025 evaluations, is 4 to 10 chunks in context for single-turn question answering, with individual chunk sizes in the 256 to 512 token range for most corpora. To make the budget concrete: 10 chunks at 512 tokens each is roughly the length of a long newspaper feature, or about one short story from a collection. That fits comfortably inside most modern context windows with room left for the question, system prompt, and response. Inside that band, tuning the specific size and count for your corpus is a measurement problem, not a theoretical one.
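The budget arithmetic is simple enough to keep in a helper. The sketch below uses the chunk sizes and counts from this section; the context-window size and the reserved-token figure in the usage lines are hypothetical values chosen for illustration, not recommendations.

```python
def retrieved_context_tokens(chunk_size, top_k):
    """Tokens of retrieved text the generator sees: chunk size times count."""
    return chunk_size * top_k


def fits_budget(chunk_size, top_k, context_window, reserved_tokens):
    """reserved_tokens covers system prompt, history, question, and response."""
    return retrieved_context_tokens(chunk_size, top_k) + reserved_tokens <= context_window


for chunk_size, top_k in [(256, 10), (512, 10), (2048, 3)]:
    print(chunk_size, top_k, retrieved_context_tokens(chunk_size, top_k))
# retrieved tokens for the three settings: 2560, 5120, 6144

# Hypothetical 8,192-token window with 2,000 tokens reserved for everything else.
print(fits_budget(512, 10, context_window=8192, reserved_tokens=2000))
```

The check says nothing about whether the generator will use the middle chunks well; it only confirms the retrieved text physically fits alongside the rest of the prompt.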

Two exceptions worth naming. First, if your queries are aggregative (summarize this product line across quarters, give me the main arguments of this paper), more chunks help up to the limit where lost-in-the-middle dominates. The right move is reranking aggressively, so only the truly relevant top-N make it in, rather than retrieving fewer chunks to begin with. Second, if your queries are pointed and factual (what was Acme’s Q3 2024 revenue, what sample size did the authors of the paper use), fewer chunks help because each additional chunk is almost certainly distractor mass. For factual-QA workloads, top-3 with a strong reranker beats top-10 unranked.

The paper-summary task sits closer to the aggregative end. You want the model to produce a summary that touches every major section. That pushes you toward more chunks, because a summary of a forty-page paper built from three chunks is a summary of three pieces of the paper and nothing else. But ten chunks at 512 tokens each, fed in without a reranker, leave the middle chunks ignored. The right architecture is “retrieve 20, rerank to 8, feed the 8 to the generator,” which buys you broad coverage at the retrieval stage and tight coverage at the generation stage.
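The "retrieve 20, rerank to 8" shape can be sketched as two sorts with different scorers. Everything named here is an assumption for illustration: `retrieve_score` stands in for vector or keyword search, `rerank_score` stands in for a cross-encoder reranker, and the two toy scorers only count shared words so the example runs without any model.

```python
def retrieve_then_rerank(query, chunks, retrieve_score, rerank_score,
                         pool_size=20, final_k=8):
    """A cheap scorer builds a wide candidate pool; a costlier scorer trims it."""
    pool = sorted(chunks, key=lambda c: retrieve_score(query, c), reverse=True)[:pool_size]
    reranked = sorted(pool, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:final_k]


def word_overlap(query, chunk):
    """Toy first-stage scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def query_coverage(query, chunk):
    """Toy reranker: fraction of the query's words that the chunk covers."""
    query_words = set(query.lower().split())
    return word_overlap(query, chunk) / len(query_words) if query_words else 0.0


chunks = [f"filler passage number {i}" for i in range(30)]
chunks.append("the authors adopt a contrastive training approach")
top = retrieve_then_rerank("what approach did the authors adopt", chunks,
                           word_overlap, query_coverage)
print(len(top), top[0])
```

Keeping `pool_size` and `final_k` as separate parameters mirrors the point of the paragraph: how many candidates the reranker sees and how many chunks the generator sees are two different dials.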

Anthropic’s contextual retrieval writeup (Anthropic 2024) uses top-20 with a reranker on their benchmark. Notice that “many chunks plus a reranker” is a different architecture from “few chunks with no reranker.” The choice of how many to pre-rank is a separate dial from how many to feed the generator. The generator should almost always see 4 to 10. The pre-ranker’s input pool can be much larger, because its job is to recover recall without paying the lost-in-the-middle tax at generation time.

Tune chunk count to the generator, not to the retriever. The retriever can surface 50 candidates cheaply. The generator can only attend to about 4 to 10 before lost-in-the-middle degrades its answer. If you are feeding the generator more than 10 chunks, you are paying tokens for recall the generator cannot use. Put a reranker in between and feed the generator the top 4 to 10.

The interaction with chunk size comes back through the embedding model. Smaller chunks (256 tokens) give the embedding a purer topic, but the model has less text to compute its vector from, and embeddings over short text are noisier. Some encoders were trained with a minimum passage length in mind, and passages below that length retrieve worse because the embedding is dominated by boilerplate. If you are using a named encoder (BGE-large, Jina v3, voyage-3), read its model card and note the training distribution before you pick a chunk size. A 256-token chunk on a model trained on 512-token passages is a 256-token chunk on the wrong model.

11.7 Evaluating chunking, honestly

How do you know your chunking is working? The temptation is to measure retrieval quality in isolation (recall@K [the fraction of queries where a correct chunk appears in the top K results], MRR [mean reciprocal rank, which rewards correct chunks for ranking higher], NDCG [normalized discounted cumulative gain, a graded-relevance ranking metric]) and treat those numbers as a proxy for downstream answer quality. This is the wrong measurement for RAG.

Retrieval-only metrics reward the chunker for putting the right chunks in the top-K. They do not penalize the chunker for putting distractor chunks also in the top-K. They do not reward the chunker for chunks that are easier for the generator to use. A chunk that contains the answer but is surrounded by 500 tokens of irrelevant legal boilerplate scores the same as a clean 100-token chunk that isolates the answer. The second chunk is strictly better for RAG. Retrieval metrics cannot tell them apart.
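For reference, the retrieval-only metrics are easy to compute, which is part of their appeal. The sketch below follows the definitions given above (recall@K as the fraction of queries with at least one correct chunk in the top K, MRR as the mean of one over the rank of the first correct chunk). The function names and the chunk ids in the usage lines are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if any(cid in relevant_ids for cid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: non-empty list of (ranked_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


runs = [
    (["c7", "c2", "c9"], {"c2"}),  # correct chunk ranked second
    (["c4", "c1", "c3"], {"c8"}),  # correct chunk never retrieved
]
print(evaluate(runs, k=2))
```

Note what the inputs are: chunk ids and nothing else. Neither function ever looks at the chunk text, so a padded 600-token chunk and a clean 100-token chunk with the same id position score identically.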

Here is what the right measurement looks like on your paper. Fix the embedding model, the reranker, the generator, and the prompt template. Build a question set, thirty or so real questions about the paper, with reference answers written by a human who read the paper carefully. Now vary only the chunking strategy. Run the whole pipeline, paper through chunks through retrieval through generation, with fixed-size chunks. Record the answers. Do it again with semantic chunks. Again with late chunking. Again with contextual retrieval. For each run, have a judge (usually an LLM-as-judge [using a language model to score another model’s outputs, covered in Chapter 16], because judging thirty answers by hand is tedious and judging a thousand is infeasible) score each answer against the reference. Compare.

That comparison is expensive. It costs more than retrieval-only evaluation, because you need a labeled question set and a judge. It is also the only measurement that catches the full chunking effect, because it is the only measurement that includes the generator’s reading of the chunks.

The Ji et al. replication (Ji et al. 2025) runs this kind of evaluation across chunking strategies and is the closest peer-reviewed reference for vendor-neutral numbers. Their headline finding: the effect size of chunking choice is real but smaller than the effect size of retriever choice or reranker choice. Translated: do not spend weeks tuning your chunker if your embedding model is wrong for your domain. Fix the big knobs first.

There is a gap in the literature worth naming plainly. No single peer-reviewed paper we could find as of April 2026 runs a fully end-to-end comparison of fixed-size, recursive, semantic, late, and contextual chunking on the same corpus with the same generator and first-party measurements published in detail. We do not fully know how these methods compare at the level of specificity an operator wants. Ji et al. comes closest but does not cover every strategy. The clinical-decision-support replication (multiple authors 2025) is rigorous on one clinical domain but may not generalize. This is a genuine first-party benchmark opportunity for operators. A systematic evaluation on a fixed corpus with open numbers would be a durable contribution to the field.

If you are shipping RAG at any scale, you need a chunker eval harness. Build one that ingests the same corpus under multiple chunking strategies, runs the same retrieval and generation pipeline against each, and reports downstream answer accuracy on a held-out question set. The harness becomes the artifact that tells you whether your next engineering sprint should touch chunking at all. Without it you are guessing.
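One possible shape for such a harness, as a sketch rather than a reference implementation. The `retrieve`, `generate`, and `judge` callables are hypothetical slots for your own retriever, generator, and LLM-as-judge. The toy stand-ins below (word overlap, string concatenation, substring match) exist only so the example runs without a model, and the document and question are invented.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Chunker = Callable[[str], List[str]]

def run_chunker_eval(
    document: str,
    questions: List[Tuple[str, str]],               # (question, reference answer)
    chunkers: Dict[str, Chunker],
    retrieve: Callable[[str, List[str], int], List[str]],
    generate: Callable[[str, List[str]], str],
    judge: Callable[[str, str], float],
    top_k: int = 8,
) -> Dict[str, float]:
    """Vary only the chunker; everything downstream stays fixed."""
    results = {}
    for name, chunk in chunkers.items():
        chunks = chunk(document)
        scores = []
        for question, reference in questions:
            context = retrieve(question, chunks, top_k)
            answer = generate(question, context)
            scores.append(judge(answer, reference))
        results[name] = mean(scores)
    return results

# --- Toy stand-ins (assumptions, not real components) ---
def overlap_retrieve(question: str, chunks: List[str], k: int) -> List[str]:
    q = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))[:k]

def echo_generate(question: str, context: List[str]) -> str:
    return " ".join(context)                        # a real generator goes here

def substring_judge(answer: str, reference: str) -> float:
    return 1.0 if reference.lower() in answer.lower() else 0.0

def by_sentence(text: str) -> List[str]:
    return text.split(". ")

def by_four_words(text: str) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + 4]) for i in range(0, len(words), 4)]

doc = "The method uses a contrastive loss. The dataset has ten thousand papers."
qa = [("which loss does the method use", "contrastive loss")]

results = run_chunker_eval(
    doc, qa,
    {"sentence": by_sentence, "four_word": by_four_words},
    overlap_retrieve, echo_generate, substring_judge, top_k=1,
)
print(results)  # {'sentence': 1.0, 'four_word': 0.0}
```

The four-word chunker cuts between "a" and "contrastive", so the chunk that best matches the question no longer contains the answer. That is the kind of failure only an end-to-end score surfaces.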

11.8 Defaults, recommendations, and honest editorial calls

Pulling this together into what to do Monday morning.

If you have structured documents (Markdown, clean HTML, well-formatted PDFs with recoverable section structure, code with function or class boundaries, databases with record boundaries): use the structure. A chunker that splits on H2 or H3 headings for Markdown, on DOM boundaries for HTML, on function definitions for code, with a fallback to 512-token recursive fixed-size inside sections that are too long, gives you clean chunks for free. This is the best cost-to-quality ratio in the field. A research paper, if its PDF-to-text extraction respects section headers, falls into this category. Split on Introduction, Methods, Results, Discussion, References, enforce a 512-token ceiling inside each section, and ship.
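A rough sketch of the Markdown case, under two loudly stated assumptions: whitespace-separated words stand in for tokenizer tokens, and the fallback inside an oversized section is a plain fixed window rather than a full recursive splitter. The function name `split_on_headings` is illustrative, not taken from any library.

```python
import re
from typing import List

def split_on_headings(markdown: str, max_tokens: int = 512) -> List[str]:
    """Split at H2/H3 headings, then cap each section at max_tokens words.

    Assumption: "tokens" are whitespace-separated words. Swap in your
    embedding model's tokenizer for real counts.
    """
    # Zero-width split just before any line that starts with "## " or "### ".
    sections = re.split(r"(?m)^(?=#{2,3} )", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue                      # skip the empty piece before the first heading
        # Fallback for long sections: fixed windows of at most max_tokens words.
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

sample = "## Introduction\nhello world\n## Methods\n" + "w " * 10
chunks = split_on_headings(sample, max_tokens=4)
print(len(chunks), chunks[0])
```

Rejoining on single spaces discards line breaks inside a section. If your generator benefits from the original layout, keep character offsets instead of rebuilt strings.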

If you have unstructured or messy documents (stitched support transcripts, OCR output, scraped text where DOM has been flattened, plain-text corpora): start with recursive fixed-size at 512 tokens with 50-token overlap. If you measure a specific failure (coreference across boundaries, topic mixing inside chunks), consider semantic segmentation, or better, late chunking with a long-context encoder.
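The fixed-size-with-overlap part of that default can be sketched as a sliding window over a token list. This deliberately omits the recursive separator logic (trying paragraph, then line, then sentence breaks before falling back to raw windows), and it assumes you already have tokens from your own tokenizer. The name `sliding_window` is illustrative.

```python
from typing import List, Sequence

def sliding_window(tokens: Sequence, size: int = 512, overlap: int = 50) -> List[Sequence]:
    """Fixed-size windows where each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap                 # 462 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # this window reached the end; stop here
            break
    return chunks

tokens = list(range(1000))                # stand-in for real token ids
windows = sliding_window(tokens)
print([len(w) for w in windows])          # [512, 512, 76]
```

The early `break` avoids emitting a final chunk that consists only of tokens the previous chunk already covered.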

If coreference across chunks is your bottleneck (queries that refer to entities introduced in earlier chunks, “it” pronouns, anaphora [any linguistic reference back to something previously named] generally): late chunking is the cheapest retrofit. Swap your embedding step, keep everything else.

If retrieval precision is your bottleneck (the right chunk retrieves with the wrong neighbors, or retrieval misses it entirely in favor of a topically-similar-but-wrong chunk): contextual retrieval is worth the ingest cost. Measure the improvement on your downstream evaluation before committing to it permanently, because ingest costs add up.

Regardless of chunking choice, retrieve more chunks than you hand to the generator. Pull the top 50 to 100 candidates with a fast biencoder [an embedding-based retriever that scores queries and documents independently], rerank the top 20 to 30 with a cross-encoder [a reranker that scores a query and a candidate chunk together, more accurate but slower], and feed the generator the top 4 to 10. Chunk count at the generator is where lost-in-the-middle lives. Chunk count at the retriever is where recall lives. They are separate dials.

Applied to the paper-summary task, that recipe looks like this. Your corpus is ten thousand research papers. Each paper splits on its section headers, with a 512-token ceiling inside each section. A typical paper yields twenty to thirty chunks. At retrieval time, a query (“what methods did Ravichander et al. use to evaluate extrinsic hallucinations”) goes through the biencoder to pull the top 50 chunks across the whole corpus, then through the cross-encoder to rerank to the top 8, then into the generator along with the question. The answer the generator produces is built from eight clean, section-bounded, well-embedded chunks. Every link in the pipeline is doing its own job. The chunker’s job was the shape of the chunks. The retriever’s job was recall. The reranker’s job was precision. The generator’s job was synthesis. If any one of them is broken, the others cannot rescue it.
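The funnel in that recipe can be sketched with two pluggable scoring functions. `fast_score` and `slow_score` are hypothetical placeholders for a biencoder similarity and a cross-encoder score. The toy versions below use word overlap only so the example runs without any model, and the chunk texts are invented.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]      # (query, chunk) -> relevance score

def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    fast_score: Scorer,                   # stand-in for the biencoder
    slow_score: Scorer,                   # stand-in for the cross-encoder
    n_candidates: int = 50,               # recall dial
    n_rerank: int = 25,                   # how many candidates get the expensive score
    n_context: int = 8,                   # what the generator actually sees
) -> List[str]:
    candidates = sorted(chunks, key=lambda c: fast_score(query, c), reverse=True)[:n_candidates]
    reranked = sorted(candidates[:n_rerank], key=lambda c: slow_score(query, c), reverse=True)
    return reranked[:n_context]

# --- Toy scorers (assumptions for illustration only) ---
def overlap(query: str, chunk: str) -> float:
    return float(len(set(query.split()) & set(chunk.split())))

def overlap_density(query: str, chunk: str) -> float:
    return overlap(query, chunk) / len(chunk.split())

corpus = [
    "evaluation methods for hallucination and many other unrelated things here",
    "hallucination evaluation methods",
    "dataset statistics",
]
top = retrieve_then_rerank(
    "hallucination evaluation methods", corpus,
    overlap, overlap_density,
    n_candidates=3, n_rerank=2, n_context=1,
)
print(top)  # ['hallucination evaluation methods']
```

Keeping `n_candidates` and `n_context` as separate parameters mirrors the point above: you can widen the first for recall without lengthening what the generator has to read.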

One more editorial call. The 2024 to 2026 literature on chunking is, by academic standards, scattered and underpowered. Operator-community writing (Kamradt, the LangChain docs, the Anthropic blog) got to most of the operational insights before peer review caught up. The peer-reviewed work that has arrived is often replication rather than new invention. This is a field where reading operator blogs alongside papers gives you a better map than papers alone. That is a defensible posture in 2026. The literature will probably tighten up by 2028 as rigorous comparative evaluations become standard.

The thing not to do is treat chunking as a solved problem. It is not. The defaults above work for common cases, fail quietly on edge cases, and need to be measured in your specific system before you trust them. The discipline is the same as with any other engineering decision: have a default, have a measurement, and be willing to move when the measurement tells you to.

The useful handle: the chunk is a unit of meaning made visible to a stateless function. Cut it where meaning lives. Measure what your generator does with it. Everything else is a consequence of those two choices.

Anthropic. 2024. Contextual Retrieval. Anthropic News. https://www.anthropic.com/news/contextual-retrieval.

Günther, Michael, Isabelle Mohr, Daniel James Williams, Bo Wang, and Han Xiao. 2024. Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv:2409.04701. https://arxiv.org/abs/2409.04701.

Ji, Chengyuan, et al. 2025. Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation. arXiv:2504.19754. https://arxiv.org/abs/2504.19754.

Kamradt, Greg. 2024. 5 Levels of Text Splitting. Public notebook / talk. https://github.com/FullStackRetrieval-com/RetrievalTutorials.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.11401.

Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.

Liu, Shuo, et al. 2025. HiChunk: Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking. arXiv:2507.09935. https://arxiv.org/abs/2507.09935.

multiple authors. 2025. “Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support.” JMIR AI (Peer-Reviewed). https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/.

Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019), 3982–92. https://arxiv.org/abs/1908.10084.

Shi, Freda, Xinyun Chen, Kanishka Misra, et al. 2023. “Large Language Models Can Be Easily Distracted by Irrelevant Context.” Proceedings of the 40th International Conference on Machine Learning (ICML 2023). https://arxiv.org/abs/2302.00093.

Zhao, Jihao, Zhiyuan Ji, Pengnian Qi, et al. 2024. Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception. arXiv:2410.12788. https://arxiv.org/abs/2410.12788.