Show the model two or three examples of what you want, and it will copy the pattern for the fourth.
That is few-shot priming. In-context learning. Multi-shot prompting. Same mechanism under three names, depending on which research paper or vendor documentation you are reading. The previous chapter (Chapter 6) covered the policy layer [the always-included instructions that sit above your message] that sits above every request. This chapter covers the other big lever an operator has inside a single call: putting examples in front of the task.
Most readers arrive with one instinct: more examples is better, and the examples should be correct. Both halves of that are partially wrong, and the ways they are wrong matter. In 2022, Sewon Min and her colleagues ran a study that is still the single most-often-skipped finding in popular few-shot writing: demonstrations teach the label distribution rather than the task (Min et al. 2022). Once you internalize what that sentence is saying, several defaults you might have picked up from prompting tutorials start to look different.
The book’s running task, as a reminder, is pasting a forty-page research paper into the model and asking for five bullet points. Few-shot is the lever you reach for when your instruction alone does not pin down the output shape. You show the model two paraphrased summaries of unrelated papers with the exact structure you want, then paste your actual paper at the bottom. The model copies the structure. It does this without being retrained, without being fine-tuned [further training on a specific task], without any weight updating at all. The whole trick lives inside the token sequence you ship on this one request.
The goal at the end is not a list of tricks. The goal is that when a task is underperforming, you can tell whether adding examples would help, whether they would hurt, how many to add, and where to put them. That diagnostic is the job.
7.1 What few-shot priming is
The mechanics are embarrassingly simple. You put a handful of input-output pairs into the prompt before the real query. The model, doing what it always does, continues the pattern.
Try a concrete case. You want the model to sort customer messages into three buckets: billing, technical, or account. A zero-shot prompt [one that gives no examples] is the instruction alone:
Classify the following message as billing, technical, or account.
Message: My card was charged twice this month.
Category:
A few-shot prompt prepends demonstrations:
Classify the following message as billing, technical, or account.
Message: My card was charged twice this month.
Category: billing
Message: The app crashes when I open my inbox.
Category: technical
Message: I need to change my email address.
Category: account
Message: Why am I being charged after I cancelled?
Category:
The model has not been retrained. No weights have been updated. The only thing that changed is the token sequence the model is conditioning on. From the function-of-tokens picture in Section 3.1, we fed the function a longer input and let the usual machinery produce the usual output distribution. What is striking, and what made the Brown 2020 paper (Brown et al. 2020) a watershed, is that on many tasks this conditioning recovers a large fraction of the accuracy you would get from fine-tuning the model on a dedicated training set. The model behaves as if it learns from the examples, even though nothing inside the model moves.
Brown’s term of art for this is in-context learning: learning-shaped behavior that happens inside the context window [the stretch of text the model can see on this request], on this one request, with no durable change to the model. We used the term already in Section 1.2 when defending the claim that the model does not learn from you. Few-shot priming is the same phenomenon, deliberately engineered. The “learning” is pattern continuation scoped to one request, gone the moment the tokens scroll out.
This matters directly for the paper-summary task. When you paste a forty-page paper and ask for five bullets, the instruction alone is not enough to pin down what shape of bullet you want. Two bullets per section, or five bullets for the whole paper? Each bullet a full sentence, or a verb-starting fragment? Tied to a specific page, or synthesized across the paper? Show the model two worked summaries of unrelated papers with the exact bullet shape you want, and the model copies the shape without needing you to describe it in prose. That is in-context learning doing what in-context learning does.
7.1.1 Where in the training rundown this comes from
We said in Section 3.2 that every prompting technique reaches into one of the three training stages: pre-training, supervised fine-tuning [SFT, training on curated examples of good instruction-following], and preference alignment [a final pass that nudges the model toward answers humans rate higher]. Few-shot is the clearest case of a technique drawing on pre-training specifically.
The model spent months ingesting trillions of tokens of human text. Fifteen trillion tokens of training corpus is, in human terms, about thirty million novels, or roughly every book in every library of a mid-sized city, read end to end, thirty times. A lot of that text has the shape “pattern, pattern, pattern, continuation.” Lists with headers and items. Dictionaries of paired entries. Translation pairs. Problem-and-solution pages from textbooks. Q&A forums. Code with doctests. Predicting the next token in those documents required the model to notice when a local region of text was patterned, extract the pattern, and continue it. Few-shot priming calls that learned capability up by handing the model exactly the shape of input it saw a trillion times during pre-training: a few complete pairs, then an incomplete one.
SFT and preference alignment change what the model does when given instruction-shaped inputs. They do not, by themselves, create the few-shot capability. Ekin Akyürek and colleagues studied this directly in 2023, training small transformers from scratch on toy tasks and then looking inside them (Akyürek et al. 2023). What they found is striking: the models were implementing something mathematically equivalent to gradient descent or ridge regression inside their own activations. A compressed “mini-optimizer” was running at inference time on the examples in the prompt. The model was not looking up the answer. It was solving, from scratch, the tiny learning problem the examples posed, and doing so inside a single forward pass.
That work is on toy linear tasks, not frontier LLMs. Whether Claude and GPT-5 do something analogous, or something entirely different that looks similar at the behavioral level, is an open interpretability question. We do not know. What we do know is the direction: in-context learning is not magic, and it is not a trick bolted on by instruction-tuning. It is an emergent computation that the pre-training loss [the score the model was trained to minimize while reading the internet] shaped the model into being able to perform.
7.1.2 The spectrum: zero, one, few, many
The vocabulary is loose. Brown et al. 2020 distinguished zero-shot (instruction only), one-shot (one example), and few-shot (a handful of examples, empirically up to around 64 in the original paper, though most of the gains saturated well before that). In 2026 usage:
- Zero-shot means no examples in the prompt. The instruction carries the whole load.
- One-shot means one example. Useful when the output shape is unusual and a single demonstration is enough to pin it down.
- Few-shot conventionally means 2 to 32 examples.
- Many-shot is a newer category, introduced by Rishabh Agarwal and colleagues at NeurIPS 2024 (Agarwal et al. 2024), meaning hundreds to thousands of examples, now practical on long-context models.
The spectrum matters because the optimization characteristics change as you move along it. At small k, adding another example often helps substantially. By k = 10 on a well-specified task, further additions tend to produce diminishing returns. At large k, on long-context models, Agarwal found that gains kept accruing on tasks where the knowledge is genuinely hard to compress into a policy: low-resource translation, specialized mathematical reasoning, non-natural-language algorithmic tasks.
Two years ago, “few-shot” meant three examples. On a 2026 model with a one-million-token context window, about the length of eight copies of Crime and Punishment, or the entire text of an undergraduate degree’s worth of reading, “few-shot” can mean five hundred examples. That is not a rebranding. It is a shift in what is possible, and we will take it up again at the end of the next section.
7.2 When few-shot helps
Few-shot priming earns its keep in a narrow but important set of cases. The failure mode of most prompting tutorials is to treat it as a universal upgrade, which it is not. A cleaner framing: few-shot helps when the instruction alone cannot fully specify the task, and demonstration fills the gap.
Three canonical shapes.
Label vocabulary and output shape. You want classification into exactly the labels billing / technical / account. You want dates in ISO format. You want a JSON object with specific keys and no extras. You want bullet points, each starting with a verb, exactly three of them. Describing these constraints in prose is verbose and error-prone. Demonstrating them with two or three examples is compact and unambiguous. The output shape is under-specified by the instruction; examples specify it cheaply.
Domain-specific style or scoring. Rate this research abstract’s methodological rigor on a 1-to-5 scale. Summarize this legal brief in the tone our firm uses. Rewrite this marketing copy in our house voice. You could describe the style in a paragraph, but a sample of the output, with its idiosyncratic cadence, hedging, or terminology, communicates more per token than any description ever will. This is where “few-shot as style transfer” lives.
Unusual mappings. The task is mechanical but not common in training data. Convert these academic citations to our internal reference format. Normalize these SKUs. Extract the second figure from each caption. Demonstration fills in the rule faster than prose does.
The paper-summary task lands in the first two shapes. “Give me five bullets, each tied to a specific page, each starting with the verb the paper uses” is an output-shape problem. Describing that in prose takes a paragraph of careful wording. Showing two worked summaries with the right structure takes the same number of tokens and leaves no ambiguity. If you also want a specific voice, dry for a methods-focused summary, readable for a popular-audience summary, that is the second shape. Both are good matches for few-shot. What few-shot will not do on this task is supply facts about the paper that the paper itself does not state. That job belongs to retrieval (Chapter 10), not demonstration. A reader who asks few-shot to do retrieval’s job will get fluent, confident, subtly wrong bullets, and we already know from Section 1.4 where that failure mode comes from.
Brown et al. 2020 originally claimed that few-shot approached fine-tuning on many benchmarks (Brown et al. 2020), which for 2020-era models was striking. The 2026 operator’s version of the claim is narrower: few-shot is the cheapest way to pull the model’s output into a specific under-specified shape, and it is competitive with fine-tuning only in that restricted sense. When the task requires knowledge the model does not have, few-shot will not supply it; you need retrieval. When the task requires capability the model does not have, few-shot will not manufacture it; you need a bigger or better model. The trick has limits, and the limits are predictable.
7.2.1 How many shots is the right number
For small-k few-shot, a practical default is 3 to 8 examples. Add one or two that deliberately cover edge cases, then stop. Past roughly 10, on well-specified tasks, you are spending context budget for diminishing quality. Every shot is tokens the model has to process and pay for, and you are displacing room for the actual input. On our paper-summary task, three example summaries at 200 tokens each already costs 600 tokens you cannot spend on the paper itself. If the paper is 30,000 tokens long and the model’s context is 200,000 tokens, this is fine. If the paper is 180,000 tokens, those shots matter.
The many-shot regime changes the calculus for specific tasks. Agarwal et al. 2024 reported continuing gains through hundreds and sometimes thousands of examples on Gemini 1.5 Pro and similar long-context models (Agarwal et al. 2024). The places it pays:
- Low-resource translation, where a book’s worth of parallel sentences demonstrably beats a page of them.
- Complex algorithmic or mathematical tasks, where a hundred worked solutions teach distributional features the instruction cannot state.
- Tasks with a large or evolving label space, where enumerating labels in the instruction becomes impractical.
Many-shot does not help on tasks where a well-written instruction plus a handful of examples already saturates the quality curve. Most production classification and extraction falls in that regime. The paper-summary task falls in it too: five worked summaries of past papers do not beat three, because past the third the model has already learned the shape. A sensible 2026 heuristic: start with 3 to 5 shots, measure, escalate to many-shot only if a real eval set shows headroom.
That callout is worth rereading before we move on. Most prompting guides treat each demonstration as a training example and fret over whether each one is correctly labeled. The Min finding reframes the whole thing: a demonstration is not a datum the model updates on. It is a sample from the distribution of valid inputs paired with a sample from the distribution of valid outputs. The model extracts the shape. Follow-ups to Min 2022 have tightened the picture, with a persistent small gain from correct labels on larger models, but the core result has held up, and it remains the most often omitted point in popular few-shot writing.
Here is the practical version for the paper summary. Most operators picking demonstration summaries agonize over choosing papers they have summarized carefully, with perfect bullet labels. Min says they are mis-spending the effort. The higher-leverage move is to choose two demonstration papers whose input shape matches the paper you are about to summarize, lengthy research papers with sections and figures, and two demonstration summaries whose output shape matches what you want, five verb-starting bullets with page cites. Whether each bullet is optimally crafted matters less than whether the paired input and output look like the task you are about to run. Pick for distribution coverage first, polish individual bullets second.
7.3 When few-shot hurts
The failure modes are specific and have been studied in detail. An operator who reaches for few-shot as a reflex will, on certain tasks, degrade performance rather than improve it. Three ways this happens.
7.3.1 Order sensitivity
Yao Lu and colleagues ran a 2022 study that is easy to find discomforting (Lu et al. 2022). The same set of few-shot examples, reordered, could swing a model’s accuracy from near-random to near state-of-the-art. No change to the examples themselves. Just the order. The effect was particularly severe on smaller models, but it has not disappeared entirely on frontier ones.
Why does this happen? Tony Zhao and colleagues had already mapped the mechanism a year earlier (Zhao et al. 2021). LLMs have systematic biases in what they predict at the start of a completion. Three of them in particular. Majority-label bias: whatever class showed up most in the demonstrations, the model over-predicts. Recency bias: the label of the last example gets over-predicted for the query. Common-token bias: tokens the model saw a lot during pre-training get elevated probability regardless of context. The order of examples interacts with all three. Put three billing examples last and the model will hallucinate billing into the next classification. Put an unusual class first and the model may under-predict it for the rest of the prompt.
Here is what that actually looks like in practice on the paper-summary task. Suppose you show the model three worked summaries as demonstrations, and the last one happens to be a methods-heavy summary full of phrases like “the authors establish” and “the experimental setup.” You paste a theory paper that has no experiments at all. The model, primed by recency, reaches for “the authors establish” phrasing and invents an experimental setup that does not exist. Not because the model misunderstood the paper. Because the order of the demonstrations biased its output distribution before it read the paper. The summary will be confident, fluent, and partly fabricated. And the fix is not a better instruction; it is a different demonstration ordering.
Practical mitigations:
- Balance the class distribution across your examples. If you have three labels, include roughly equal counts.
- Randomize order, or measure across permutations. For eval runs, sampling a few orderings and averaging the scores takes the variance out.
- Place the most structurally similar example closest to the query. Recency bias is not just a bug; if your query is a short complaint, a short-complaint example at the end primes better than a long-complaint example does.
- Consider contextual calibration. Zhao et al.’s proposed fix: compute the model’s output on a content-free query (“N/A”, or an empty string) with the same demonstrations, use that output as a bias estimate, then subtract it from the real query’s output distribution. This is heavier machinery than most operators want, but it matters for production systems where few-shot accuracy is load-bearing.
7.3.2 Poor selection
Which examples you pick matters, not just how you order them. Examples that are too easy teach nothing. Examples that are too hard push the model into the wrong corner of its distribution. Examples drawn from one distribution (short messages) when the queries come from another (long messages) give the model the wrong priming. And the selection interacts with calibration: if your examples over-represent one label, you bake in Zhao’s majority-label bias for the whole prompt.
The cleanest pattern here is retrieval-based few-shot. At request time, pull the most similar examples from a larger pool, using embedding similarity how close two pieces of text are in the meaning-as-a-direction space from 3.1 between the query and your candidate examples. This is not strictly a prompting topic; it sits closer to retrieval-augmented generation (Chapter 10). But it is worth naming here. Retrieval-based few-shot is the 2025 production default for anything load-bearing with a well-curated example pool. It dodges order sensitivity by making the examples topically closer to the query, which reduces the variance ordering can introduce. It dodges selection bias by letting the query itself pick its neighbors.
For the paper-summary task, the retrieval version looks like this. You have a library of fifty past papers you summarized well, each with its finished bullet list. At request time, you embed the abstract of the new paper, search for the three most similar past abstracts, and use their summaries as the demonstrations for this call. A theory paper gets theory-paper demonstrations. A methods paper gets methods-paper demonstrations. The model sees examples that match the shape of the thing it is about to summarize, instead of seeing whatever three you manually picked the last time you edited the prompt.
7.3.3 Frontier instruction-tuned models often need less than you think
This is the point where 2020 conventional wisdom and 2026 operator reality have diverged sharpest. On GPT-3, few-shot was the primary way to get decent performance, because the base model was not instruction-tuned and zero-shot instruction-following was weak. On Claude Sonnet 4.6, GPT-5, Gemini 3.1 Pro, and even mid-size open-weight models like Qwen 3.5 14B, the zero-shot baseline on a well-written instruction is dramatically higher, because SFT and preference alignment have already demonstrated good task behavior into the weights.
The practical consequence: for ordinary natural-language tasks, summarization, Q&A, conversational tone, general classification with common labels, the gap between zero-shot and five-shot on a 2026 frontier model is often within noise. Adding examples can still help, but can also hurt. The examples displace context budget, can introduce Zhao’s biases, and can lock the model into a surface form that fit the demonstrations and not the real query.
The claim “frontier models no longer need few-shot” is sometimes stated flatly in blog posts and vendor-adjacent content. Run the Sagan checks on it for a moment. Reproducible? Partly; the zero-shot-near-parity finding shows up in many places, but it is specific to common-label, common-shape tasks. Simpler explanation? Yes: instruction-tuning already taught the model most of the format signal that few-shot used to carry, so the marginal gain of few-shot shrank on exactly the tasks where the format signal was already baked in. Who benefits from it being true? Anyone selling a frontier API that charges by the token: fewer demonstration tokens on every request, same revenue per request. Hold the claim at the correct level of strength. Few-shot is still necessary for idiosyncratic output formats, niche domains, and tasks with a specific label vocabulary. What has changed is the default lever. When a general natural-language task is underperforming on a frontier model, reach for clearer instructions before you reach for more examples.
7.4 Where examples should live
Operators routinely ask where to put their examples. In the system prompt? In the user turn? Split across both? The mechanical answer is that it does not matter to the model. The whole messages array flattens into a single token sequence at the input layer (Section 1.1), and the model conditions on all of it. Putting examples in the role: system message versus the role: user message changes which role tags wrap the tokens but does not change the fundamental few-shot mechanics.
The practical answer is about reusability and cache economics, not mechanics.
Examples in the system prompt are the right home when the examples apply to every request. A customer-support agent that should always classify intents into the same labels. A code-review assistant that should always produce critique in the same shape. The demonstrations encode policy that does not vary per call. Put them in the system prompt and they ship once per conversation, or once per unique system prompt if you are serving many users. On vendors that cache prompt prefixes, they cost you a fraction of a normal prompt after the first request. Anthropic’s recommended pattern for this is the <examples> or <example> block inside the system prompt, with tags (Anthropic 2026).
Examples in the user turn are the right home when the examples are specific to this request. A user asking for a translation provides two reference sentences. A one-off extraction task where the demonstration is drawn from elsewhere in the same document. Any case where the demonstrations would be wrong for the next request.
Stop thinking of the question as “few-shot versus system prompt” and start thinking of it as “reusable versus per-call.” The decision rule gets cleaner. Everything the model sees is the context window. The system prompt is just the part of the context window that ships with every request. Examples that belong everywhere go in the system prompt. Examples that belong on this request live in the user turn.
For the paper-summary task, the distinction is concrete. If you are always summarizing research papers into the same bullet shape, the two demonstration summaries belong in the system prompt, and every call pays for them once via cache. If you are sometimes summarizing research papers, sometimes summarizing legal briefs, sometimes summarizing news articles, the demonstrations are not reusable across requests and belong in the user turn, picked fresh for each call, ideally by the retrieval-based-few-shot trick from the last section.
A second consideration: the system prompt is usually small relative to the full context budget, but there is a ceiling. Past roughly 5 to 10% of the window, system-prompt growth hits diminishing returns and starts to displace useful per-request content. If your demonstrations are large, say 50 many-shot examples of 100 tokens each, adding 5,000 tokens in total, stuffing them into the system prompt on every request is wasteful. Retrieval-based few-shot in the user turn, or a separate tool call that fetches the demonstrations as needed, gets you the benefit at a fraction of the cost.
7.4.1 The system prompt is already doing demonstration work
A careful reading of the system-prompts chapter (Chapter 6) suggests the line between “instruction” and “example” is not as clean as prompting tutorials imply. A system prompt that says:
You are a support agent. Respond concisely. When a user reports a billing issue, always include their account status in the response.
is carrying demonstration-shaped priors even though no explicit example is in sight. “Concisely” is a single word; the model infers from SFT what concise means. “Always include account status” is an instruction, but it works because SFT plus RLHF [reinforcement learning from human feedback, the preference-alignment step that nudges the model toward answers humans rate higher] trained the model on thousands of examples that established “instruction, obedient response” as the dominant continuation. From the model’s perspective, the SFT corpus was one giant, implicit few-shot prompt that runs on every forward pass, regardless of what you typed (Ouyang et al. 2022).
This is why, on a well-aligned chat model, you can often get 80% of the gains of few-shot by writing a more careful instruction. The model has already been few-shot-primed by its training. Your in-context examples add signal at the margin, not from scratch. The paper-summary prompt that simply says “Give me five bullets, each starting with the verb the paper uses, each tied to a specific page” is leaning on thousands of SFT examples the model saw during training where instructions of exactly that shape produced outputs of exactly that shape. You are the latest instance of a template the model has seen uncountable times before.
7.5 Emergence, scaling, and the case against magical thinking
Jason Wei and colleagues made few-shot competence the poster child for the emergent abilities framing in 2022 (Wei et al. 2022): the observation that certain capabilities appear only above a training-compute threshold, roughly 10^22 FLOPs [floating-point operations, a unit for counting how much arithmetic was done during training] for easier tasks and 10^24 for harder ones. Below threshold, a smaller model few-shots at near-random accuracy. Above threshold, performance jumps. The curves in the Wei paper are striking. They look like phase transitions, flat stretches followed by sudden climbs.
Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo pushed back hard in 2023 (Schaeffer et al. 2023). Their argument, in a sentence: the appearance of emergence is an artifact of discontinuous metrics. On tasks scored with exact-match or multiple-choice accuracy, where a response either gets full credit or zero, smooth underlying improvement in the model’s token-level probabilities produces apparent step-function behavior at the metric level. Swap in a continuous metric (cross-entropy [how surprised the model was by the correct answer], or token-level exact-match proportion) and the emergence largely vanishes, replaced by smooth scaling curves that the Kaplan and Chinchilla laws we covered in Section 3.2 predicted.
Both camps are correct about something different. Wei et al. are correct that a practitioner running a specific benchmark at a specific scale will observe what looks like a phase transition. Schaeffer et al. are correct that the underlying capability is improving smoothly, and the phase-transition appearance is a metric choice. Both can be right at different layers of what is actually happening. This has three implications for the operator:
- The practical observation is robust. Smaller models, 7B parameters and below in 2026, though the threshold keeps dropping with architectural improvements, few-shot worse than larger models on complex tasks. Sometimes dramatically worse. An operator running Qwen 3.5 4B and expecting it to match Claude Sonnet on a 5-shot domain task will be disappointed. This is not mysticism; it is a scaling reality you can plan around.
- The theoretical claim, a qualitative phase transition, is contested. Nothing special happens at 10^22 FLOPs. The curve is smooth underneath; the scoring is stepped on top.
- Claims of “emergent capability at 2B” in vendor marketing should be read skeptically. Run the Sagan checks. Is it reproducible? Usually on one specific benchmark, which is the first hint. Does a simpler explanation fit? Yes: the benchmark’s metric happens to light up at that scale, not the model’s underlying function. Who benefits from it being true? The vendor selling the 2B model. None of this means the model is not useful. It means the specific rhetorical shape “emergent capability at scale X” is often a metric artifact dressed up as a qualitative leap.
The token-level reformulation point is worth unpacking. The history chapter (Section 3.5) flagged that models cannot count letters in “strawberry” because the word was tokenized [chopped into chunks before the model ever saw it] as one or two chunks, and the individual letters were never in the input (Sennrich et al. 2016). Few-shot does not fix this. Showing the model three examples of “strawberry to 3 Rs” does not teach it to count Rs. It teaches it to produce the number 3 when asked about strawberries. Show it a fourth word whose letter count is different, and the pattern breaks.
The real fix is to ask the model to spell the word first, one letter per token, and then count. That is a reformulation, not a demonstration. Few-shot is a powerful tool, but it cannot reach past the tokenization boundary. When a task is failing because of token-level visibility issues, adding examples will paper over the specific tested cases and break on everything else. The right diagnostic question to ask first is not “should I add more shots?” It is “does the information the model needs exist in its input?” If the information is not in the input, in a form the model can see, no amount of demonstration will conjure it.
For the paper-summary task, the tokenization boundary matters less, because the paper itself is in the input in the form the model expects: words, sentences, paragraphs. What you do have to worry about is whether the specific load-bearing facts of the paper, a citation, a specific number, a named method, are actually present in the tokens you handed the model, or whether the formatting of the uploaded PDF stripped them into something illegible. Few-shot will not rescue a paper whose key numbers got lost in upload. Nothing in-context will.
7.6 Putting it to work
Five takeaways worth carrying out of the chapter.
Few-shot is format-and-distribution priming, not teaching. Min 2022 (Min et al. 2022) is the must-cite. Your demonstrations are shaping the output space, not supplying the knowledge. Act accordingly: invest time in examples that cover the full input distribution and the full label space, and worry less about polishing each individual label.
Reach for zero-shot first on frontier chat models. Measure before adding shots. The instruction-tuned priors are strong enough that the marginal value of a few-shot demonstration is often smaller than the instruction polish you could invest the same effort in.
Order matters, and calibration is a real phenomenon. If few-shot accuracy is load-bearing, sample across orderings on your eval set. Balance label distributions. Consider contextual calibration per Zhao 2021 (Zhao et al. 2021) for production systems where a few percentage points of accuracy mean real money.
Many-shot is the new frontier for hard tasks. Agarwal 2024 (Agarwal et al. 2024) changed what “few-shot” can mean now that long context is standard. For low-resource translation, specialized reasoning, and algorithmic tasks, hundreds of examples keep paying. For ordinary natural-language tasks, they do not. Know which regime you are in.
Examples belong where their reusability is. Reusable across requests: system prompt. Specific to this call: user turn. The model does not care; your token budget and cache economics do.
The thread running under all five is the same thread from Section 3.1: the model is a function from a token sequence to a distribution, and few-shot is one move for staging the token sequence. Like any move, it has regimes where it wins and regimes where it does not, and the operator’s job is to know which regime you are in before the request goes on the wire.