System Prompts | Working with Large Language Models | Synoros

A system prompt is a note you slip to the model before the user speaks, and the model was trained to read that note more carefully than anything that comes after.

That sentence is most of what you need. The rest of this chapter is the mechanism behind it and the practical consequences that follow. Operators treat the system prompt as the place where personality lives, where safety is enforced, where the “real” instructions go. None of those descriptions match anything special about the bytes on the wire. At the model layer, a system prompt is a prefix. A special kind of prefix, because of how the model was trained to treat it, but still just tokens at the top of the messages array, feeding into the same forward pass [the single pass through the model’s layers that turns an input sequence into a probability distribution over the next token] we walked through in Section 3.3.

Once you hold that framing, the interesting questions get sharper. Why does putting instructions at the top work better than putting them at the bottom? Why does Claude respond to XML-tagged sections while GPT does not care as much? Why does Anthropic’s shipping system prompt for Claude 4 run to roughly 23,000 tokens, when the prompt guides from 2023 told you to “be concise”? When should a behavior live in a system prompt instead of a fine-tune or a retrieval pipeline?

Back to the paper summary. When you paste a forty-page research paper into Claude and ask for five bullet points, the system prompt is the invisible first half of your request. The user turn (“summarize this paper”) is the part you see. The system turn is the part you do not. Whether you wrote that system turn or the product quietly supplied its default, the system turn sets the register of the answer (bullets or prose). It sets the model’s appetite for hedging. It decides whether the model invents citations when it cannot find the paper in its weights (Section 1.4). It decides how the model handles the part of the paper that runs to formal proof. Nothing about the forty pages of paper changes how the system prompt behaves. Everything about how the five bullets come back does.

6.1 Where the system prompt sits, and why that matters

A chat completion request, stripped to its bones, looks like this:

messages = [
    {"role": "system",    "content": "You are a careful technical editor..."},
    {"role": "user",      "content": "Review this paragraph for clarity."},
    {"role": "assistant", "content": "<prior response>"},
    {"role": "user",      "content": "Tighten the second sentence."},
]

The server serializes that array into a single token stream using the model’s chat template, a vendor-specific recipe for turning role-tagged messages into the flat sequence the transformer consumes. Every open-weight model ships a chat template embedded in its tokenizer config; every proprietary API applies its chat template server-side. The Hugging Face documentation on chat templates (Hugging Face 2026) is the neutral reference, and it is worth a look once because it breaks the “system prompt as magic field” illusion faster than any explainer.

After templating, our request turns into something like this (for a ChatML-style model, ChatML being OpenAI’s widely-copied flat-text format for wrapping role-tagged messages):

<|im_start|>system
You are a careful technical editor...
<|im_end|>
<|im_start|>user
Review this paragraph for clarity.
<|im_end|>
<|im_start|>assistant
<prior response>
<|im_end|>
<|im_start|>user
Tighten the second sentence.
<|im_end|>
<|im_start|>assistant

That is the input the model sees. One sequence of tokens. The system message is the first chunk in that sequence, wrapped in special delimiters the model was trained to recognize. Nothing else distinguishes it.
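The flattening step can be sketched in a few lines. This is a minimal illustration, not any vendor's actual template: `render_chatml` is a hypothetical helper, real chat templates are shipped with the tokenizer, and their exact whitespace differs from model to model.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten role-tagged messages into one ChatML-style string (illustrative only)."""
    parts = []
    for message in messages:
        # Every turn gets the same pair of delimiters; only the role name differs.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant header tells the model whose turn comes next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a careful technical editor..."},
    {"role": "user", "content": "Review this paragraph for clarity."},
]
flat = render_chatml(messages)
print(flat)
```

Nothing in the function treats the system message differently from the others. Whatever privilege it has comes from position and training, not from the serialization.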

So why does placement at the top matter, and why does the literature make such a fuss about it?

Two reasons, both rooted in how the model reads its own context. The first is attention geometry. Every position in the sequence attends to every earlier position, but attention is not uniform across the window. Think of the context window as a long row of seats in a lecture hall. The seats at the front and the seats at the back get remembered; the seats in the middle blur. Serial-position studies on LLMs (Guo and Vosoughi 2025) measured what operators have eyeballed for years: tokens near the start of the input and tokens near the end of the input get used more reliably than tokens in the middle. The middle is the dead zone. Lost in the Middle (Liu et al. 2024) showed the same effect from the retrieval angle. Plant a fact in the 40th percentile of a long context and the model misses it far more often than the same fact at the 5th or 95th percentile. Nobody fully understands why, but the pattern is stable across model families and has not softened with newer architectures. System prompts sit at the front of the window. The last user turn sits at the end. Both positions cash in the primacy and the recency effects. Anything you tuck into the middle is rolling the dice.
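One way to check the positional effect on your own model is to plant the same fact at different depths of a long context and compare how often the model retrieves it. The sketch below only builds the contexts; `plant_fact` and the sample fact are hypothetical, and sending each context to a model and scoring the answer is left to the reader.

```python
from typing import List


def plant_fact(filler: List[str], fact: str, percentile: float) -> List[str]:
    """Return a copy of `filler` with `fact` inserted at the given depth (0 = front, 100 = back)."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    index = round(len(filler) * percentile / 100)
    return filler[:index] + [fact] + filler[index:]


filler = [f"Filler sentence {i}." for i in range(100)]
fact = "The access code is 7421."  # made-up fact for the probe

# Build one context per depth; the 5/40/95 depths mirror the percentiles named above.
contexts = {depth: " ".join(plant_fact(filler, fact, depth)) for depth in (5, 40, 95)}
for depth, text in contexts.items():
    print(depth, len(text))
```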

The second reason is training. Instruction-tuning and preference-alignment data treats the system position as privileged. OpenAI formalized this as the instruction hierarchy (Wallace et al. 2024): during training, GPT-4o and later were shown adversarial cases where a user message tried to override a system message, and the reward signal taught the model to prefer the system instruction. Anthropic has not published an equivalent paper, but Claude’s shipping behavior reflects the same design. The system message is not just “the first message”; it is the message the model was trained to weight most heavily when instructions conflict. This is why prompt injection [an attacker smuggling instructions into the user-turn content so the model treats them as commands] is structurally hard to solve, as Section 19.1 argues in depth. The attack exploits the gap between a trained preference and a guarantee: the model has learned to favor system-role instructions, but it has not been given a hard rule, because there is no hard rule inside a trained preference.

The system prompt is a prefix, but it is not an ordinary prefix. Two things make it special: attention geometry rewards content at the front of the window, and instruction-hierarchy training teaches the model to weight system-role tokens above user-role tokens when they conflict. Neither is a hard constraint. Both are reliable enough to build on, and neither is strong enough to treat as security.

Pause on one number before moving on. Anthropic’s shipping system prompt for Claude 4, as of the May 2025 system card, is roughly 23,000 tokens (Anthropic 2025b; Willison 2025). Twenty-three thousand tokens is about sixty pages of prose. Imagine a short novella the model reads in silence before it sees the first word you type. That is over 11 percent of Claude’s 200,000-token context window, spent before anyone types a character. For the paper-summary task, the arithmetic is real: by the time your forty-page paper lands in the user turn, the model has already read Anthropic’s sixty-page note about how to be Claude. Everything after that is a conversation with a model that has been primed by a short book you never wrote. The 2023-era advice to “keep your prompt short” stopped applying somewhere around GPT-4-turbo. Production system prompts at frontier labs are long, detailed, version-controlled, and evaluated. The rest of this chapter is about why, and what the long ones contain.
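The arithmetic behind that share is worth doing once by hand. A quick check, using only the two figures quoted above:

```python
SYSTEM_PROMPT_TOKENS = 23_000    # approximate figure quoted in the text
CONTEXT_WINDOW_TOKENS = 200_000  # context window quoted in the text

share = SYSTEM_PROMPT_TOKENS / CONTEXT_WINDOW_TOKENS
remaining = CONTEXT_WINDOW_TOKENS - SYSTEM_PROMPT_TOKENS

print(f"system prompt share: {share:.1%}")                  # 11.5%
print(f"tokens left for the conversation: {remaining:,}")   # 177,000
```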

6.2 What actually works in a system prompt

The useful exercise is not to memorize a recipe. The useful exercise is to read system prompts that real labs ship and ask which moves do the work.

Anthropic has published the text of several Claude system prompts since August 2024 (Anthropic 2026b), and Simon Willison has been annotating them in public as they update (Willison 2024, 2025). Read either of those once with an eye for pattern, and the list below will stop feeling arbitrary.

Back to our paper-summary task. Picture writing the system prompt that will govern every “summarize this paper” call your team ever makes. What actually earns its tokens in that prompt?

Behavioral constraints earn their tokens. Lines of the form “never do X,” “always Y before Z,” “if condition, refuse” cost tokens but steer the model through whole categories of failure. Anthropic’s Claude 4 prompt contains constraints about how to handle image generation requests, what to do with certain identity claims, how to attribute sources. Each of those lines is there because something measurable went wrong in testing without it. The operator version is the same. A production CLAUDE.md the author maintains for agent work contains “never SSH to servers manually without permission” and “never reference NVMe device names” because both have burned the author in past runs. For the paper-summary prompt, the constraint might read “never invent a page number if the paper does not have one you can cite.” Specific prohibitions, stated once, with a clear condition, beat general exhortations.

Output format specification earns its tokens. If the consumer downstream of your prompt is a parser [the software that reads the model’s reply and tries to extract structured pieces from it], say so. If the model should use markdown headings, say so. If the assistant’s reply must include certain sections, enumerate them. The SFT stage [supervised fine-tuning, the step where the model learns from human-written example conversations, covered in Section 3.2] taught the model to take format instructions seriously. This is one of the highest-signal levers you have, and it is almost free in tokens. For the paper summary, “five bullets, one sentence each, every claim tagged with the page number it came from” does more work than any amount of polite prose.
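A format instruction pays off most when something downstream actually checks it. Here is a minimal sketch of such a check for the five-bullet summary. It assumes a page-tag convention of `(p. N)` at the end of each bullet, which is an illustration and not something the chapter's prompt specifies; `check_summary` is a hypothetical helper, and it does not try to verify the "one sentence each" rule.

```python
import re
from typing import List

# A bullet is "-" or "*", some text, then a trailing page tag such as "(p. 12)".
BULLET = re.compile(r"^[-*]\s+(?P<text>.+?)\s*\(p\.\s*(?P<page>\d+)\)\s*$")


def check_summary(reply: str, expected_bullets: int = 5) -> List[str]:
    """Return a list of format problems; an empty list means the reply parsed cleanly."""
    problems = []
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    if len(lines) != expected_bullets:
        problems.append(f"expected {expected_bullets} bullets, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        if BULLET.match(line) is None:
            problems.append(f"line {number} is not a bullet ending in a page tag")
    return problems


good = "\n".join(f"- Claim number {i}. (p. {i + 1})" for i in range(5))
bad = good + "\n- One extra bullet with no page tag."
print(check_summary(good))
print(check_summary(bad))
```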

Role and audience scoping earns modest tokens. “You are helping a senior software engineer who wants terse answers” or “you are writing for an intelligent non-technical audience” shifts register and verbosity reliably. What does not earn its tokens is the “you are the world’s greatest expert in…” trope. Studies of role prompting find small stylistic effects and near-zero effects on reasoning accuracy once you hold task difficulty fixed. Weng’s survey (Weng 2023) was restrained on this point in 2023, and the evidence has not shifted. Use role prompts to set voice, not to bluff the model into being smarter. Telling the model it is “the world’s leading expert on molecular biology” before handing it a molecular-biology paper will nudge the register of the bullets, nothing more. If the model does not know the content, the role label will not teach it.

Tool descriptions earn their tokens when tools are in play. Any system whose model has tool access needs the system prompt to do the scope-and-precedence work: when to reach for which tool, what the tools cost, what the tools do not do. We come back to the mechanics of tool use in Chapter 12; here the point is that tool descriptions are not optional documentation, they are load-bearing instruction. For a paper-summary agent that can also fetch referenced papers, the system prompt has to tell the model when to leave the summary of the main paper and reach out for a reference, or the agent will either fetch everything or fetch nothing.

What does not earn its tokens:

Encouragement has no measurable effect at frontier scale. “You are brilliant and careful” or “take a deep breath and think step by step” were 2023 folk moves that had measurable effects on 2023-era models (Weng 2023) and then stopped working. The instruction-tuning of 2024 onward absorbed the best of those priors. The marginal gain from repeating them shrank to noise. They are not harmful. They are vestigial. On smaller local models (below roughly 7B parameters), some of these moves still produce small lifts; on a frontier API model, they are camouflage.

It is worth running Sagan’s skeptic checklist against that last paragraph, because this is exactly the kind of claim that tends to get recycled from 2023 posts and re-injected into 2026 prompts. Is the claim that “take a deep breath and think step by step” helps the model reproducible on current models? Reports of its effect shrinking to noise have held up across multiple studies. Does a simpler explanation fit? Yes, the behavior the trick used to summon (careful multi-step generation) is now default behavior on instruction-tuned frontier models. Who benefits from the trick still being true? Prompt-engineering course sellers and LinkedIn posters who learned it three years ago. The honest read is that the trick is folklore. Leave it out.

Long personas unrelated to the task waste tokens. “You are a 47-year-old British archaeologist named Clive who…” is ornament. Unless the persona is the task (creative writing, role-play products), the model will apply the role’s voice and discard most of the rest under attention dilution. The paper-summary bullets do not get better because Clive is reading the paper.

Redundant statements of the same rule harm more than help. Anthropic’s public engineering doctrine on system prompts (Anthropic 2025a) is blunt on this: minimize functional overlap, maximize token efficiency. When three different sentences instruct the model to “be concise,” the model has three competing signals to resolve. The usual result is worse behavior on both the cases where one rule fits and the cases where another does, because the model is now averaging across a tangle.

High leverage, in order. Behavioral constraints. Output format. Tool descriptions. Role and audience scoping. Everything else is decoration. When the prompt is not producing the behavior you want, the first question is whether one of these four is missing, not whether to add more encouragement.

A worked example makes the point concrete. The same operator CLAUDE.md the author maintains opens with a block headed BEFORE YOU ACT, flagged as overriding default behavior, with seven numbered rules the agent is expected to check before starting any task. Each rule is a specific behavioral constraint: plan first for non-trivial work, keep context clean, verify before marking complete, ask rather than infer, sync docs, record lessons, push back when you disagree. The wording is declarative. The conditions are concrete. There is zero encouragement. Every rule exists because a prior failure made it necessary. Below that block, a NEVER DO block enumerates hard prohibitions. Then a tool-description block, structured as fallback orders so the model knows which tool to try first and which to treat as a last resort.

That structure is not an accident, and it is not unique to one operator. The same pattern shows up in Anthropic’s own system prompts, in OpenAI’s cookbook examples, in the OpenAI prompt-engineering guide (OpenAI 2026; Anthropic 2026a), and in production prompts the labs share privately with enterprise customers. Hard constraints at the top. Format and tool guidance in the middle. Examples and nuance at the bottom. That ordering maps one-to-one onto the attention-geometry and instruction-hierarchy mechanics we just covered. The front of the window gets the constraints because the front of the window is remembered best. The middle gets the softer guidance because the middle is where the signal thins. Ordering a prompt this way is not a style choice; it is the shape the mechanics force.
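The ordering described above can be made mechanical, so that nobody reshuffles the blocks by accident. A minimal sketch: `build_system_prompt` is a hypothetical helper, `BEFORE YOU ACT` is the heading from the operator file described earlier, and the other three headings are placeholders chosen for illustration.

```python
from typing import List


def build_system_prompt(constraints: List[str], output_format: str,
                        tool_guidance: List[str], examples: List[str]) -> str:
    """Assemble blocks in a fixed order: constraints, format, tools, examples."""
    blocks = []
    # Hard constraints go first, numbered so they can be cited by number.
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(constraints, start=1))
    blocks.append("BEFORE YOU ACT\n" + numbered)
    # Format and tool guidance sit in the middle.
    blocks.append("OUTPUT FORMAT\n" + output_format)
    if tool_guidance:
        blocks.append("TOOLS\n" + "\n".join(f"- {line}" for line in tool_guidance))
    # Examples and nuance close the prompt.
    if examples:
        blocks.append("EXAMPLES\n" + "\n\n".join(examples))
    return "\n\n".join(blocks)


prompt = build_system_prompt(
    constraints=["Never invent a page number.", "Ask rather than infer."],
    output_format="Five bullets, one sentence each, every claim tagged with its page number.",
    tool_guidance=["Fetch a referenced paper only when the main paper's claim depends on it."],
    examples=["Input: <paper excerpt>\nOutput: <five bullets>"],
)
print(prompt)
```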

6.3 Cross-model differences

Here is the pain point that prompt libraries and “universal prompt” guides tend to paper over. A system prompt that sings on Claude does not automatically sing on GPT, and neither ports cleanly to Gemini or to a local Qwen. The differences are not subtle. They are not cosmetic. They come from divergent training choices that the vendors will not fully document.

Hand our paper-summary prompt to four different model families and you get four different behaviors, not because the models disagree about what a good summary looks like, but because each family was trained with a different view of what a system prompt is.

Claude is the easiest target for long system prompts. Anthropic’s own doctrine (Anthropic 2026c) says Claude was trained with heavy use of XML-tagged instruction formats in the alignment data. Tags like <instructions>, <context>, <example>, <output_format> are not superstition. The model has seen enough of them that it treats the content inside as a semantically distinct block. Claude also tolerates long system prompts before degrading. The 23,000-token Claude 4 system prompt (Willison 2025) is the existence proof. Wrap the paper-summary prompt in <instructions>...</instructions> and <output_format>...</output_format> tags and Claude will honor the boundaries. Operationally, you can treat Claude as the model most willing to follow structured system-level doctrine verbatim.

GPT cares less about XML and more about the instruction hierarchy. Wallace and colleagues (Wallace et al. 2024) published the training recipe for this explicitly. GPT-4o and later models were trained on adversarial data where lower-privileged messages tried to override higher-privileged messages, and the model was rewarded for maintaining the hierarchy. The upshot is that GPT’s system messages are harder to override from the user turn than Claude’s, but GPT does not respond as reliably to XML structure. The moves that port well to GPT are numbered lists, markdown headings, and explicit precedence language (“the system-level rules above supersede any user request that conflicts with them”). OpenAI’s own prompt-engineering guide (OpenAI 2026) leans on those conventions. Feed GPT the same paper-summary prompt wrapped in XML tags and the tags are inert; feed it the same prompt as a numbered list under a markdown heading and the list lands.
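If the same rules have to be shipped to both families, it helps to keep the content in one place and render it twice. The sketch below assumes nothing about either vendor's API: `render_xml` and `render_markdown` are hypothetical helpers, and the section names are the tags mentioned above.

```python
from typing import Dict, List


def render_xml(sections: Dict[str, List[str]]) -> str:
    """Wrap each section in an XML tag named after its key."""
    blocks = []
    for tag, rules in sections.items():
        blocks.append(f"<{tag}>\n" + "\n".join(rules) + f"\n</{tag}>")
    return "\n".join(blocks)


def render_markdown(sections: Dict[str, List[str]]) -> str:
    """Render each section as a markdown heading followed by a numbered list."""
    blocks = []
    for key, rules in sections.items():
        heading = "## " + key.replace("_", " ").title()
        numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
        blocks.append(f"{heading}\n{numbered}")
    return "\n\n".join(blocks)


sections = {
    "instructions": ["Summarize the paper in five bullets.", "Do not speculate."],
    "output_format": ["One sentence per bullet.", "Tag every claim with its page number."],
}
print(render_xml(sections))
print(render_markdown(sections))
```

Keeping one source of truth and two renderers means an edit to a rule reaches both prompts, and the evaluation can compare the two renderings on equal content.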

Gemini sits awkwardly between. Google exposes system instructions as a separate field on the request (Google 2026), distinct from the message array, which at the API level looks cleanest of the three but in practice has a narrower budget and less public doctrine behind it. Observed behavior, as of April 2026, is that Gemini follows short system instructions well and drifts on long ones more than Claude or GPT. Gemini is also the strongest of the three on native multimodal handling, so cross-model portability of a vision-heavy prompt hits the opposite problem: Claude and GPT do not have the same multimodal affordances to port to. A paper-summary prompt that includes “treat any figure in the paper as part of the evidence” works differently on Gemini than on the other two, because Gemini’s multimodal priors are carrying more of the load.

Open-weight models (Llama, Qwen, DeepSeek, Gemma) obey their chat templates first and their training distribution second. The chat template is the ground truth. If the template wraps the system message in <|im_start|>system ... <|im_end|> (ChatML) or in Llama 2’s <<SYS>>...<</SYS>> tags, that is the framing the model was trained on. Deviate from it and the model stops recognizing the prompt as a system message at all. The Hugging Face chat-template docs (Hugging Face 2026) are the neutral reference. Beyond templating, smaller open-weight models (below the 30B-active-parameter threshold) are more sensitive to prompt phrasing, benefit more from worked examples, and tolerate less ambiguity. A Qwen 3.5 9B running on a workstation GPU will fail on a paper-summary prompt that Claude Sonnet accepts without comment, and adding two concrete before-and-after summary examples will usually recover most of the gap.

The honest summary, for operators moving a prompt between vendors: expect to retune. The moves that port worst are XML tag reliance (Claude-strongest), instruction-hierarchy leverage (GPT-strongest), long-prompt tolerance (Claude > GPT > Gemini > small open-weight), and multimodal idioms (vendor-specific). The moves that port best are explicit output format, behavioral constraints stated once in plain language, numbered lists for ordered priorities, and worked examples when the task is novel.

Cross-model portability is not a vendor claim to trust; it is a property to measure. Take the same prompt, run it against every target, and log the behavior delta. The alternative is discovering the difference in production when you switch providers for cost or latency and your downstream breaks.

The Synoros operator demo for this chapter is exactly that exercise. One system prompt: “summarize the paper in five bullets, no speculation, page-number cites for every claim.” One user turn: the same forty-page research paper on every run. Four models: one frontier model from Anthropic, one from OpenAI, one from Google, and one workstation-class open-weight model. Matched temperature [the knob that controls how confident the model’s next-word sampling is; higher means more varied output, lower means more deterministic]. What comes back is not the same shape on every model. One invents a citation for a paper it claims the main paper cited. One refuses to name a page unless the prompt tells it the paper has page numbers. One produces six bullets and apologizes for the extra one. One produces four and omits the methodology without explanation. The behavior delta is not uniform across models. It is not even uniform within a model across runs of the same prompt. The figure lives in the chapter’s companion notebook so readers can reproduce it on their own hardware. The takeaway is the methodology, not the specific numbers, because the specific numbers will shift with the next release cycle.
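The shape of that exercise fits in a short harness. This is a sketch under stated assumptions, not the companion notebook's code: each model is represented by a plain callable, the two stub models stand in for real API clients, and bullet count is used as a deliberately simple proxy for "behavior".

```python
from typing import Callable, Dict, List

ModelFn = Callable[[str, str], str]  # (system_prompt, user_turn) -> reply


def count_bullets(reply: str) -> int:
    """Count lines that start with a bullet marker."""
    return sum(1 for line in reply.splitlines() if line.lstrip().startswith(("-", "*")))


def behavior_delta(models: Dict[str, ModelFn], system_prompt: str, user_turn: str,
                   runs: int = 3, expected_bullets: int = 5) -> Dict[str, List[int]]:
    """For each model, record (bullets returned - bullets expected) on every run."""
    deltas = {}
    for name, call in models.items():
        deltas[name] = [
            count_bullets(call(system_prompt, user_turn)) - expected_bullets
            for _ in range(runs)
        ]
    return deltas


# Stub models, standing in for real clients: one over-produces, one under-produces.
def six_bullet_model(system_prompt: str, user_turn: str) -> str:
    return "\n".join(f"- point {i}" for i in range(6))


def four_bullet_model(system_prompt: str, user_turn: str) -> str:
    return "\n".join(f"- point {i}" for i in range(4))


report = behavior_delta(
    {"model_a": six_bullet_model, "model_b": four_bullet_model},
    system_prompt="Summarize the paper in five bullets, no speculation, page-number cites for every claim.",
    user_turn="<paper text>",
    runs=2,
)
print(report)
```

Recording every run separately, instead of averaging, keeps the within-model variation visible alongside the between-model variation.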

6.4 System prompts vs fine-tuning vs RAG

When a system prompt alone cannot pull the behavior you need out of the model, the next choices are fine-tuning [further training the model on your own dataset so its weights bake the behavior in] or retrieval-augmented generation [fetching relevant documents at request time and pasting them into the prompt so the model can read them], or some combination. These three are often presented as competitors. They are not. They solve different problems, cost different amounts, and interact in predictable ways.

Back to the paper summary. Three different things might be wrong with the bullets coming out of your summary prompt, and the right fix depends on which. If the bullets come back too long no matter what the prompt says, the fault is in the prompt; tighten it. If the bullets keep using the wrong vocabulary for your company’s field (biotech? law? a niche engineering subfield?), the fault is in what the model learned during training; a fine-tune can help. If the bullets get the paper’s claims wrong because the paper cites a 2025 method the model has never heard of, the fault is that the relevant knowledge is not in the weights; retrieval can fix that.

A clean frame:

A system prompt changes behavior per request. The cost is tokens and the time to write them. The cost to change is seconds.
Fine-tuning changes behavior per model. The cost is a dataset of examples, a training run, and the discipline to evaluate the result before shipping. The cost to change is a new training run.
RAG (retrieval-augmented generation, Lewis et al. 2020) changes what the model sees per request. The cost is an index, a retrieval pipeline, and the engineering to plumb the retrieved content into the prompt. The cost to change is reindexing.

Those three levers push on different parts of the pipeline from Section 3.2. Fine-tuning edits the weights. RAG edits the input. System prompts sit between: they are input, but they are input the model was specifically trained to treat as authoritative.

When should you reach for each?

System prompts first, always. They are the fastest, the most auditable (the prompt is the program, visible to anyone who opens the file), and the most reversible (a bad system prompt gets reverted with an editor; a bad fine-tune costs another training run). Every other mechanism should start from a properly written system prompt and measure what it does not solve.

Fine-tuning earns its cost when one of three conditions holds. First, when the behavior must persist across thousands of prompts and you cannot afford to resubmit the instruction every time. Second, when the behavior involves a style, domain register, or output format that cannot be described efficiently in words but can be demonstrated in hundreds of examples. Third, when the behavior is so idiosyncratic (a specific dialect of a company’s product language, for instance) that the base model’s priors actively fight it. InstructGPT (Ouyang et al. 2022) was fine-tuning at civilizational scale; what you do at application scale is the same mechanism at a much smaller data volume. If you can describe the change in two sentences and a system prompt gets you 90 percent of the way there, fine-tuning is not the first move.

RAG earns its cost when the information the model needs is not in its weights. Company-specific documents. User-specific data. Anything that changed after the training cutoff. Anything too voluminous, too specific, or too private to fit in a training corpus. RAG’s role is orthogonal to system prompts: a RAG system still has a system prompt, and the system prompt’s job in a RAG system is to tell the model how to use the retrieved content, how to cite it, what to do when retrieval returns nothing relevant. The retrieval pipeline and the prompt shape are separate engineering problems that happen to share a pipeline. Chapter 10 unpacks both in depth.
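The division of labor can be sketched as follows: retrieval supplies the documents, and the system prompt says how to use them, how to cite them, and what to do when there are none. Everything here is illustrative. `build_rag_messages`, the `<documents>` wrapper, and the rule wording are assumptions, and the retrieval step itself is out of scope.

```python
from typing import Dict, List

RAG_RULES = (
    "Answer only from the documents inside <documents>. "
    "Cite the document id in square brackets after each claim. "
    "If no document is relevant, say so instead of guessing."
)


def build_rag_messages(question: str, retrieved: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Combine fixed system-level rules with per-request retrieved content."""
    if retrieved:
        docs = "\n".join(
            f'<document id="{doc["id"]}">\n{doc["text"]}\n</document>' for doc in retrieved
        )
    else:
        # Make the empty case explicit so the model can follow the "say so" rule.
        docs = "(no documents retrieved)"
    user_content = f"<documents>\n{docs}\n</documents>\n\nQuestion: {question}"
    return [
        {"role": "system", "content": RAG_RULES},
        {"role": "user", "content": user_content},
    ]


with_docs = build_rag_messages(
    "What method does the paper use?",
    [{"id": "a1", "text": "The paper uses a randomized controlled trial."}],
)
without_docs = build_rag_messages("What method does the paper use?", [])
print(with_docs[1]["content"])
```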

The question worth asking before any of these: what prior, set during which training stage, is this request pushing against? If the user wants the model to produce JSON and the model keeps falling back to prose, the prior is an SFT format default. A system prompt or a constrained-decoding harness [a runtime wrapper that forces the model to pick only tokens that keep the output valid JSON, covered in Chapter 8] will fix it. If the user wants the model to recall a proprietary product name, the prior is the absence of a fact from pre-training. No system prompt will fix that; RAG will. If the user wants the model to refuse a category of request the vendor’s alignment explicitly trained the model to accept, the prior is preference-stage behavior, and even a system prompt instructing refusal may be unstable because the preference training pushes in the other direction. Fine-tuning, on a base or lightly-tuned model, is the real fix in that case.
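The three cases in that paragraph reduce to a small lookup, which can be handy as a triage checklist. The names below (`Prior`, `first_lever`) are hypothetical, and the mapping only restates the paragraph; it is not a complete decision procedure.

```python
from enum import Enum


class Prior(Enum):
    SFT_FORMAT_DEFAULT = "format default learned during SFT"
    MISSING_FACT = "fact absent from pre-training"
    PREFERENCE_BEHAVIOR = "behavior set during preference training"


# First lever to try for each kind of prior, as described in the text above.
FIRST_LEVER = {
    Prior.SFT_FORMAT_DEFAULT: "system prompt or constrained decoding",
    Prior.MISSING_FACT: "retrieval (RAG)",
    Prior.PREFERENCE_BEHAVIOR: "fine-tuning",
}


def first_lever(prior: Prior) -> str:
    """Return the lever the chapter recommends trying first for this prior."""
    return FIRST_LEVER[prior]


print(first_lever(Prior.MISSING_FACT))
```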

One practical observation worth naming. There is a prompt-length ceiling past which system-prompt additions hit sharply diminishing returns. The mechanism is the same Lost in the Middle effect (Liu et al. 2024) we hit earlier: as the system prompt grows, its middle expands, and content in the middle is used less reliably. Anthropic acknowledges this explicitly in their context-engineering doctrine (Anthropic 2025a), which is part of why even their 23,000-token prompt is structured as short, distinct blocks rather than flowing prose. The ceiling varies by model and by task, but the operator rule of thumb is that past roughly 5 to 10 percent of the context window, each additional system-prompt token starts to displace useful context from the user turn and the retrieved documents. For a 200,000-token window, that is 10,000 to 20,000 tokens of headroom before the marginal token starts costing more than it gives. Measure this for your own workload, because the number shifts with model generation, and the shape of the degradation is workload-specific.
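The rule of thumb is easy to turn into a check that runs whenever the prompt changes. A minimal sketch: the 5 and 10 percent thresholds are the ones named above, while `headroom_band`, `classify_prompt`, and the three labels are hypothetical. Token counts are assumed to come from whatever tokenizer your model uses.

```python
from typing import Tuple


def headroom_band(window_tokens: int, low: float = 0.05, high: float = 0.10) -> Tuple[int, int]:
    """Return the (low, high) token counts of the rule-of-thumb band for a context window."""
    return round(window_tokens * low), round(window_tokens * high)


def classify_prompt(system_tokens: int, window_tokens: int) -> str:
    """Place a system prompt relative to the band; the labels are illustrative."""
    low, high = headroom_band(window_tokens)
    if system_tokens <= low:
        return "below the band"
    if system_tokens <= high:
        return "inside the band: measure before adding more"
    return "past the band: each token is likely displacing useful context"


print(headroom_band(200_000))            # (10000, 20000)
print(classify_prompt(8_000, 200_000))
print(classify_prompt(15_000, 200_000))
```

The band is a starting point for measurement, not a limit. The text's own caveat applies: the number shifts with model generation and workload.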

6.5 Treating the prompt as code

If you accept the argument so far, the system prompt is not a trick. It is a program, written in natural language, that runs on a statistical runtime. The program shapes behavior. The runtime executes. The output depends on both. Programs that shape behavior get versioned. Programs that run in production get evaluated. Programs that affect many downstream consumers get reviewed.

That reframe has implications operators in 2026 are still catching up to.

Prompts belong in version control. Not in a Notion doc, not in a prompt-library sidecar tool, not pasted into Slack. The prompt is the highest-leverage configuration in the system. Treat it like the other highest-leverage configuration in the system. A diff on a prompt tells a future reader why a rule exists. A revert is a one-line operation. A blame on a misbehaving line surfaces the author. None of those are available on prompts stored outside a repo. For the paper-summary service, that means the exact string that governs “summarize this paper in five bullets” lives in git alongside the rest of the service code, with the history of every change it has been through.
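One small habit makes the repo history useful at debugging time: log a fingerprint of the exact prompt text with every request, so a bad output can be traced to the prompt version that produced it. A sketch, assuming the prompt text has already been read from a file in the repo; `prompt_fingerprint` and the log record fields are hypothetical.

```python
import hashlib


def prompt_fingerprint(prompt_text: str, length: int = 12) -> str:
    """Short, stable identifier for an exact prompt string."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:length]


SYSTEM_PROMPT = (
    "Summarize the paper in five bullets, no speculation, "
    "page-number cites for every claim."
)

# Attach the fingerprint to whatever request log the service already writes.
log_record = {
    "prompt_sha": prompt_fingerprint(SYSTEM_PROMPT),
    "model": "hypothetical-model-name",
}
print(log_record)
```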

Prompts deserve evaluations. Not “I tried a few examples and it felt good.” A small curated set of prompts-and-expected-behaviors, run before and after each prompt change, with the delta reported in the commit. For the paper-summary prompt, an evaluation set might be twenty real papers across three fields with hand-written reference bullets; every change to the prompt re-runs against those twenty and reports the shift. DSPy (Khattab et al. 2024) and MIPRO (Opsahl-Ong et al. 2024) go further by compiling prompts against an evaluation set automatically; those tools are worth knowing about, but the simpler and more important move is just having an eval set. Without evals, every prompt change is a vibe commit. With an eval set, prompt changes have measurable outcomes and the team can hold a position on what to ship.
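The "simpler and more important move" needs very little code. The sketch below is not DSPy or MIPRO; it is a hand-rolled harness with a deliberately crude metric (the fraction of required phrases that appear in the reply). The stub model, the single eval case, and all function names are made up for illustration.

```python
from typing import Callable, Dict, List

ModelFn = Callable[[str, str], str]  # (system_prompt, paper_text) -> reply


def score_reply(reply: str, must_mention: List[str]) -> float:
    """Fraction of required phrases that appear in the reply (crude on purpose)."""
    if not must_mention:
        return 1.0
    lowered = reply.lower()
    return sum(phrase.lower() in lowered for phrase in must_mention) / len(must_mention)


def run_eval(model: ModelFn, system_prompt: str, cases: List[Dict]) -> float:
    """Mean score of one prompt across the eval set."""
    scores = [score_reply(model(system_prompt, case["paper"]), case["must_mention"])
              for case in cases]
    return sum(scores) / len(scores)


def prompt_delta(model: ModelFn, old_prompt: str, new_prompt: str, cases: List[Dict]) -> float:
    """Score change from old prompt to new prompt; report this in the commit."""
    return run_eval(model, new_prompt, cases) - run_eval(model, old_prompt, cases)


# Stub model: only covers methodology when the prompt asks for it.
def stub_model(system_prompt: str, paper: str) -> str:
    reply = "- The paper reports a 12% accuracy gain."
    if "methodology" in system_prompt:
        reply += "\n- The methodology is a randomized trial."
    return reply


cases = [{"paper": "<paper text>", "must_mention": ["accuracy", "randomized trial"]}]
delta = prompt_delta(
    stub_model,
    old_prompt="Summarize in five bullets.",
    new_prompt="Summarize in five bullets. Always cover the methodology.",
    cases=cases,
)
print(f"score delta: {delta:+.2f}")
```

Swapping the stub for a real client and the phrase check for hand-written reference bullets gives the twenty-paper evaluation described above.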

Prompt libraries are an antipattern for production. The argument for a library (“we will reuse this system prompt across five applications”) rarely survives contact with the applications. Each use case has slightly different constraints, slightly different tool access, slightly different user audience. A library forces a lowest-common-denominator prompt that fits none of them. A per-project prompt, checked into the project’s own repo and evaluated against the project’s own task, is shorter, sharper, and easier to debug. The thing worth sharing across projects is the pattern (behavioral constraints first, format specification second, tool descriptions third), not the prose.

Prompts age. Model releases shift priors. A behavioral constraint that was necessary on Claude 3.5 Sonnet may be redundant on Claude 4.7 because the alignment tuning between versions absorbed the behavior into the base. The only way to discover which of your rules are now redundant is to re-evaluate on a rhythm and delete what the model no longer needs. Anthropic’s decision to publish its own system prompts and keep updating them (Anthropic 2026b) is partly so operators can compare versions and see which constraints the vendor itself has stopped needing. Your paper-summary prompt from two years ago probably contains rules that earn nothing on today’s models and cost tokens every call. Prune.

The prompt is the program. The model is the runtime. That framing from Section 3.5 returns here with a sharper edge. Programs live in repos. Runtimes change underneath them. Good engineering practice is the same for prompts as for any other high-leverage configuration: version it, test it, review it, and prune it when the runtime has moved on.

Anthropic. 2025a. Effective Context Engineering for AI Agents. Anthropic engineering blog, Sept 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.

Anthropic. 2025b. System Card: Claude Opus 4 & Claude Sonnet 4. Anthropic, May 2025. https://www.anthropic.com/claude-4-system-card.

Anthropic. 2026a. Prompting Best Practices (Claude API Docs). Current product documentation. https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices.

Anthropic. 2026b. Release Notes: System Prompts. Claude API Docs (first published Aug 2024, updated continuously). https://platform.claude.com/docs/en/release-notes/system-prompts.

Anthropic. 2026c. Use XML Tags to Structure Your Prompts. Claude API Docs. https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/use-xml-tags.

Google. 2026. Gemini API: System Instructions. Google AI for Developers. https://ai.google.dev/gemini-api/docs/system-instructions.

Guo, Xiaobo, and Soroush Vosoughi. 2025. “Serial Position Effects of Large Language Models.” Findings of the Association for Computational Linguistics (ACL 2025). https://arxiv.org/abs/2406.15981.

Hugging Face. 2026. Chat Templates. Hugging Face Transformers documentation. https://huggingface.co/docs/transformers/main/chat_templating.

Khattab, Omar, Arnav Singhvi, Paridhi Maheshwari, et al. 2024. “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.” International Conference on Learning Representations (ICLR 2024). https://arxiv.org/abs/2310.03714.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.11401.

Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.

OpenAI. 2026. Prompt Engineering Guide. OpenAI platform documentation. https://platform.openai.com/docs/guides/prompt-engineering.

Opsahl-Ong, Krista, Michael J. Ryan, Josh Purtell, et al. 2024. “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” Proceedings of EMNLP 2024. https://arxiv.org/abs/2406.11695.

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155.

Wallace, Eric, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. OpenAI, arXiv:2404.13208. https://arxiv.org/abs/2404.13208.

Weng, Lilian. 2023. Prompt Engineering. Lil’Log. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/.

Willison, Simon. 2024. Anthropic Release Notes: System Prompts. Simonwillison.net, 26 Aug 2024. https://simonwillison.net/2024/Aug/26/anthropic-system-prompts/.

Willison, Simon. 2025. Highlights from the Claude 4 System Prompt. Simonwillison.net, 25 May 2025. https://simonwillison.net/2025/May/25/claude-4-system-prompt/.