Skills, Hooks, and the Orchestration Layer | Working with Large Language Models | Synoros

Everything that looks like an AI agent is a regular program that calls an autocomplete, over and over, and decides what to feed it each time.

Chapter 12 (Chapter 12) left us with a small piece of architecture. A stateless model. A tool schema [a JSON description telling the model what functions exist and what arguments they take]. A dispatcher [the harness code that reads a tool-shaped output and runs the matching function]. And a loop that keeps feeding the model whatever the tools produce. That loop is orchestration [the code that sits around a model call and decides what goes in, what happens, and what comes out] in its smallest possible form.

In a real system that loop grows. A lot. Picture all the code that has to live around one call to Claude when the call matters. Retries when the network flakes. A cheaper backup model when the expensive one is down. Retrieval, to stage the right documents. A scrubber that strips secrets before anything leaves your building. Policy checks. Hooks at specific lifecycle events [named moments in the model call where the harness pauses and gives your code a chance to run]. Instruction bundles that load for some tasks and not others. Subagents to go do side jobs. Timeouts. Token budgets. None of that is the model. All of it is code that runs around the model call.

Vendors have shipped four abstractions in 2025 and 2026 for structuring that surrounding code: skills, hooks, slash commands, subagents. The principles will outlast the syntax. Every specific recipe naming a vendor product will be wrong within twelve to eighteen months. The vendor docs will have moved. The file formats will have renamed themselves. Two of the products named may have folded into each other by the time you read this. The shape of the problem does not move.

The shape of the problem is the stateless function thesis from Chapter 1 (Section 1.1), applied one more time. The model has no state, no memory, no will. Everything that makes a production system feel coherent, a skill “knowing” when to apply itself, a hook “catching” a bad push, a subagent “handing off” a task, is harness code deciding which tokens to stage into the model’s context window on which call. Skills do not live in the model. Hooks do not live in the model. There is no agent. There is a program that calls the model, stages context, inspects the output, and decides what to do next. You are writing that program whether you realize it or not.

Keep the paper in view. You paste a forty-page research paper into a chat window and ask for five bullet points. In the simplest setup that is one model call and you are done. In anything more serious, the harness is doing real work around that call. It notices this looks like a paper-summary task and loads a summarize-research-paper skill, a workflow bundle the vendor would rather not bake into the base model. It runs a hook that watches the output for mention of a private study the paper cites. It spawns a subagent to fetch the references the paper lists, so the final summary can say which ones the model actually read and which ones it only guessed about. The user sees “one answer.” The orchestration layer ran half a dozen decisions to produce it.

13.1 What the orchestration layer is

Every LLM-backed product has an orchestration layer. Even the ones that are fifty lines of Python glue plus a try/except have one. The layer is whatever code sits between the user’s request and the model’s raw output, and does at least one of these jobs:

Decides when to call the model, and with what context
Decides which model (and which tools, retrieval, memory) to hand that context to
Inspects or transforms the model’s output before it reaches the user or another system
Handles failure: timeout, schema-invalid output, tool exception, rate limit, policy violation
Enforces invariants the prompt alone cannot reliably enforce

Toolformer (Schick et al. 2023) and ReAct (Yao et al. 2023) are the academic origins of this layer. Both studied it as a property of the model itself (can a model be trained to call tools? can prompting elicit interleaved reasoning and action?), but the implementation lesson operators keep relearning cuts the other way. The reliability of the overall system sits with the glue code, not with the model’s intrinsic tool-use ability. The model calls the wrong tool sometimes. The glue decides what happens next. The model hands back malformed JSON sometimes. The glue decides whether to retry, repair, or fail loudly.

The glue has had different names at different times. Agent framework. Runtime. Harness. Orchestrator. The terminology is noisy; the thing itself is stable. It is the code outside the model call. Call it the harness and move on.

Three questions organize the design of any harness:

What goes into the context for this call? Retrieved documents, tool schemas, prior turns, system prompt, memory, examples. This is the context-staging problem from Chapter 10, repeated per call with more moving parts.
What is allowed to happen on this call? Which tools can the model invoke. Which network calls. Which file writes. Which model. Allowlists, denylists, policy gates.
What do we do when the model produces something we did not expect? Retry, repair, escalate, abort, alert. Failure handling.

Most production bugs in LLM systems sit at one of these three seams, not in the model itself. Hand the paper-summary task a good prompt and good context and the model will produce reasonable bullets nine times out of ten. The tenth time, somebody on the harness side truncated the paper to fit the context window without telling anyone, or silently swapped to a cheaper model that handles long context worse, or stripped a hook that was meant to re-run bad summaries. The model was not “having a bad day.” The harness was. Knowing which of the three seams you are looking at cuts debugging time sharply.

The orchestration layer is where the bugs live. The model is not moody. It got handed a different context, or its output got handled differently. Every “the model got worse” or “it was working yesterday” complaint in a deployed system resolves at this layer more than half the time. Log the exact prompt and the exact output for every call, or you are debugging blind.

13.2 Skills, hooks, commands, subagents: four names for four patterns

By 2026 the vendor surface has converged on roughly four abstractions for packaging orchestration behavior. They overlap. Different vendors name them differently. Any specific vendor’s syntax will be stale by the time you read a book chapter about it. The patterns themselves are what matter.

13.2.1 Skills: loaded-on-demand instruction bundles

Picture the failure first. You want the same model to summarize a medical paper, review a legal contract, and code-review a pull request. Each task has its own workflow, its own vocabulary, its own “watch out for this” list. You have three bad options.

Option one: stuff all three workflows into the system prompt. Now every call pays the token cost of instructions it did not need. A casual “what’s the weather” chat drags along three pages of contract-review guidance it will never use. Option two: branch on the task type in application code and pick one prompt per branch. Now you have re-implemented the same branching pattern inside every application you ship. Option three: let the model figure out what to do from scratch each time. That is how you lose the reliability you were shipping the prompt for in the first place.

A skill is the shared fix. It is a bundle of instructions, optionally with accompanying scripts, that the harness loads into the model’s context only when the task looks relevant. Anthropic’s October 2025 release (Anthropic 2025b, 2025a) set the archetype. A skill is a directory with a SKILL.md file whose frontmatter describes what the skill does, and a body containing the instructions. OpenAI has been reported to be building a similar export path for GPTs. Agent Skills was donated to the Linux Foundation’s Agentic AI Foundation in December 2025 as an open standard.

The mechanism is straightforward once you say it plainly. The harness keeps a registry of skill descriptions. At each turn, it checks whether any skill’s description matches the current task, usually by asking the model itself, using the short descriptions staged into the system prompt. On a match, the full skill body slides into the context for that turn, and the model proceeds with those instructions in view.

Here is what that looks like for the paper. A well-written summarize-research-paper skill bundles the workflow you want the model to follow: scan the abstract, identify the contribution, flag methodology concerns, note which claims ride on which citations, output five bullet points with page-number anchors. Write it carefully and the body runs several thousand tokens, roughly a long magazine article’s worth of text. Without the skill pattern, those tokens either live in the system prompt on every call (expensive on casual chats, and wasteful the 95% of the time when the user is asking about something else) or get re-typed by the user on every paper (tedious, and inconsistent because nobody types the same instructions twice). The skill loads only when a paper-summary task shows up. The casual-chat calls pay nothing for it.

Two properties make this pattern useful. First, skill bodies can run long (hundreds or thousands of tokens) without permanently sitting in the system prompt, because they only load when relevant. Second, skill content is version-controlled text, so the operator knows exactly what instructions the model saw on a given call. Both properties address real failure modes: the system prompt bloating with every new capability, and the audit trail of “what did you tell the model” going opaque when instructions hide inside a vendor’s closed product.

Skills are a push abstraction. The harness pushes instructional content into context, usually with the model’s help identifying relevance. The model did not “ask for” the skill; the harness loaded it because the task matched. This distinguishes skills from tools, which are pull: the model sees tool schemas and chooses to invoke them (Chapter 12).

13.2.2 Hooks: deterministic rules around stochastic calls

A hook is code that runs at a specific lifecycle event of the model call. Typical events include PreToolUse (before a tool runs), PostToolUse (after), UserPromptSubmit (before the prompt reaches the model), Stop (when the model indicates it is done), and a variety of others depending on vendor. Claude Code’s hooks documentation (Anthropic 2026a) defines one concrete surface; other harnesses ship similar primitives under different names.

Hooks matter because they turn requests into rules. “Always run lint before a git push” stated in the system prompt is a request the model may or may not follow. “Always run lint before a git push” implemented as a PreToolUse hook on the git tool is a rule the harness enforces, whether the model wants to or not. Same content, two different layers of the stack, sharply different reliability guarantees.

Here is an operator heuristic worth memorizing: if you can write a unit test for the failure, you can write a hook that catches it. A prompt injection that smuggles an instruction to exfiltrate a file? A PostToolUse hook inspects every file path the model tried to read and blocks anything outside the project directory. A tool call that blows the time budget? A pre-tool hook denies the call before it fires. A response that leaks a secret? A Stop hook redacts the secret and re-runs the response. None of these require the model to behave correctly. They only require the harness to behave correctly, and harness code is code you can test.

Bring this back to the paper. In a bare summary workflow, hooks rarely fire. The task is self-contained. Read the paper, produce bullets, done. The picture changes the moment the harness grants the model tool access. A “summarize this paper and file the summary to my notes” workflow wraps real side effects. Now a PostToolUse hook that refuses writes outside ~/notes stops a single bad interpretation from scribbling over unrelated directories. Imagine the model reading the paper’s references, finding the string ~/projects/rewrite-this-file.md inside a code example the paper quotes, and helpfully deciding to follow that instruction. The hook sees the write-attempt, notices the path is not under ~/notes, and refuses. That is hooks doing the work they exist to do: containing the blast radius [the set of files, services, and users a single bad call can damage] of a call that strayed.

This is the canonical place where determinism enters an otherwise stochastic system. The model runs probabilistic. The hooks do not. The hooks are code.

13.2.3 Slash commands: user-invoked prompt templates

A slash command is a prompt template the user invokes deliberately, typically with a leading slash in the chat surface. /review, /commit, /deploy. They are skills’ twin, but flipped. User-triggered instead of auto-invoked. And usually without the model having to infer relevance, because the user already said what they wanted. Anthropic’s Slash Commands SDK (Anthropic 2026b) ships one implementation; ChatGPT, Gemini, and Cursor all ship variants.

The difference from skills is intention. A skill loads when the model or harness detects relevance. A slash command loads when the user asks. Slash commands run cheaper (no relevance-detection round-trip), more predictable (the user knows what they invoked), and less powerful (nothing happens unless the user types the magic word). In practice, operators use both. Slash commands for common deliberate tasks, skills for common situational ones. A user who summarizes a paper every morning might prefer typing /paper-summary as a one-word trigger, even though a skill could have inferred the same intent from the paste alone. The user gets to pick when the ritual is explicit and when it is ambient.

13.2.4 Subagents: isolated context windows for sub-tasks

A subagent (also: sub-task agent, child session, worker) is a separate model call, usually in a fresh context window, often with a restricted tool set, spawned to handle a specific sub-task. The parent session dispatches the sub-task with a tight scope. The subagent runs to completion. The parent receives a summary rather than the full transcript. Subagents are how large jobs get broken down without the parent’s context window blowing up.

No canonical academic paper exists for this pattern. It falls out of the decode-loop substrate. If a task needs ten tool calls, each with its own retrieved context, and you run them all in one session, the context window fills up with material the final summary does not need. Spawn a subagent for each call, let it summarize its result, discard the scratch context, and the parent stays clean. Claude Code, OpenAI’s Agents SDK, and Cursor’s background agents all ship variants.

Watch what happens when the paper-summary task gets ambitious. Suppose the paper cites thirty references and you want the final summary to reflect which of those the model actually checked. The naive approach fetches all thirty in the parent session. Now the context window holds the forty-page paper plus thirty fetched abstracts, most of which will not inform the final bullets. Your bill scales with the garbage. Worse, lost-in-the-middle [the measured tendency of long-context models to neglect information buried in the middle of their context window] kicks in, and the bullets the model finally produces may quietly miss the parts of the paper that got buried under fetched references. A subagent pattern fixes this. For each reference, spawn a subagent with just “here is the paper’s claim on page 7; does this reference support it?” in its context. The subagent fetches the reference, answers in one line (“supports” / “does not support” / “tangentially related”), and dies. The parent collects thirty one-line verdicts and uses them to anchor the five final bullets. The parent never sees the thirty fetched abstracts. Its context stays clean.

All four abstractions reduce to the same operation. Stage the right tokens in the right context on the right call. Skills stage instructions on demand. Hooks stage enforcement code around the call. Slash commands stage instructions on user-triggered demand. Subagents stage a separate context for a bounded sub-task. The model remains a stateless function. The harness remains the program. Four abstractions, one substrate.

All four reduce to context staging. If you remember nothing else from this chapter, remember that the model does not “have skills” or “respect hooks” or “understand subagents.” The model has a context window. The harness writes to it. Four names for four disciplines in which direction to write.

13.3 When to invest in orchestration, and when to just prompt harder

This is the decision most teams get wrong. Orchestration infrastructure is real code. It has bugs. It needs tests. Its maintenance burden competes with the maintenance burden of the rest of your product. Building elaborate skill and hook scaffolding around a workload that a 200-token system prompt could handle is the LLM-era equivalent of standing up a Kubernetes cluster to serve a static site.

Four conditions, any of which tips the calculation toward orchestration investment:

The same prompt ships many times per day. If a bad output costs little and it fires twice a week, just prompt. If it fires two hundred times a day and a small percentage of failures leaks to customers, the cost of the next bug exceeds the cost of proper orchestration. Write the hooks.
Failure modes are deterministic and catchable. “The model sometimes writes secrets into files” is catchable. “The model is sometimes less creative than we’d like” is not. Hooks add value where the failure has a shape; they add nothing where the failure is a vibe.
Context assembly is nontrivial. RAG pipelines retrieval-augmented generation, where the system fetches relevant text from a database and slips it into the prompt before calling the model, covered in 10, multi-document synthesis, memory systems (Chapter 14), long histories to trim. When the context window itself is a resource that needs managing per call, the code that manages it is orchestration whether you call it that or not.
Policy needs enforcement beyond the system prompt. Anything security-sensitive, compliance-relevant, or audit-required belongs in code, not in prose the model is asked nicely to respect. Prompt injection research (Greshake et al. 2023) has repeatedly shown that system-prompt instructions can be subverted by adversarial inputs. Hooks running server-side cannot.

Three conditions, conversely, that should make you resist building orchestration:

One-shot queries. A human asking a model one question does not need a hook system. It needs a good prompt.
Exploratory workloads. If you do not know what the correct output looks like yet, you cannot write hooks that check for it. Build the skill and hook layer after you have operational data on what goes wrong.
Creative generation. A writing assistant that occasionally produces an odd sentence does not benefit from a Stop hook that re-runs until the sentence is “good.” You will pay in cost and latency for no measurable quality gain, and you will make the system less transparent to its users.

The paper-summary task sits right on this boundary, and the boundary is where the interesting calls happen. One person pasting one paper into a chat window wants a good prompt and nothing else. A research institution running ten thousand paper summaries a day across a compliance-sensitive corpus, medical reviews, patent prior-art searches, legal briefs, needs all four investment conditions above. Same task. Different throughput. Different orchestration answer. The invest-or-not call depends on volume and risk, not on the task.

One trap to name out loud. Orchestration tooling is fun to build. Vendors ship shiny new primitives every quarter. Agent frameworks have generated more blog posts than production systems. If you are tempted to rebuild your harness because a new pattern landed, ask whether the pattern solves a problem you currently have. Most teams should spend more time measuring their failures and less time adopting new orchestration abstractions.

The test: can you write a unit test for the failure? If yes, a hook can catch it, and the investment is likely worth it. If no, you are looking at a prompting, retrieval, or model-choice problem. Skills and hooks are not a substitute for knowing what you want the model to do.

13.4 Skills versus MCP: the same substrate, different ends

When Anthropic shipped skills in October 2025, Simon Willison wrote that they might be “a bigger deal than MCP” the Model Context Protocol, an open standard for how LLM applications talk to external tools, covered in 12 (Willison 2025). The critique that sparked that claim is worth engaging directly.

His argument ran like this. An MCP server such as GitHub’s exposes dozens of tools. The schemas for those tools all get loaded into the model’s prompt, so the model knows which tools exist and what arguments they take. Each schema costs tokens. Across a full server, the total comes to tens of thousands of tokens, roughly a short book chapter’s worth of text, reloaded into the prompt on every turn whether the tools get used or not. A skill, by contrast, loads its own instructions on demand. The prompt does not pay a running tax for instructions that are not relevant right now. The token accounting favors skills in many common workflows.

Willison is right about the token accounting in that specific failure mode. MCP servers that expose dozens of tools do impose a real context-window tax, and some of the servers that shipped in 2025 imposed more tax than necessary because their authors had never audited their own schema sizes. But the “skills vs MCP” framing is mostly wrong, and reasoning about why teaches something useful.

Skills and MCP answer different questions. MCP (Anthropic 2024; Model Context Protocol Working Group 2025) answers: how does a host process expose capabilities (tools, resources, prompts) to a model in a standardized way, with authentication, discovery, and interop across vendors? Skills answer: how does the model know how to use capabilities well, in a way that the operator owns, can audit, and loads on demand?

The two complement each other. An MCP server gives you a GitHub tool. A skill tells the model when to use the GitHub tool, with what workflow, with what edge cases to handle. Remove the MCP server and the skill has nothing to call. Remove the skill and the model fumbles the tool.

Willison’s MCP critique generalizes to a real design principle. Tool surfaces carry a context-window cost proportional to their breadth, not their usage. Every tool in a connected MCP server’s schema lives in the prompt on every turn, whether any of them gets invoked or not. This is not inherent to MCP the protocol. It is an implementation choice in how the harness stages discovered tools. Harnesses can, and some already do, selectively load tool schemas per task, using skill-like relevance detection. When that becomes standard across MCP hosts, the token-accounting argument against MCP dissolves. When it does not, the argument stands.

Bring it back to the paper. A user pasting a paper and asking for bullets does not care about MCP’s schema tax. No MCP server was involved. A user who wants the summary to cite the paper’s GitHub repo, fetch the referenced preprint from arXiv, and file the finished bullets to Notion is paying the schema tax for all three servers on every turn of the summary. A well-written summary skill plus selective schema loading is the difference between a 4,000-token summary request and a 40,000-token one. Same task, same output, ten times the bill if the harness does the naive thing.

This is a case where being opinionated inline is warranted but the opinion might age fast. As of April 2026: MCP is the right integration protocol for tool exposure and discovery. Skills are the right abstraction for loading instructions on demand. Both persist. MCP is for integration (what capabilities exist, how to call them, how to authenticate). Skills are for instruction (when to use those capabilities and how). A mature system uses both. The pundit framing that makes them fight is a category error that will read as dated by the next printing. The argument is over which harness pattern handles their composition well, and that pattern will keep evolving. Do not invest heavily in today’s answer.

13.5 A vendor snapshot, and a warning

A quick survey of where three major vendors stood in April 2026, with the explicit caveat that this section has the shortest shelf life of anything in the book. Product names, file locations, and API shapes will all have drifted by the time you read this. Treat it as a reading of the moment, not a stable taxonomy.

Anthropic shipped the most composable stack: skills, hooks, slash commands, subagents, and MCP as the integration protocol. The design tilts toward auditability. Skills live as markdown files the operator owns and can inspect. Hooks run as shell commands the operator writes. The harness belongs to the operator’s program, not Anthropic’s black box.

OpenAI oriented toward consumer product surfaces. GPTs for end-user bundles in the GPT Store, the Assistants API for programmatic use, and the Agents SDK (2025) for building tool-using systems. Through most of 2025 OpenAI did not ship an MCP-compatible surface. By late 2025 and into 2026, OpenAI, Google, and Microsoft all adopted MCP in some form. The December 2025 donation of Agent Skills to the Linux Foundation, combined with OpenAI’s reported work on Skills export for GPTs (cross-reference Chapter 12), suggests convergence on skills as a cross-vendor format.

Google shipped Gems for consumer Gemini, Vertex AI Agents for programmatic use, and context caching as an architectural answer to the “how do we let the model remember a long document” problem (Google 2026). Google has moved slower on the skills and hooks axis and more aggressively on long-context strategies as an alternative to elaborate harness scaffolding. Whether the “just stuff it into context” bet pays off depends on how Gemini’s long-context behavior holds up against lost-in-the-middle effects (Liu et al. 2024), a live question Chapter 14 returns to.

The vendor divergence itself teaches something. Three different bets on where the orchestration complexity should live. Anthropic has invested in explicit operator-owned scaffolding. OpenAI in consumer-packaged bundles plus programmatic SDKs. Google in architectural primitives like context caching that reduce the need for operator scaffolding. None of these wins outright. Which one fits best depends on what you are building.

The paper-summary task picks a different winner in each vendor’s world. On Anthropic, you write a summarize-research-paper skill and a PostToolUse hook for the filing step. On OpenAI, you ship a GPT with the same workflow baked in and trust the store’s infrastructure for delivery. On Google, you lean on context caching to keep the paper resident across follow-up questions without re-uploading. Same task, three different shapes. None is wrong. Each shape is wrong in a different way once the vendor moves, which is why the mechanism-level understanding you are building here matters more than the vendor-level recipe any one of them shipped this quarter.

Also worth running the skeptic’s kit over any vendor claim in this space before you believe it. When a vendor announces that their new orchestration primitive “makes your agents smarter,” ask the Sagan-style questions out loud. Is the claim reproducible outside their demo environment? Does a simpler explanation fit, maybe that the demo prompt was tuned to the primitive? Who benefits from you believing it? (In almost all cases, the vendor shipping the primitive.) Agent frameworks have a higher ratio of blog posts to production deployments than almost anything else in software. Model the skepticism you want the reader to bring to the next vendor keynote.

Expect this snapshot to age fast. This is the section most likely to be materially wrong in eighteen months. Use it to understand the shape of the 2026 vendor landscape; do not use any specific product SKU or file path mentioned here as a long-term reference. The book’s online errata will track drift. Vendor docs will be further ahead than either.

13.6 Reliability is still the hard part

One last claim, because it ties the chapter back to the benchmark reality from Chapter 12 (Chapter 12). Dressing a stateless function in better scaffolding does not make the underlying reliability numbers go up. τ-bench (Yao et al. 2024) measured that even frontier models with mature tool-use harnesses succeed on fewer than half of real-world multi-turn retail tasks. Consistency across repeated runs (pass^k [the probability of a model solving the same task correctly k times in a row, which drops fast as k grows]) drops sharply with k. OSWorld (Xie et al. 2024) showed frontier computer-use agents at 12% where humans hit 72%. SWE-bench Verified (OpenAI 2024) exists because the first SWE-bench ran noisy enough that the headline numbers could not be trusted without a human-filtered subset.

What good orchestration changes is not the base reliability of a single model call. It changes two other things. First, the blast radius when the call goes wrong: hooks catch the bad outcome before it becomes a user-visible bug or a security incident. Second, the observability of where it went wrong: a harness that logs the prompt, the output, and the tool trace gives you data you can use to improve. The underlying per-call success rate still sits with the model.

The paper task makes the point concrete. Suppose your summary skill is well-tuned and your model gets a good summary right 95 times out of 100. Five times out of 100, the summary miscounts a figure, invents a citation, or confidently states the opposite of the paper’s actual conclusion. You cannot orchestrate that 5% down to 0%. You can, with hooks, catch the three of those five where the summary contradicts the paper’s abstract (an intrinsic hallucination a PostToolUse check can flag) or cites a page number that does not exist (same check, different rule) or names a reference the subagent pass never fetched (state check). You cannot catch the two of the five where the summary is simply subtly wrong in a way that reads fluent. The model’s ceiling stays where it was. The floor of the visible failures goes up. Good orchestration is floor-raising work, not ceiling-raising work.

This is why the orchestration layer matters. Not because it makes the model smarter (it does not), but because it makes a stochastic component safe enough to deploy, auditable enough to debug, and bounded enough to reason about. Everything about the model’s behavior stays probabilistic in ways that cannot be fully reasoned out. Everything about the harness, if you write it well, stays deterministic in ways that can.

We do not fully know how to close that probabilistic gap. The last five years of scaling have cut raw error rates on a lot of workloads. The failure distribution is lumpy in ways nobody has pinned down. A model handles nine thousand paper summaries well, then confidently miscounts figures on the nine-thousand-and-first, with no clear surface-level signal about why. Orchestration contains that surprise. It does not remove it.

“How does the system remember across sessions” is not a separate question from “how does the harness stage context,” it is the same question with a long clock. The Kalai (Kalai et al. 2025) structural framing of hallucination and the Brown (Brown et al. 2020) in-context-learning framing of model behavior both land the same diagnosis: continuity is faked by the harness, or it does not exist. Skills and hooks are how the harness fakes it for the short clock. Memory systems are how it fakes it for the long one.

Orchestration does not lift the model’s ceiling; it lowers the floor of the failures. If your product is a model plus a harness, the benchmark numbers measure the model. Your users experience the harness. Invest accordingly.

The chapter you just read is, by design, the one most likely to be out of date by the next printing. The principles hold. Skills stage instructions on demand. Hooks stage enforcement around calls. Subagents isolate context. Orchestration is code that decides what goes into the context window. The recipes do not hold. When a new abstraction lands, in 2026 or 2027 or beyond, ask which of these three questions it answers: what goes into the context, what is allowed to happen on the call, what do we do when something unexpected comes out. If it answers any of those, it belongs in this chapter’s family. If it claims to make the model “smarter” or “more agentic,” it is almost certainly rebranded context staging.

Read accordingly.

Anthropic. 2024. Introducing the Model Context Protocol. Anthropic News. https://www.anthropic.com/news/model-context-protocol.

Anthropic. 2025a. Equipping Agents for the Real World with Agent Skills. Anthropic Engineering blog. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills.

Anthropic. 2025b. Introducing Agent Skills. Anthropic News. https://www.anthropic.com/news/skills.

Anthropic. 2026a. Automate Workflows with Hooks (Claude Code). Claude Code documentation. https://code.claude.com/docs/en/hooks-guide.

Anthropic. 2026b. Slash Commands in the SDK. Claude API documentation. https://platform.claude.com/docs/en/agent-sdk/slash-commands.

Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165.

Google. 2026. Context Caching (Gemini API). Google AI for Developers documentation. https://ai.google.dev/gemini-api/docs/caching.

Greshake, Kai, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. “Not What You’ve Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23). https://arxiv.org/abs/2302.12173.

Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Ying Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.

Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.

Model Context Protocol Working Group. 2025. Model Context Protocol Specification, Version 2025-11-25. Modelcontextprotocol.io. https://modelcontextprotocol.io/specification/2025-11-25.

OpenAI. 2024. Introducing SWE-Bench Verified. OpenAI research. https://openai.com/index/introducing-swe-bench-verified/.

Schick, Timo, Jane Dwivedi-Yu, Roberto Dessı̀, et al. 2023. “Toolformer: Language Models Can Teach Themselves to Use Tools.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2302.04761.

Willison, Simon. 2025. Claude Skills Are Awesome, Maybe a Bigger Deal Than MCP. Simon Willison’s Weblog. https://simonwillison.net/2025/Oct/16/claude-skills/.

Xie, Tianbao, Danyang Zhang, Jixuan Chen, et al. 2024. “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.” Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track. https://arxiv.org/abs/2404.07972.

Yao, Shunyu, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. \(\tau\)-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045. https://arxiv.org/abs/2406.12045.

Yao, Shunyu, Jeffrey Zhao, Dian Yu, et al. 2023. “ReAct: Synergizing Reasoning and Acting in Language Models.” International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.03629.