Structured Outputs | Working with Large Language Models | Synoros

The moment another program has to read the model’s answer, prose stops being the right output format.

Here is the scene. You have built a pipeline that does something useful with the paper-summary task we have been carrying since Section 1.1. A forty-page paper lands in your inbox every morning. You pass it to Claude. Claude writes you a paragraph. That paragraph lists the five headline findings in well-formed English. For months, this is fine. You read it over coffee and move on.

Then, one day, you want more. You want to file those findings into a database so you can search your reading list later. You want to route any paper whose findings include the word “trial” to a follow-up queue. You want a weekly email that compares the methodology tags of everything you read this week.

You try to write the code that does this. You open the paragraph. You look at the sentences. You realise the words are not in fields. They are not in any shape a database can hold. The paragraph reads like this: “The paper presents five findings. First, the authors show that…” and the actual findings are buried in subordinate clauses, in different orders, sometimes split across two sentences, occasionally mixed with the authors’ own editorial commentary. You will not be able to pull them out without regex [a pattern-matching rule used to pull fields out of text], and regex on LLM prose is a job nobody wants to own.

The fix is to stop asking Claude for a paragraph. Ask for a typed object [a piece of data with a declared shape, like a form with named fields] instead: a headline string, a paper ID string, a confidence number, a methodology tag from a short list, a one-line quote. Five fields, each with a name, each with a type. The database can hold that. The routing rule can read that. The weekly email can render that. The paragraph is gone, and every downstream piece of your pipeline stops fighting you.

The mechanics of asking for a typed object work differently from the mechanics of asking for prose. The failure modes are different too. And the literature is genuinely split over how much of the work should happen in the model versus in the parser on the other end.

One more thing before we go. Structured outputs are the mechanism underneath tool calls, which Part III spends a whole chapter on (Chapter 12). A tool call is a structured output with a function name attached: instead of the model writing English, it writes {"function": "search_web", "arguments": {"query": "..."}}. The reasoning that makes a five-field paper digest reliable is the same reasoning that makes an agent reliable.

8.1 Why prose output past a toy is almost always wrong

The demo version of any LLM pipeline looks like this. User asks a question. Model produces a paragraph. A human reads the paragraph. Great. The production version almost never looks like that. The paragraph feeds another stage of the system, and the other stage expects fields.

Picture the paper-summary pipeline once it grows past one reader. You paste the paper in, you get back a paragraph, you want a database row. So you write a parser. The parser reads the paragraph and tries to pull out the five findings. For the first ten papers it works. For paper eleven the model started its response with “Sure, happy to help! Here are the five findings from this paper:” and your parser choked on the preamble. For paper twelve the model wrapped the whole response in a markdown code fence, and your parser choked on the backticks. For paper thirteen the model helpfully included a sixth finding it thought was interesting, and your parser dropped it. For paper twenty the model explained its reasoning in three paragraphs before listing the findings, and your parser gave up.

Every one of these is the same kind of problem. The model produced a plausible sequence of tokens. The sequence happened to include material the parser did not expect. The parser cannot tell the difference between “the model said something new” and “the model said the same thing in slightly different words.” It just fails, at 2am, on the hundredth paper, while you are asleep.
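To make the brittleness concrete, here is a minimal sketch of the kind of parser described above. The function name `parse_findings`, the numbered-line format, and the sample responses are all hypothetical, chosen only to reproduce the preamble, code-fence, and extra-finding failures. This version raises on a sixth finding rather than silently dropping it.

```python
import re

# Hypothetical naive parser: expects exactly five numbered lines ("1. ...")
# and nothing else in the response.
FINDING_LINE = re.compile(r"^(\d)\. (.+)$")

def parse_findings(response: str) -> list[str]:
    findings = []
    for line in response.strip().splitlines():
        match = FINDING_LINE.match(line)
        if match is None:
            raise ValueError(f"unexpected line: {line!r}")
        findings.append(match.group(2))
    if len(findings) != 5:
        raise ValueError(f"expected 5 findings, got {len(findings)}")
    return findings

def try_parse(response: str) -> str:
    try:
        parse_findings(response)
        return "ok"
    except ValueError as err:
        return f"failed: {err}"

fence = "`" * 3  # a markdown code fence, built indirectly
clean = "\n".join(f"{i}. Finding number {i}" for i in range(1, 6))

results = {
    "clean": try_parse(clean),
    "preamble": try_parse("Sure, happy to help! Here are the findings:\n" + clean),
    "fence": try_parse(fence + "\n" + clean + "\n" + fence),
    "sixth": try_parse(clean + "\n6. A bonus finding"),
}
print(results)
```

Only the first response parses. The other three carry the same five findings, but the parser has no way to know that.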

The industry response, for years, was to make the parser smarter. Write better regexes. Handle more cases. This works for a while. It collapses around the hundredth example. Jason Liu, who wrote the Instructor library we come back to later in this chapter, puts a number on this from years of production consulting: roughly one call in twenty produces output a naive parser cannot handle (Liu 2026). One in twenty sounds survivable until you multiply. A pipeline making ten thousand calls a day is breaking five hundred times a day. That is a person’s entire job, gone to cleaning up parser failures.

OpenAI’s own pre-structured-outputs measurement puts it in another light. They asked GPT-4 to return JSON matching a specific shape, using nothing but prompt instructions to do it. They measured how often the output matched the shape. The answer was around 35% (OpenAI 2024). Out of every three calls, roughly two produced JSON that did not match the requested shape. Call this “schema compliance” [how often the output’s shape matches the shape you asked for]. Whatever the exact number is on your workload, “prose that we try to parse” is not a reliable interface between two pieces of software.

If another program reads the output, the output should be typed. Prose is the right answer when the reader is a human. The moment the reader is a function, switch to a schema. This is not a preference. It is the difference between a pipeline that needs a daily babysitter and one that does not.

The shift changes what you can debug, too. When a parse fails on a typed output, the error message tells you which field was wrong: field 'confidence' expected float, got string. You know exactly what happened. You can feed that error back to the model (“your output had this error, try again”) and get a corrected response. More on the retry loop at the end of the chapter. When a parse fails on prose, the error tells you only that your regex did not match. The regex did not match because the model said something new. Or because the model said the same thing in slightly different words. Or because the model had a good day and expressed itself more fluently than usual. Good luck distinguishing those from a log line at 2am.
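A small sketch of what a field-level error looks like in practice, using only the standard library. The `EXPECTED_FIELDS` table and the `validate` helper are hypothetical stand-ins for what a validation library would generate from a typed model.

```python
import json

# Hypothetical expected shape: field name -> required Python type.
# A fuller validator would also accept integers for "confidence".
EXPECTED_FIELDS = {"headline": str, "paper_id": str, "confidence": float}

def validate(raw: str) -> list[str]:
    """Return field-level error messages; an empty list means valid."""
    data = json.loads(raw)
    errors = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in data:
            errors.append(f"field '{name}' is missing")
        elif not isinstance(data[name], expected):
            got = type(data[name]).__name__
            errors.append(
                f"field '{name}' expected {expected.__name__}, got {got}"
            )
    return errors

bad = '{"headline": "A new method", "paper_id": "abc", "confidence": "high"}'
errors = validate(bad)

# The message names the field, so it can be logged or sent back to the model.
retry_message = "Your output had these errors, try again: " + "; ".join(errors)
print(retry_message)
```

The error string identifies the field and the type mismatch, which is exactly the information a regex miss cannot give you.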

The other reason to care about structured outputs early: they are the foundation of tool calls. When a model invokes a function (a weather lookup, a database query, a code execution tool), the thing on the wire between the model and the tool runner is a JSON object with a function name and an argument object. The agent chapters in Part III (Chapter 12, Chapter 13) build on this primitive constantly. A team that cannot get the paper-digest JSON working reliably will not get agents working reliably. The structural skill transfers.
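As a rough illustration of "a JSON object with a function name and an argument object", the sketch below parses one such object and dispatches it. The wire format follows the `{"function": ..., "arguments": ...}` shape shown earlier in this chapter. Real vendor formats differ in their details, and `search_web` here is a canned stand-in, not a real tool.

```python
import json

def search_web(query: str) -> str:
    # Stand-in for a real tool; returns a canned string for illustration.
    return f"results for {query!r}"

# Registry of callable tools, keyed by the name the model is allowed to emit.
TOOLS = {"search_web": search_web}

def run_tool_call(raw: str) -> str:
    """Parse a tool call off the wire and invoke the matching function."""
    call = json.loads(raw)
    name = call["function"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

wire_message = (
    '{"function": "search_web", "arguments": {"query": "grammar decoding"}}'
)
tool_result = run_tool_call(wire_message)
print(tool_result)
```

Everything the tool runner does depends on the object parsing and on the argument names matching the function signature, which is why the reliability of the structured output bounds the reliability of the agent.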

8.2 The reliability/capability trade-off

Here is where the literature splits in a way the rest of this chapter has to handle with care.

On one side of the split: if you constrain the model to produce valid JSON matching a specific schema [a declaration of which fields the output must have and what types they must hold], the output is, by construction, parseable. You can tie that schema to a Pydantic model (a Python library for declaring typed data shapes and validating that values match them), feed parse errors back as retries, and end up with a pipeline that emits typed objects reliably. The Instructor community, the OpenAI and Anthropic structured-output tools, and most production agent frameworks are built on this premise.

On the other side: constrained decoding costs the model something. When the sampler [the part of the generation loop that picks one token from the model’s probability distribution] is restricted to tokens that would keep the output a valid JSON document matching your schema, the model cannot freely reason in natural language, because natural language would break the schema. Tam and colleagues measured this directly on reasoning tasks, and found accuracy drops of 10 to 15% on math, symbolic manipulation, and complex analysis when the model was forced to emit JSON from the first token onward (Tam et al. 2024). The finding was startling enough that the paper has become the standard citation for “constrained decoding is not free.”

Both camps are right. The synthesis lives in where, inside the schema, you place the reasoning.

8.2.1 What constrained decoding does under the hood

Before the synthesis, the mechanism, because the mechanism tells you why schema ordering matters later.

Recall from Section 3.1 that the model outputs a probability distribution over its full vocabulary at every step, and the sampler picks one token from that distribution. Constrained decoding slips a filter between the distribution and the sampler. At each step, the filter consults a grammar [a rulebook for what sequences of tokens count as valid, expressed as a finite-state machine, a context-free grammar, or a JSON schema compiled into one of those] and masks every token that would make the output unparseable. The sampler then picks from the remaining, still-valid tokens.

Think of the paper-summary task under this regime. The model wants to produce:

{"headline": "A new method for...", "paper_id": "arxiv:2024.01234", ...}

At the first token step, the filter says: “The only legal opening character for a JSON object is {. Mask everything else.” The model’s probability distribution over its full 100,000-token vocabulary gets pruned down to tokens that begin with {. The sampler picks one. At the second step, the filter says: “We are inside a JSON object; the only legal next move is a key, which starts with ".” The filter prunes the distribution again. And so on, token by token, the grammar shepherding the output into a shape that the parser on the other end will accept.

The Outlines library (Willard and Louf 2023) pioneered a clever implementation: pre-compute, for every state in the grammar, which vocabulary tokens are legal next. At generation time, checking “is this token allowed?” becomes a single table lookup. XGrammar (Dong et al. 2024) pushed this further with three-and-a-half to ten times speedups on JSON workloads. In plain terms: a structured extraction that used to take ten seconds now finishes in roughly one to three. XGrammar is now the default constrained-decoding backend in vLLM and several other serving engines.
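The filter-plus-lookup idea can be sketched with a toy vocabulary and a hand-written state machine. Nothing here is the Outlines or XGrammar implementation. The vocabulary, the `GRAMMAR` table, and the `fake_logits` stand-in for a model are all invented for illustration, and real engines work over sub-word tokens and compile the table from a schema.

```python
import json
import math

VOCAB = ["Sure", "!", "{", '"difficulty"', ":", "1", "2", "3", "4", "5", "}"]

# Tiny grammar for {"difficulty": <1-5>}: state -> {legal token: next state}.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"difficulty"': "colon"},
    "colon": {":": "value"},
    "value": {digit: "close" for digit in "12345"},
    "close": {"}": "done"},
}

# Pre-computed once: for every state, the set of legal token ids.
ALLOWED = {
    state: {VOCAB.index(token) for token in moves}
    for state, moves in GRAMMAR.items()
}

def fake_logits() -> list[float]:
    """Stand-in for a model that most wants to say "Sure", then "4"."""
    logits = [0.0] * len(VOCAB)
    logits[VOCAB.index("Sure")] = 5.0
    logits[VOCAB.index("4")] = 2.0
    return logits

def constrained_greedy_decode() -> str:
    state, out = "start", []
    while state != "done":
        logits = fake_logits()
        # Mask: illegal tokens get -inf so they can never be picked.
        masked = [
            score if i in ALLOWED[state] else -math.inf
            for i, score in enumerate(logits)
        ]
        token = VOCAB[masked.index(max(masked))]
        out.append(token)
        state = GRAMMAR[state][token]
    return "".join(out)

decoded = constrained_greedy_decode()
parsed = json.loads(decoded)
print(decoded, parsed)
```

The stand-in model's favourite token is "Sure" at every step, yet the output still parses, because that token is never in the allowed set.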

What the filter does not do, and cannot do, is change the model’s underlying probability landscape. If the model “wanted” to emit some English sentence before producing the JSON (“Certainly! Here is the paper digest:”), the filter masks that sentence out. The tokens that survive the mask are the ones that would have formed the JSON, but their relative probabilities are now distorted. Park and colleagues make this precise: masking high-probability tokens that violate the grammar amplifies the relative differences among the remaining tokens, producing syntactically correct but semantically less natural outputs (Park et al. 2024). Their grammar-aligned decoding work proposes adjustments that preserve the conditional distribution better, at some runtime cost.
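One way to see the distortion is a two-token toy model, sketched below. It is an illustration of the general point, not a reproduction of Park and colleagues' method. The probabilities and the two-sequence "grammar" are invented. The comparison is between step-by-step masking and the model's own distribution conditioned on the output being valid.

```python
# Toy two-step "model": P(first token) and P(second token | first token).
P_FIRST = {"a": 0.9, "b": 0.1}
P_SECOND = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.1, "b": 0.9}}

# Toy grammar: only these complete sequences are valid.
VALID = {"aa", "bb"}

def model_prob(seq: str) -> float:
    return P_FIRST[seq[0]] * P_SECOND[seq[0]][seq[1]]

# Target: the model's own distribution, conditioned on validity.
total_valid = sum(model_prob(seq) for seq in VALID)
conditional = {seq: model_prob(seq) / total_valid for seq in sorted(VALID)}

def masked_step(dist: dict[str, float], legal: set[str]) -> dict[str, float]:
    """Token-level masking: drop illegal tokens, renormalize the rest."""
    kept = {tok: p for tok, p in dist.items() if tok in legal}
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}

# Step-by-step masking, the way a constrained decoder applies it.
masked = {}
first = masked_step(P_FIRST, {seq[0] for seq in VALID})
for tok1, p1 in first.items():
    legal_second = {seq[1] for seq in VALID if seq[0] == tok1}
    second = masked_step(P_SECOND[tok1], legal_second)
    for tok2, p2 in second.items():
        masked[tok1 + tok2] = p1 * p2

print("conditional:", conditional)
print("masked:     ", masked)
```

Conditioned on validity, the toy model is indifferent between the two sequences. Step-by-step masking instead commits to "a" nine times out of ten, because the first step cannot see that "a" leads to a continuation the model rarely wanted.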

The operator takeaway. Constrained decoding guarantees syntactic validity, not semantic quality. The model is still “thinking” in tokens, but now some of the tokens it would have wanted to produce are unavailable. For pure structured extraction (pulling five findings out of a paper, or five fields out of an invoice), this is a fine trade, because the model does not need to reason to find the fields. For reasoning-heavy tasks (“solve this math problem and return the answer as JSON”), it matters, and you have to compensate.

8.2.2 Schema ordering: the fix that survives both camps

The synthesis between Tam’s warning and the Instructor-community default is simple. Put thinking fields in the schema before answer fields.

Here is what that means on the paper-summary task. Say you want the model to give you a difficulty rating from 1 to 5 for the paper, based on how technical the argument is. Schema A looks like this:

{"difficulty": 4}

Schema B looks like this:

{"reasoning": "The paper uses measure theory in the main proof and assumes familiarity with topology. Most readers would need background in both to follow the argument step by step.", "difficulty": 4}

The same difficulty: 4 comes out either way. What changes is everything the model wrote before the 4.

Under schema A, the model is forced to commit to a number almost immediately. The first legal token after { is the opening quote of a key. The next is the d of difficulty. Then ifficulty, ":, and the model has to pick a number. There is no room anywhere in this generation for the model to think about how hard the paper is. It has to guess from whatever its first-glance impression of the input happened to be. This is the Tam failure mode. The model writes the answer before it writes the reasoning, because the schema did not leave room for the reasoning.

Under schema B, the model sees a different grammar. The first key is reasoning, which takes a string. Strings in JSON are almost unconstrained: any legal Unicode text, up to the closing quote. The filter lets the model write whatever English sentences it wants inside the string. The model can talk to itself about the paper for two hundred words, notice the measure theory, notice the topology, work out which kind of reader would struggle, and only then does it reach the difficulty key and commit to a number. By the time it writes 4, the 4 is grounded in the reasoning that preceded it.

This is the same principle chain-of-thought prompting has been exploiting since 2022 (Wei et al. 2022), applied one layer deeper. Chain-of-thought works by giving the model room to generate intermediate tokens before committing to the final answer. Schema ordering extends the move into constrained generation: even under a strict JSON grammar, you can carve out a field where the model’s distribution is effectively unconstrained, and the model will use it to think in.

There is no separate “planning module” anywhere in this. The model is still running the same next-token autocomplete loop from Section 1.1. The only difference is that putting the reasoning field first in the schema means the model writes the reasoning first, so the reasoning is present in the context when the model later has to commit to the answer field. Chain-of-thought is autocomplete writing its own scratch paper. Schema-ordered reasoning is autocomplete writing scratch paper inside a form whose fields it will fill in afterward.

Schema ordering is the single highest-leverage move in structured-output design. Put reasoning fields before answer fields. Put context fields before decision fields. Put rationale fields before confidence scores. The model writes left to right, and anything it puts in an earlier field is available to lean on when it writes later fields. Get this wrong and you pay the 10 to 15% reasoning tax Tam measured. Get it right and the tax mostly disappears.
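The two difficulty schemas can be written out as JSON Schema dictionaries, with a small lint check that flags the answer-first version. The helper `reasoning_comes_first` and its list of "thinking" field names are hypothetical. The sketch also assumes the serving engine emits keys in the schema's declared property order, which is worth confirming for your vendor.

```python
# Python dicts keep insertion order, so "properties" order below is the
# order in which the fields are declared.
SCHEMA_A = {
    "type": "object",
    "properties": {
        "difficulty": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["difficulty"],
}

SCHEMA_B = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "difficulty": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["reasoning", "difficulty"],
}

def reasoning_comes_first(
    schema: dict, thinking_fields: tuple[str, ...] = ("reasoning", "rationale")
) -> bool:
    """True if the first declared property is one of the thinking fields."""
    names = list(schema["properties"])
    return bool(names) and names[0] in thinking_fields

print(reasoning_comes_first(SCHEMA_A), reasoning_comes_first(SCHEMA_B))
```

A check like this is cheap to run in a test suite, so an accidental reordering of fields shows up before it reaches production.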

8.2.3 The pluralist reading

Tam’s paper gets cited sometimes as if it disproves structured outputs. It does not. What it disproves is naive structured outputs, the kind where the schema is written without thinking about generation order. The Instructor-community doctrine of “always use structured outputs” is right at the interface level (typed objects are the right contract between pieces of software) but needs to be paired with schema-ordering discipline at the prompt-design level to avoid the measured reasoning loss. Both communities have useful information. Neither is complete on its own.

The honest picture in April 2026. Constrained decoding is a serving-layer capability that removes one failure mode (invalid output) and introduces a different one (reasoning capacity compressed by token ordering). You mitigate the second failure mode by schema design. The net trade is almost always favourable for production workloads. It is sometimes unfavourable for isolated hard-reasoning tasks where the reasoning-first mitigation cannot fully recover the lost capability. Know which regime your task sits in before defaulting either way.

8.3 JSON mode, schemas, and the vendor landscape

Three sliding scales of rigor exist in the vendor landscape today, and it helps to keep them straight. Imagine you are asking Claude, or GPT, or Gemini for your paper digest. Each scale is a different kind of contract the vendor gives you about what will come back.

The loosest scale is JSON mode. The vendor guarantees that the response will be valid JSON. The string will parse. The braces will be balanced. The escape sequences will be legal. What the vendor does not guarantee is which keys will be in the object or what types their values will have. So you ask for your paper digest, and the model comes back with {"title": "...", "summary": "..."} instead of the {"headline": "...", "paper_id": "...", ...} you wanted. JSON mode is useful when you need “something parseable” and you will validate the exact shape yourself. It is not enough when the shape matters, because a valid JSON object can have any keys at all. OpenAI shipped this capability in late 2023 as a response_format: json_object parameter.
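The gap between "parses" and "has the right shape" is easy to show. In this sketch the `REQUIRED_KEYS` set is a hypothetical subset of the digest fields, and the reply is the wrong-keys example from the paragraph above.

```python
import json

REQUIRED_KEYS = {"headline", "paper_id"}  # hypothetical subset of the digest

def missing_keys(raw: str) -> set[str]:
    """Parse the JSON, then report which required keys are absent."""
    data = json.loads(raw)            # JSON mode makes this step safe...
    return REQUIRED_KEYS - set(data)  # ...but says nothing about this one.

json_mode_reply = '{"title": "A new method", "summary": "It works."}'
missing = missing_keys(json_mode_reply)
print(sorted(missing))
```

The reply is valid JSON and still useless to a consumer that reads `headline` and `paper_id`.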

The middle scale is schema-guided prompting without constrained decoding. You write out your JSON schema, you paste it into the prompt, and the model does its best to honor the shape. Your code on the receive side validates. If the validation fails, you either retry or surface the error to the user. This is what Instructor did before vendors built in native schema support. It would generate a schema from your Pydantic model, attach the schema to the prompt, and retry on validation failure. Accuracy in this regime hinges on the model and the schema. OpenAI’s own measurement put compliance around 35% for moderately complex schemas through this path (OpenAI 2024). Two out of every three calls produced something that did not match. Good enough for many tasks. Not good enough when a pipeline depends on every call.

The strictest scale is constrained decoding against a schema. The serving engine compiles your schema into a grammar and masks logits [the raw scores the model assigns each possible next token before they are turned into probabilities] at every step so that only schema-valid tokens survive. OpenAI shipped response_format: json_schema with strict: true in August 2024 (OpenAI 2024), reporting 100% schema compliance on their own benchmarks, against the 35% prompt-only baseline. Anthropic’s tool-use API gives you the same guarantee through its function calling (Anthropic 2026). Gemini shipped equivalents through 2025 (Google 2026). On open-weight local models, you get this capability from the serving engine: vLLM with XGrammar, llama.cpp with GBNF grammars [a text format for describing allowed token sequences], Outlines with Hugging Face Transformers. The model itself does not know or care that constraints are being applied.

A brief Baloney Detection pause on that 100% number. OpenAI measured compliance on their own benchmarks, using schemas they chose, on tasks they chose. That is the “arguments from authority” case from Sagan’s kit: the claim is from a party that benefits from the claim being true. Is it reproducible? Independent evaluations, including the JSONSchemaBench work below, find that compliance is high but degrades with schema complexity in ways the marketing number obscures. Does another explanation fit the data better? Yes: “compliance is near-perfect on shallow schemas and falls off as schemas get deep” would cover the observed data better than “100% compliance.” Who benefits from the rounder headline? OpenAI does. This is not a reason to dismiss the feature. The feature is real and useful. It is a reason to read the number as an upper bound rather than a guarantee.

Several practical notes on this landscape are worth keeping in view.

Schema compliance is solved at the syntactic level. When vendors say “100% compliance,” they mean the output will match the schema’s shape. They do not mean the values will be semantically correct. The model can still hallucinate field values. It can return a plausible but wrong enum [a closed list of allowed values for a field, like "status": "open" or "status": "closed"] choice. It can fabricate a paper ID that looks like a real arXiv ID but does not correspond to any paper that has ever existed. Constrained decoding is a schema-correctness tool, not a truth tool. This is worth repeating: production incidents on structured outputs rarely come from malformed JSON anymore. They come from confidently-wrong-but-valid JSON. The paper-digest pipeline that returned a real-looking paper_id field that was entirely fabricated is a true story, and its failure cost more than any malformed-JSON error ever did, because nobody caught it for a week.

Deeply nested schemas amplify every problem. JSONSchemaBench, a 2025 benchmark of ten thousand real-world schemas across six frameworks (Geng et al. 2025) (a pile of form templates roughly the size of the paperwork a small city processes in a year), found that performance degrades with schema depth across the board. Constrained decoding slows down, because the grammar gains more states to track. Prompt-only compliance drops, because the model loses track of the structure. Error modes multiply, because now there are more places for the output to go wrong. This is a schema-design signal, not a vendor-choice signal. If your schema is five levels deep, flatten it. You almost always can. Instead of nesting methodology.study_type.sample_size inside your PaperDigest, promote each to a top-level field: methodology_study_type and methodology_sample_size. The reader of the JSON does not lose any information. The model gains a much easier target to hit.
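The flattening move is mechanical enough to sketch. The helper below applies it to a nested record using the same underscore naming as `methodology_study_type`. The same renaming applies to the field names in a schema. The function name and the sample record are hypothetical.

```python
def flatten(nested: dict, prefix: str = "") -> dict:
    """Promote nested keys to top-level keys joined with underscores."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

nested_digest = {
    "headline": "A new method",
    "methodology": {"study_type": "RCT", "sample_size": 120},
}
flat_digest = flatten(nested_digest)
print(flat_digest)
```

No information is lost: every leaf value survives, and the path to it is preserved in the field name.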

Open-weight models need a harness. A common mistake is to evaluate Qwen or Llama against GPT or Claude on structured-output tasks and conclude that the closed model is “better at JSON.” Often what you are measuring is that OpenAI and Anthropic have a constrained-decoding serving path and your local evaluation did not. Put the open model behind vLLM with XGrammar, or llama.cpp with a GBNF grammar, and the compliance gap closes sharply. The capability lives in the serving engine, not the weights. When you run the paper-digest task on a local Qwen 3.5 9B with XGrammar in the loop, the JSON comes out as clean as Claude’s.

Flat schemas, constrained decoding, reasoning fields first. That is the 95% answer for production structured outputs, the paper-digest task included. Deeply nested schemas are a design smell; fix them at the schema layer, not with more prompt-engineering. Constrained decoding is cheap to turn on; leave it on for anything downstream of parsing. Reasoning fields are free insurance against the Tam capability loss; include them whenever the task involves thought.

8.4 The Instructor pattern and the Pydantic bridge

The library that shaped how most people write this code today is Instructor (Liu 2026), originally written by Jason Liu. OpenAI named it among the open-source libraries that inspired their structured-outputs API release (OpenAI 2024). The pattern matters more than the library. Understanding the pattern is what transfers when the library gets renamed next year.

The pattern has three moves, in order.

First: declare the output shape as a Pydantic model. Pydantic is a Python library for validating typed data. You write a class where each field has a type, and Pydantic generates a validator that raises an error if an input does not match. Instructor uses these classes as the definition of the output you want. The Pydantic class is the single source of truth: it describes the shape, the types, the allowed values, and (through field descriptions) the meaning of each field.

Second: derive the JSON schema from the Pydantic model and attach it to the LLM call. Pydantic can serialise any of its models to a JSON Schema automatically. Instructor grabs that schema, hands it to the vendor’s structured-output API (OpenAI strict mode, Anthropic tool use, Gemini structured output, or a local serving engine), and sends the request. You never write the JSON schema by hand. You write the Python class.

Third: parse the response back into the Pydantic model, and retry on validation failure. When the model’s response comes back, Instructor parses it into your Pydantic class. If validation fails (wrong type, missing required field, enum value not in the allowed list), Instructor feeds the validation error back to the model as a follow-up message: “your output had this error, try again.” The model produces a corrected output, which gets re-validated. This retry loop is the hidden hero of the pattern. It is what turns a 95%-reliable single-shot into a 99.9%-reliable three-shot.

The code reads almost exactly like the prose. Here is the paper-digest task, which has been running through the book since Section 1.1, written in Instructor:

from pydantic import BaseModel, Field
import instructor
from anthropic import Anthropic

class PaperDigest(BaseModel):
    # Declared first so the model writes its reasoning before any answer field.
    reasoning: str = Field(
        description="Think step by step about the paper's argument. "
                    "What are the five most important things it says? "
                    "Quote exact passages where possible."
    )
    headline: str
    paper_id: str
    five_findings: list[str]
    methodology_type: str  # e.g. "RCT", "observational", "theoretical"
    would_recommend: bool

def add_to_reading_list(paper_id: str) -> None:
    # Placeholder: swap in your own storage call.
    print(f"queued {paper_id}")

paper_text = "..."  # the full text of the paper goes here

# Anthropic() reads the API key from the ANTHROPIC_API_KEY environment variable.
client = instructor.from_anthropic(Anthropic())

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,  # the Anthropic Messages API requires an output cap
    response_model=PaperDigest,
    messages=[
        {"role": "user", "content": paper_text},
    ],
    max_retries=3,
)

# result is a PaperDigest, not a dict. Typed access works.
print(result.headline)
print(result.five_findings[0])
if result.would_recommend:
    add_to_reading_list(result.paper_id)

Three things about this snippet do the load-bearing work.

The reasoning field comes first in the class body, and Pydantic honors declared field order when it generates the schema. Which means the reasoning field comes first in the emitted JSON, too. Which means the model writes its reasoning before it writes the answer fields, exactly as the schema-ordering section above argued it should. The field’s description (“Think step by step about the paper’s argument…”) joins the schema, which joins the prompt the model sees. The model is literally being told to do chain-of-thought, inside the structured output, in the string field that the grammar leaves free.

The response_model parameter replaces the vendor-specific structured-output plumbing. Instructor papers over the differences between OpenAI’s json_schema mode, Anthropic’s tool use, and Gemini’s structured output, so you can write the same code against all three. If you move providers later, your call site does not change.

The max_retries parameter wires in the retry loop. On a validation failure, Instructor sends the error message back to the model, which produces a corrected output, which gets re-validated. Three tries with error feedback recover more bad outputs than ten tries without it, because each try is informed by what went wrong the last time.
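For readers who want to see the loop itself, here is a minimal standard-library sketch of validate-and-retry with the error fed back. It is not Instructor's internal implementation. `call_with_retries`, `validate_digest`, and the `fake_model` stand-in are hypothetical names, and a real loop would call a vendor API where the fake model is.

```python
import json

def validate_digest(raw: str) -> dict:
    """Raise ValueError with a field-level message if the shape is wrong."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("would_recommend"), bool):
        raise ValueError("field 'would_recommend' expected bool")
    return data

def call_with_retries(call_model, prompt: str, max_retries: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        raw = call_model(messages)
        try:
            return validate_digest(raw)
        except ValueError as err:
            # Show the model its own output plus the specific error.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your output had this error, try again: {err}",
            })
    raise RuntimeError(f"still invalid after {max_retries} retries")

def fake_model(messages: list[dict]) -> str:
    """Stand-in model: wrong at first, corrected once it sees an error."""
    saw_error = any("had this error" in m["content"] for m in messages)
    if saw_error:
        return '{"would_recommend": true}'
    return '{"would_recommend": "yes"}'

digest = call_with_retries(fake_model, "Summarise this paper: ...")
print(digest)
```

The second attempt succeeds only because the conversation now contains the validator's complaint. A blind resend of the original prompt would give the stand-in model no reason to change its answer.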

Similar libraries exist in every ecosystem: Zod for TypeScript (with instructor-js wrapping it), Outlines and Marvin as Python alternatives, Pydantic-AI from the Pydantic team itself, and serde-based crates in Rust. The library is not the point. The point is the shape: typed model → derived schema → LLM call → validated typed object, with retries on failure. Any library that implements that loop is interchangeable. Anything that does not implement it is missing the part that makes the pattern production-ready.

The pattern, not the library. Instructor codifies typed input → schema → LLM → validated typed output. That loop is what turns structured outputs from a parsing hack into a type-safe interface. Pick any library in your language that implements it. Resist the temptation to hand-roll the retry logic, because the retry logic is where the pipeline earns its keep.

One note on Instructor in particular: the library lives in Python and ships under an MIT licence. Its schema-ordering ergonomics (Pydantic field order drives JSON key order) are incidentally perfect for the reasoning-first discipline this chapter has been arguing for. That is not why the library became popular. It is one of the reasons it works well in production.
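The field-order behaviour is easy to check directly. This sketch assumes Pydantic v2 (for `model_json_schema` and `model_dump_json`). The `Rating` class is a hypothetical model for the difficulty example, not part of Instructor.

```python
from pydantic import BaseModel

class Rating(BaseModel):
    # Declared in the order the model should write them.
    reasoning: str
    difficulty: int

# The derived JSON Schema lists properties in declaration order.
schema = Rating.model_json_schema()
property_order = list(schema["properties"])

# Serialized instances follow the same order.
instance = Rating(reasoning="Uses measure theory.", difficulty=4)
serialized = instance.model_dump_json()

print(property_order)
print(serialized)
```

Swapping the two lines in the class body swaps the order in both the schema and the serialized output, which is the whole mechanism behind declaring reasoning fields first.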

8.5 Failure modes and operator discipline

Structured outputs fail in specific, learnable ways. Knowing the failure modes in advance is the difference between a pipeline that degrades gracefully and one that breaks silently in a way nobody notices for a week. The paper-digest pipeline we have been building will meet every one of these modes eventually. Here they are, in rough order of frequency.

The model confabulates values inside a valid schema. The output parses cleanly. Every type is correct. Every enum is legal. The paper_id field reads "arxiv:2024.01234". The actual paper ID is "arxiv:2405.12345". Constrained decoding does nothing to catch this, because the output is syntactically perfect. The arxiv:2024.01234 string matches the shape of an arXiv ID. It just does not correspond to any paper that has ever existed. The model made it up. Your mitigation for this kind of failure lives downstream: ground the values in retrieval (Chapter 10), cross-check against another source of truth (look up the paper ID against the actual arXiv API), or escalate to a second LLM call with a stricter prompt. The error log will not help, because there was no error. This is the structural sibling of the hallucination failure mode from Section 1.4: the values read right, right up until you check them against reality.
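A downstream cross-check can be as small as the sketch below. The `KNOWN_PAPER_IDS` set is a hypothetical local stand-in for a real source of truth such as the arXiv API or your own database. The verbatim-substring test is one cheap grounding heuristic among many, not a complete defence.

```python
# Hypothetical source of truth: IDs of papers actually present in the inbox.
KNOWN_PAPER_IDS = {"arxiv:2405.12345"}

def check_paper_id(digest: dict, source_text: str) -> list[str]:
    """Return reasons to distrust digest['paper_id']; empty means it passed."""
    problems = []
    paper_id = digest["paper_id"]
    if paper_id not in KNOWN_PAPER_IDS:
        problems.append(f"{paper_id} is not in the known-paper index")
    # Cheap grounding check: the bare ID should appear in the source text.
    if paper_id.removeprefix("arxiv:") not in source_text:
        problems.append(f"{paper_id} does not appear in the paper text")
    return problems

paper_text = "arXiv:2405.12345v1 [cs.CL] ... full text of the paper ..."
fabricated = {"paper_id": "arxiv:2024.01234"}
grounded = {"paper_id": "arxiv:2405.12345"}

fabricated_problems = check_paper_id(fabricated, paper_text)
grounded_problems = check_paper_id(grounded, paper_text)
print(fabricated_problems)
print(grounded_problems)
```

Both digests would pass schema validation. Only the check against something outside the model tells them apart.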

The schema is too nested and the model loses the structure. A three-level-nested schema performs noticeably worse than the flat version of the same data, even with strict mode, and degrades further without it. Signs: fields at depth three come back empty, or get assigned to the wrong parent object, or get filled with plausible-but-unrelated data. Picture the paper-digest task where somebody has nested everything inside a metadata object, with authors inside metadata, and affiliations inside authors. The model fills in authors for the right paper but puts the wrong affiliations on them. Fix: flatten. Almost every deeply nested schema I have seen in production can be flattened by moving context into separate fields rather than nested objects. If you genuinely need hierarchy, emit it as separate LLM calls per level rather than one call that tries to handle the whole tree.

The schema forces the answer before the reasoning. This is the Tam failure mode. The output is correct-shaped and shallowly wrong. Fix: move any reasoning fields ahead of any answer fields in the schema. If you are using Pydantic with Instructor, this means declaring reasoning fields first in the class body, the way the PaperDigest above did.

Enum choices drift. The model picks a value that is not in your allowed enum, or picks a plausible-sounding alias of a valid value. Imagine your methodology_type enum is the closed list ["RCT", "observational", "theoretical", "meta_analysis"]. The paper you hand the model is a simulation study, which is not in your list. Without constrained decoding, the model might write "simulation" and your validator fails. With constrained decoding, the model is forced to pick from the allowed four, and it will pick whichever one is closest, probably "theoretical", which is wrong but legal. Constrained decoding solved the syntactic problem (the output is always one of the allowed values) and introduced a semantic one (the allowed values do not cover the kind of paper the input was). Fix: expand the enum to cover the natural distribution of answers, or add an "other" value to the enum along with a free-text field for the reason.
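The escape-hatch fix can be sketched with the standard library's `enum` module. The `OTHER` member, the `methodology_other_reason` field name, and both helper functions are hypothetical. The point is only that "other" is legal and must come with an explanation.

```python
from enum import Enum

class MethodologyType(str, Enum):
    RCT = "RCT"
    OBSERVATIONAL = "observational"
    THEORETICAL = "theoretical"
    META_ANALYSIS = "meta_analysis"
    OTHER = "other"  # escape hatch for papers the list does not cover

def validate_methodology(data: dict) -> MethodologyType:
    """Check the enum value and require a reason whenever it is 'other'."""
    # Enum lookup raises ValueError for any value not in the list.
    value = MethodologyType(data["methodology_type"])
    if value is MethodologyType.OTHER and not data.get("methodology_other_reason"):
        raise ValueError("methodology_other_reason is required for 'other'")
    return value

def is_valid(data: dict) -> bool:
    try:
        validate_methodology(data)
        return True
    except ValueError:
        return False

print(is_valid({"methodology_type": "simulation"}))
print(is_valid({
    "methodology_type": "other",
    "methodology_other_reason": "simulation study",
}))
```

Logging the free-text reasons over time also tells you which values the enum is missing.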

Retry loops hide real problems. The pattern of validation-error-back-to-the-model helps, but if you set max_retries=10 and do not log the failures, you will not notice that 30% of your calls are now taking three attempts. Instrument the retry count per call. If your p50 [the median value, the number that half your calls sit below] is zero retries and your p95 [the number that 95% of your calls sit below] is one, you are fine. If your p50 starts creeping up, either the model has changed (version drift, see Section 20.1) or your schema has a latent problem that is now surfacing on more inputs. Either way, you want to know about it before the retry loop quietly burns a quarter of your monthly API budget.
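The health check described above fits in a few lines. This is a sketch, assuming you already log one retry count per call; the nearest-rank percentile and the function names are illustrative choices, and the thresholds are the ones from the paragraph above (p50 of zero, p95 of at most one).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def retry_health(retry_counts):
    """Summarise per-call retry counts against the p50 / p95 rule of thumb."""
    p50 = percentile(retry_counts, 50)
    p95 = percentile(retry_counts, 95)
    return {"p50": p50, "p95": p95, "healthy": p50 == 0 and p95 <= 1}
```

Running this over each day's calls and alerting when healthy flips to False is one cheap way to catch version drift or a latent schema problem early.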

Tool calls are structured outputs and fail the same way. Anthropic’s and OpenAI’s tool-use APIs are schema-constrained outputs with the name of a function attached. Every failure mode listed above applies to tool calls too, and the fix for each is the same. When a tool call returns the wrong arguments, check whether the tool schema puts reasoning before arguments. Check whether the schema is flat. Check whether the argument types are permissive enough to cover the actual distribution. See Chapter 12 for the broader agent-level treatment. The disciplines transfer verbatim from your paper-digest pipeline to your agent.
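To make the checklist above concrete, here is a hypothetical tool definition in the name / description / input_schema layout that Anthropic's tool-use API expects, plus a small audit helper. The tool name, its fields, and the audit function are all assumptions for illustration. Whether the model fills properties in the listed order is also an assumption worth verifying against your provider.

```python
# Hypothetical tool: names and fields are illustrative, not from a real system.
FILE_FINDING_TOOL = {
    "name": "file_finding",
    "description": "File one finding from a paper into the reading-list database.",
    "input_schema": {
        "type": "object",
        "properties": {
            # Reasoning first, so the arguments below are written after it.
            "rationale": {"type": "string"},
            # Flat arguments: no nested objects.
            "paper_id": {"type": "string"},
            "headline": {"type": "string"},
            # Enum with an escape hatch for inputs the list does not cover.
            "methodology_type": {
                "type": "string",
                "enum": ["RCT", "observational", "theoretical",
                         "meta_analysis", "other"],
            },
        },
        "required": ["rationale", "paper_id", "headline", "methodology_type"],
    },
}


def audit_tool_schema(tool: dict, reasoning_field: str = "rationale") -> list[str]:
    """Return a list of problems; an empty list means these checks pass."""
    props = tool["input_schema"]["properties"]
    problems = []
    if next(iter(props), None) != reasoning_field:
        problems.append(f"'{reasoning_field}' is not the first property")
    for name, spec in props.items():
        if spec.get("type") == "object":
            problems.append(f"'{name}' is a nested object; consider flattening")
    return problems
```

The audit covers only the two structural checks (reasoning first, flat arguments). Whether the argument types cover the real distribution of inputs still has to be judged against logged calls.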

Always validate and retry with the error message fed back. The highest-leverage engineering move in a structured-output pipeline is the retry loop: parse, catch the validation error, send the error back to the model as a follow-up message, retry. Most libraries do this by default; if yours does not, add it before you add anything else. Three tries with error feedback recover more bad outputs than ten tries without it.
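If your library does not provide the loop, a hand-rolled version can look roughly like this. It is a sketch under stated assumptions: call_model and validate are hypothetical callables you supply (call_model takes a list of chat messages and returns the reply text; validate takes the parsed dict and either returns a result or raises), and the wording of the feedback message is illustrative.

```python
import json


def call_with_feedback(call_model, validate, prompt, max_tries=3):
    """Parse, validate, and on failure send the error back before retrying.

    Returns (result, retries_used) so the retry count can be logged.
    """
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(1, max_tries + 1):
        raw = call_model(messages)
        try:
            # json.JSONDecodeError is a subclass of ValueError.
            return validate(json.loads(raw)), attempt - 1
        except (ValueError, KeyError, TypeError) as err:
            # Show the model its own reply and the exact error it caused.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your last reply failed validation: {err!r}. "
                           "Reply with corrected JSON only.",
            })
    raise RuntimeError(f"no valid output after {max_tries} tries")
```

The loop raises once the tries run out instead of returning a partial result, so a call that never validates shows up as an error in your logs.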

A last discipline worth naming. Structured outputs are not a substitute for evaluation. Output matching a schema tells you the structure is right. It does not tell you the answer is right. The paper-digest pipeline can return a perfectly shaped JSON object whose five_findings are all subtly wrong, and the schema will never complain. Finding out whether the five findings are right is a different job, and it is the job of the evaluation chapter (Chapter 16). Structured outputs give you a typed stream to evaluate against. They do not evaluate for you.

Anthropic. 2026. Tool Use (Function Calling). Claude API Docs. https://docs.anthropic.com/en/docs/build-with-claude/tool-use.

Dong, Yixin, Charlie F. Ruan, Yaxing Cai, et al. 2024. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. arXiv:2411.15100. https://arxiv.org/abs/2411.15100.

Geng, Saibo, Hudson Cooper, Michał Moskal, et al. 2025. “JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models.” International Conference on Learning Representations (ICLR 2025). https://arxiv.org/abs/2501.10868.

Google. 2026. Gemini API: Structured Output. Google AI for Developers. https://ai.google.dev/gemini-api/docs/structured-output.

Liu, Jason. 2026. Instructor: Structured Outputs for LLMs. Open-source library (567-labs/instructor). https://python.useinstructor.com/.

OpenAI. 2024. Introducing Structured Outputs in the API. OpenAI Blog, Aug 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/.

Park, Kanghee, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2024. “Grammar-Aligned Decoding.” Advances in Neural Information Processing Systems 37 (NeurIPS 2024). https://arxiv.org/abs/2405.21047.

Tam, Zhi Rui, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, and Yun-Nung Chen. 2024. “Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models.” Proceedings of EMNLP 2024 Industry Track. https://arxiv.org/abs/2408.02442.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2201.11903.

Willard, Brandon T., and Rémi Louf. 2023. Efficient Guided Generation for Large Language Models. arXiv:2307.09702. https://arxiv.org/abs/2307.09702.