Evals and Observability | Working with Large Language Models | Synoros

You cannot fix what you cannot see, and you cannot notice what you never measured.

That sentence is the whole chapter. There are two ways to watch an LLM system in production, and most shops confuse them. Picture the paper-summary pipeline we have carried through the book. The first way to check on it is to keep a folder of fifty papers on disk, each with a hand-written five-bullet summary you trust, and every Tuesday morning run the pipeline against those fifty and ask whether the bullets still match. The second way is to leave the pipeline running over the weekend, let real users paste in whatever they paste in, and on Monday look at what the pipeline actually did. Both are useful. They are not the same thing, and treating them as the same thing is the operator mistake that costs real money.

The first is evaluation. The second is observability. A shop that conflates them ships a green CI [continuous integration, the automated test runner that gates code changes before they reach production] run on Tuesday and discovers by Friday that production has been cliffing for a week on a class of paper nobody put in the folder.

Chapter 15 catalogued the shapes production LLM failures take. This chapter is about how you know which one is happening, and whether the pipeline you shipped last Tuesday has quietly regressed since.

Evals are offline, in CI, on a fixed set of inputs you chose in advance. Did the pipeline work, against the yardstick you built? Observability is online, in production, on inputs you could not have anticipated. What actually happened, and how much did it cost? Evals guard against regressions from one deploy to the next. Observability catches incidents and drift in the time between deploys. A shop with evals and no observability ships a clean green CI run and then watches real users hit a class of input the team never thought to test. A shop with observability and no evals watches latency and cost dashboards stay flat while the vendor silently swaps the model underneath, and quality cliffs in ways the dashboards were never built to see.

Both, always.

16.1 Eval is not observe

The cleanest way to see the split is to write down what a question of each kind looks like.

An eval question: given the fifty papers we labeled by hand last quarter, does the pipeline still produce a five-bullet summary that a human grader agrees with at least 92% of the time? The inputs are frozen. The ground truth is frozen. The measurement is a score against a rubric we wrote. The output is a number we can compare to last week’s number.

An observability question: yesterday between 14:00 and 15:00 UTC, what fraction of paper-summary requests took longer than two seconds to produce their first token, how many were the 200-page theses that blow the context window, and how much did that hour cost us? The inputs are whatever users pasted in. The output is a measurement of the pipeline in operation. There is no ground-truth answer the model was supposed to produce. There is only what happened.

These two questions are not interchangeable. An eval cannot tell you that p99 latency [the slowest one request in a hundred, the long-tail timing that decides how slow the product feels to its unluckiest users] doubled between 14:00 and 15:00. Observability cannot tell you that the pipeline’s accuracy on the held-out fifty dropped from 93% to 87%. The two disciplines measure different coordinates of the same system. You need both sets of coordinates to locate anything.
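A minimal sketch of the two questions side by side. The record shapes (`EvalResult`, `RequestTrace`) and field names are hypothetical, chosen for illustration rather than taken from any framework. The point is that the two functions read different data and neither can answer the other's question.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EvalResult:
    """One frozen golden-set example, graded by a human (hypothetical shape)."""
    paper_id: str
    grader_agrees: bool


@dataclass
class RequestTrace:
    """One production request as recorded by tracing (hypothetical shape)."""
    started_at: datetime
    time_to_first_token_s: float
    cost_usd: float


def eval_agreement_rate(results: list[EvalResult]) -> float:
    """Eval question: what fraction of the frozen set does the grader accept?"""
    if not results:
        raise ValueError("eval set is empty")
    return sum(r.grader_agrees for r in results) / len(results)


def slow_fraction_and_cost(traces, start, end, threshold_s=2.0):
    """Observability question: within [start, end), what share of requests
    was slow to first token, and what did the window cost in total?"""
    window = [t for t in traces if start <= t.started_at < end]
    if not window:
        return 0.0, 0.0
    slow = sum(t.time_to_first_token_s > threshold_s for t in window)
    return slow / len(window), sum(t.cost_usd for t in window)
```

The eval number is comparable week over week because its inputs never change. The observability numbers are only meaningful for the window they were computed over.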

The vendor landscape confuses this. Langfuse started as observability and added eval features. Braintrust started as eval and added observability. Both call their combined product “the AI platform” (Langfuse 2026; Braintrust 2025). If you buy one without knowing which half it came from, you inherit the other half as an afterthought, and the weak half is where your production surprises will come from.

Evals measure whether your system gets the right answer on inputs you chose. Observability measures what your system actually did on inputs you didn’t choose. A regression test suite is not a dashboard. A dashboard is not a regression test suite. If your platform vendor talks about “LLM monitoring” without distinguishing the two, they are selling you one of them and calling it both.

The forward-reference from Section 3.1 is worth cashing in here. In Chapter 3 we said that fixing the input and fixing the weights fixes the output, up to the sampler. That determinism is what makes evals possible at all. If the same prompt on the same model produced arbitrarily different outputs every time, offline eval would be measuring noise. It does not, and it is not. What does change between an eval run today and an eval run in six weeks is everything around the model: the prompt template, the retrieval index, the tool schemas, the model version the vendor silently rotated to, the harness version of the eval framework itself. Evals are how you notice when one of those moves.

Observability picks up where evals leave off. Evals run on the inputs you thought of. Observability tells you what inputs you actually received, which is reliably a broader distribution than the one you thought of. The gap between the two is where most production incidents live.

Hand the paper-summary task this pair of lenses and the point lands fast. The eval is the folder of fifty papers with five-bullet summaries a human graded last quarter, rerun every time someone touches the prompt or swaps the model. Observability is the trace of yesterday’s actual summary calls: which papers users pasted in, how long each summary took, how many tokens it burned, which ones triggered the refusal path, which ones came back with four bullets instead of five. The eval tells you whether the pipeline still works on the papers you chose. Observability tells you what papers users are actually bringing, which is the set the eval has to stay honest against.

16.2 Eval frameworks, and why there are so many

As of April 2026, five eval frameworks see serious production use, counting the two hybrid platforms as a single entry. They are not interchangeable, and the fact that operators treat them as interchangeable is the source of most failed eval projects.

Briefly, in the order you are likeliest to encounter them:

Promptfoo. Declarative, YAML [YAML Ain’t Markup Language, a human-readable config file format]-first, built for CI. You write a config file listing the prompts under test, the models under test, the test cases, and the assertions each test case has to pass. Promptfoo runs the matrix and tells you which cells fail. It was built to answer the question “does this prompt still work against this model?” as a GitHub Actions check, and OpenAI and Anthropic both use it for that purpose per their own public listings (Promptfoo 2026). Its opinion is that the prompt is the unit under test. For the paper-summary pipeline, this is the right fit when the question is “does my summarization prompt still produce five bullets on Sonnet 4.6 after I added one line to the system prompt?” If that matches what you are trying to measure, Promptfoo is the cheapest tool to stand up.

Inspect. Built by the UK AI Security Institute, maintained publicly on GitHub, designed around safety and capability evaluations with real tooling: sandboxes, tool-use harnesses, MCP [Model Context Protocol, the standard way tools and data sources plug into a model] integration (UK AI Security Institute 2025, 2024). Inspect’s opinion is that the capability is the unit under test: whether the model, in a realistic environment, can execute a task that has an observable success criterion. Over 2025 it became the de facto standard for government-grade safety evaluation. If your eval needs to simulate an agent using tools against a sandboxed target, Inspect is the framework built for that case. None of the other options here matches it on rigor for that case.

DeepEval. Pytest [Python’s standard test runner]-style, Python-native, with fifty-plus built-in metrics for common LLM tasks (Confident AI 2026). Its opinion is that the component is the unit under test: the answer your RAG pipeline produced, the JSON schema your structured-output call returned, the hallucination rate of one specific prompt chain. If your team writes Python tests for everything else and wants LLM evals to live in the same pytest directory with the same CI gate, DeepEval is the ergonomic match.

Ragas. RAG-specific. Its whole vocabulary is RAG vocabulary: faithfulness (does the answer stay inside what retrieval provided?), answer relevancy (does the answer address the question?), context precision (is the retrieved context actually relevant?), context recall (did retrieval miss anything important?) (Explodinggradients 2026). Its opinion is that the pipeline is the unit under test, and the pipeline is specifically a retrieval-then-generation pipeline. When the paper-summary pipeline grows into a full RAG system that pulls in related papers the user did not paste, Ragas measures the failure modes a general-purpose framework would miss or misclassify.

Braintrust and Langfuse. Both span eval and observability. Braintrust is commercial, closed-source, and has the tightest eval-to-observability loop in the category. Its pitch is that production traces roll straight into regression sets. Langfuse is open-source, self-hostable, ClickHouse-backed (ClickHouse [a fast open-source analytics database built to query billions of rows] acquired Langfuse in 2025), and its eval features are the younger half of the product (Braintrust 2026; Langfuse 2026). Treat both as hybrids. Either will let you do evals. Neither is the lightest-weight choice for a pure CI eval problem. They earn their place when you want the same trace to appear both in the production dashboard and in the regression set.

The frameworks disagree about what the unit of evaluation is, and that disagreement is the actual pedagogy. Promptfoo thinks in prompts. Inspect thinks in capabilities. DeepEval thinks in components. Ragas thinks in RAG pipelines. The standard operator mistake is picking a framework by popularity or by who has the prettier website, and then forcing a prompt-under-test tool to measure capabilities, or a capability-under-test tool to measure JSON validity. The fit matters.

The eval-framework choice is an “axis of test” choice, not a “which is best” choice. Ask what the unit under test is for your workload. Prompt-under-test, prefer Promptfoo. Capability-under-test with tool use, prefer Inspect. Python-native component tests, prefer DeepEval. RAG pipeline, prefer Ragas. Cross-cut of production traces and regression runs, prefer Braintrust or Langfuse. All five are active and well-maintained in April 2026; the framework wars are mostly a vendor game.

One more point that cuts across all five. Most of the expensive semantic metrics (faithfulness, answer relevancy, harm, tone) get scored by another LLM, not by a deterministic function. That practice is load-bearing across all five frameworks above, and it carries known biases. We unpack those in the next section. For now the observation is that “Promptfoo can run a Ragas metric which is evaluated by an LLM judge, graded against a rubric someone wrote at 2 AM” is a real sentence about a real pipeline, and each layer of that stack introduces a different kind of error.

16.3 What to measure: the three-layer stack

Eval frameworks give you a runner. They do not tell you what to measure. Two years of watching production LLM systems have settled a three-layer metric model, and that is the lens we use for the rest of the chapter.

Layer 1: system metrics. Latency p50, p95, p99. Tokens in, tokens out. Error rate. Cost per request. Time-to-first-token [how long before the model starts streaming its answer, the number users feel as “speed”]. These are standard operational telemetry, not specific to LLMs except that the unit prices look weird (a few dollars per million tokens, not per request) and the tail is worse than you expect. A p99 latency of five to ten times the p50 is routine once you include thinking-mode tokens, retries, and cold tool-calls. If p50 is two seconds, one request in a hundred will take closer to fifteen, which is long enough for a user to walk away from their keyboard. Put another way: on a service handling a hundred thousand paper summaries a day, roughly a thousand requests every day leave someone staring at a loading spinner for around fifteen seconds. Any generic APM [application performance monitoring, the category of tools that watch web services] product measures these, and the OpenTelemetry GenAI semantic conventions (OpenTelemetry 2026) have standardized the attribute names so you can do it portably.
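The tail arithmetic above can be sketched in a few lines. This uses the nearest-rank percentile definition, which is one of several in common use; your APM product may interpolate differently. The function names are illustrative.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """p50, p95, p99, and the p99/p50 ratio that describes the tail.
    Assumes latencies are strictly positive."""
    p50 = percentile(latencies_s, 50)
    p99 = percentile(latencies_s, 99)
    return {
        "p50": p50,
        "p95": percentile(latencies_s, 95),
        "p99": p99,
        "tail_ratio": p99 / p50,
    }


def requests_in_slowest_percent(requests_per_day: int) -> int:
    """About 1% of requests sit at or beyond the p99 latency."""
    return requests_per_day // 100
```

With a hundred thousand requests a day, `requests_in_slowest_percent` gives the thousand-a-day figure from the text.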

Layer 2: behavioral metrics. Tool-call success rate. Schema-violation rate (how often did your structured-output call come back malformed?). Refusal rate. Retry rate. Context-window-exhaustion rate. These are LLM-specific but still deterministic to measure. A tool call either parsed and executed, or it did not. A JSON response either conformed to the schema, or it did not. A summary either came back with five bullets or it did not. Behavioral metrics catch the whole class of “the model got subtly worse at following instructions after the vendor rotated versions” failures, and they are the layer most shops underbuild. A reasonable starting rule: for every tool you expose, instrument success rate. For every structured output you depend on, instrument schema-violation rate. These are cheap to compute and expensive to lose.
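Because these checks are deterministic, they need no judge. A rough sketch follows, assuming summaries arrive as plain text with one bullet per line and structured outputs arrive as raw JSON strings. The bullet pattern and the flat key-to-type schema format are simplifying assumptions, not a real JSON Schema validator.

```python
import json
import re

# Assumed bullet styles: "-", "*", "•", or a numbered marker such as "1." or "1)".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+\S")


def count_bullets(summary: str) -> int:
    """Count lines that look like a bullet with some content after the marker."""
    return sum(1 for line in summary.splitlines() if BULLET.match(line))


def has_five_bullets(summary: str) -> bool:
    return count_bullets(summary) == 5


def violates_schema(raw: str, required: dict[str, type]) -> bool:
    """True if raw is not a JSON object carrying every required key
    with a value of the expected Python type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return True
    if not isinstance(obj, dict):
        return True
    return any(not isinstance(obj.get(key), kind) for key, kind in required.items())


def rate(flags: list[bool]) -> float:
    """Fraction of True flags; 0.0 when there is nothing to count."""
    return sum(flags) / len(flags) if flags else 0.0
```

Feeding `rate` a list of `violates_schema` results per hour or per deploy gives the schema-violation rate. The same helper works for refusal and retry flags.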

Layer 3: semantic metrics. Faithfulness, groundedness, answer relevancy, helpfulness, harmfulness, tone. These require a judge, usually another LLM, because no deterministic function can tell whether a paper summary faithfully captures the paper. This is the expensive layer, in token cost and in epistemic cost, and it is where the Zheng et al. NeurIPS 2023 paper (Zheng et al. 2023) becomes required reading.

Here is the strange thing about LLM judges, before we name the biases. You are asking one autocomplete engine to grade the output of another autocomplete engine. Both engines learned from roughly the same internet. Both engines have the same blind spots. And the second one is going to write a number on the first one’s homework and hand it in as ground truth. If that arrangement does not raise a small hair on the back of your neck, reread Chapter 1. Zheng et al. measured what we should have expected: three biases, all of them systematic, all of them worth remembering by name.

Position bias. Show a judge two candidate answers in A/B form and the judge prefers whichever it read first (or, in some models, whichever it read second) at rates well above chance. Mitigation: shuffle order, run both orderings, take the agreement rate as the real signal.
Self-preference. A judge scores its own family’s outputs higher than a rival family’s outputs, at a measurable margin. A Claude judge likes Claude answers. A GPT judge likes GPT answers. Mitigation: use a judge from a different family than the model under test, or aggregate across judges.
Verbosity bias. Longer responses get higher scores for reasons unrelated to correctness. A five-bullet summary with each bullet padded to three sentences scores higher than a crisp five-bullet summary that actually caught the paper’s point. Mitigation: control for length in the rubric, or normalize scores against a length baseline.
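The position-bias mitigation (run both orderings, keep only the verdicts that survive the swap) can be sketched as follows. The `Judge` signature is a hypothetical stand-in for whatever call your framework makes to the judge model. Here it is a plain callable that returns "first" or "second".

```python
from typing import Callable, Optional

# Hypothetical judge interface: (question, first_answer, second_answer)
# returns "first" or "second" for whichever answer it prefers.
Judge = Callable[[str, str, str], str]


def order_robust_preference(judge: Judge, question: str, a: str, b: str) -> Optional[str]:
    """Ask the judge in both orders. Return "a" or "b" only when both
    orderings pick the same answer; return None when the verdict flips."""
    forward = judge(question, a, b)   # a is shown first
    reverse = judge(question, b, a)   # b is shown first
    pick_forward = "a" if forward == "first" else "b"
    pick_reverse = "b" if reverse == "first" else "a"
    return pick_forward if pick_forward == pick_reverse else None


def agreement_rate(judge: Judge, items: list[tuple[str, str, str]]) -> float:
    """Share of (question, a, b) items whose verdict survives the swap."""
    if not items:
        raise ValueError("no items to judge")
    verdicts = [order_robust_preference(judge, q, a, b) for q, a, b in items]
    return sum(v is not None for v in verdicts) / len(verdicts)
```

A judge that always prefers the first slot scores an agreement rate of zero under this scheme, which is the signal you want. Note that a judge with pure verbosity bias would pass this check, so the swap test addresses position bias only.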

None of these make LLM-as-judge unusable. They make it a measurement with known error bars, which is the correct framing. The cost of building and validating a good rubric, then measuring the judge’s agreement against a small set of human labels, is the part teams skip. When they skip it, their semantic scores are theater. Build a gold set of fifty to two hundred examples scored by hand. Run the judge on the same set. Measure judge-vs-human agreement (Cohen’s kappa [a statistical measure of rater agreement that corrects for chance] or simple percent agreement works for most purposes). Report the agreement number whenever you report the judge score. If the judge disagrees with humans 30% of the time, you are not measuring faithfulness. You are measuring a rough proxy for faithfulness, and your downstream decisions have to account for that.
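Judge-versus-human agreement is a small calculation. The sketch below computes percent agreement and Cohen's kappa for two raters over categorical labels, written out by hand so the chance correction is visible. A statistics library would normally do this for you.

```python
from collections import Counter


def percent_agreement(human: list[str], judge: list[str]) -> float:
    """Raw share of examples where the two raters gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected), where expected
    is the agreement two independent raters would reach by chance given
    each rater's own label frequencies."""
    observed = percent_agreement(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Percent agreement flatters a judge when one label dominates, because agreeing by chance is easy. Kappa removes that chance component, which is why it is the stricter of the two numbers to report.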

And now the Sagan move. Before you trust a vendor dashboard telling you your faithfulness score is 87 out of 100, ask the Baloney Detection questions of the claim. Is it reproducible? Run the same judge on the same fifty examples twice and see if the scores line up. Does a simpler explanation fit? If your faithfulness score climbed from 82 to 87 the week you switched judge models, the judge changed, not the pipeline. Who benefits from the number being high? The vendor whose dashboard shows it. That is the skepticism you want baked into every semantic metric you publish. Model it in your own shop and the number stops being theater.

Cost attribution is the fourth thing to instrument, and it sits across all three layers. Tag every call with feature, version, and user segment. The payoff is being able to say “feature X costs $0.04 per activation at p50, and 0.3% of activations cost more than $0.40 because the cascade fell through to Opus” rather than “our Anthropic bill is up 12% this month, we don’t know why.” Without cost attribution, every model-version rotation and every prompt change becomes a financial event you investigate two weeks late, after the bill arrives. The OpenTelemetry GenAI conventions define standard token-count attributes, gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. If you emit those correctly, cost attribution is one aggregation query away (OpenTelemetry 2026; Traceloop 2026).
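As a sketch of that aggregation, assume spans have been exported as plain dictionaries keyed by attribute name. The `gen_ai.*` keys come from the text above. The `app.feature` tag, the model names, and the per-million-token prices are placeholders invented for the example, not real vendor pricing.

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens: (input, output).
# These are assumptions for illustration; load real prices from config.
PRICES = {
    "small-model": (1.0, 5.0),
    "large-model": (15.0, 75.0),
}


def span_cost_usd(span: dict) -> float:
    """Cost of one model call from its token counts and model name."""
    in_price, out_price = PRICES[span["gen_ai.request.model"]]
    return (
        span["gen_ai.usage.input_tokens"] * in_price
        + span["gen_ai.usage.output_tokens"] * out_price
    ) / 1_000_000


def cost_by_tag(spans: list[dict], tag: str = "app.feature") -> dict[str, float]:
    """Sum cost per value of a custom tag (hypothetical attribute name).
    Spans missing the tag are grouped under "untagged" so they stay visible."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get(tag, "untagged")] += span_cost_usd(span)
    return dict(totals)
```

The "untagged" bucket is a deliberate design choice: a growing untagged total shows that someone shipped a call path without attribution.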

16.4 Observability infrastructure

Four platforms dominate production observability in April 2026. The choice between them is governance and cost, not capability. All four will solve most operators’ problems, and the one you should pick is the one whose data-residency story, pricing, and integration with your existing stack fit your shop. In opinionated order of who should consider each:

Langfuse. Open-source, self-hostable, ClickHouse-backed. ClickHouse acquired Langfuse in 2025, which is the clearest signal that the analytics-on-traces story is the moat here (Langfuse 2026). If your compliance team needs the traces to live on your own hardware, or if you already run ClickHouse, Langfuse is the first platform to try. It is also the cheapest serious option, because the cost of running it is bounded by your own hardware.

Arize Phoenix. Open-source, OpenTelemetry-native, vendor-agnostic (Arize AI 2026). Phoenix’s opinion is that observability should be a thin layer on top of OTel [OpenTelemetry, the open-source standard for emitting traces, metrics, and logs from any service] standard conventions, not a proprietary SDK. If your shop already runs OTel-heavy and you want LLM traces to land in the same pipeline as the rest of your application telemetry, Phoenix is the cleanest fit. Self-hostable or SaaS, both work.

LangSmith. Commercial, by LangChain. Best integrated if you are already in the LangChain / LangGraph stack (LangChain 2026). The tight integration cuts both ways. It runs genuinely smoother than any of the alternatives inside LangChain, and it is disproportionately painful outside. If you are not on LangChain, LangSmith is not the first choice.

Braintrust. Commercial, closed-source. Its differentiator is the tightest eval-to-observability loop in the category, plus a custom trace-scale query engine called Brainstore (Braintrust 2026). If your discipline depends on production traces rolling into regression sets as a primary workflow, Braintrust is built for that path. The cost is lock-in. Your traces live in their system, and migrating out is nontrivial.

What all four have in common, in 2026, is that they are OpenTelemetry-compatible. That sentence sounds dry. It is the most important sentence in this section. In 2024 each vendor shipped a proprietary SDK, and exporting from one to another meant rewriting your instrumentation from scratch. That made the backend choice feel like a marriage. You picked one, you wrote code against its SDK for six months, and moving off it meant six months of undo. The GenAI semantic conventions (OpenTelemetry 2026), developed through 2024-25 and stabilized in 2026, broke the marriage. They are the shared vocabulary that every platform now speaks.

For the paper-summary pipeline, this is what that shift actually looks like in practice. You instrument your pipeline once, using OpenLLMetry (Traceloop 2026) or an equivalent OTel-conformant library. Every model call, every retrieval, every tool use emits a span with standardized attribute names: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and so on. Those spans flow through an OTel collector. From the collector you can point at Langfuse, at Phoenix, at Braintrust, or at all three at once while you decide which one you like. The paper-summary pipeline does not know or care which backend is reading its traces. Six months from now, when one vendor raises its prices or another adds the feature you need, you swap backends with a config change, not a rewrite.

The shop that picked the backend first and wrapped the vendor’s SDK around their code is running the 2024 playbook. That playbook is obsolete.

Pick the instrumentation standard first, then the backend. Instrument with OpenLLMetry or an OTel-conformant library. Emit to an OTel collector. Treat Langfuse, Phoenix, LangSmith, and Braintrust as interchangeable consumers of the same data stream. This keeps the option to migrate. Any vendor that requires you to adopt a proprietary SDK to capture traces is selling you 2024 architecture.

One governance point worth naming. All four platforms capture prompts and completions by default. Those payloads contain whatever users typed and whatever your system returned. In most production deployments that means PII [personally identifiable information, anything that can identify a specific person: names, emails, phone numbers], secrets, internal data, and occasionally things the user was not supposed to send. If users paste papers into the summary pipeline, they also occasionally paste their own half-written draft with their name and their coauthors’ email addresses in the header. Those drafts land in your observability backend exactly as pasted, unless you scrub them. Before you turn on any observability backend, decide on a scrubbing policy: what fields are masked, what patterns are redacted, how long raw traces are retained, who can query them. Langfuse, Phoenix, LangSmith, and Braintrust all support masking and retention policies, and they all make it easy to forget to turn those on.
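A scrubbing pass can start very small. The sketch below masks email addresses and phone-like digit runs before a payload leaves your process. The patterns are deliberately crude assumptions (the phone pattern will also catch some long numeric strings in papers), and the `prompt` and `completion` field names are hypothetical. Treat it as a starting shape, not a complete PII policy.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", then at least nine digits allowing
# spaces, dots, dashes, and parentheses in between.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like sequences with fixed tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def scrub_span(attributes: dict, fields=("prompt", "completion")) -> dict:
    """Return a copy of the span attributes with the named text fields
    scrubbed. Non-string values and other fields are left untouched."""
    cleaned = dict(attributes)
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = scrub(value)
    return cleaned
```

Running the scrub in your own process, before export, means the raw text never reaches the backend even if that backend's masking settings are left off.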

16.5 Regression tracking and continuous eval

Once you have an eval framework and an observability backend, the next discipline is running the evals on every change. This is the part most shops skip, because it is tedious in a way that model benchmarking is not.

The change surface is larger than it looks. Each of the following is a silent migration that can tank behavior on the paper-summary pipeline overnight:

A vendor rotates a model version without telling you. Anthropic, OpenAI, and Google all do this for non-versioned aliases. Using gpt-5.4 instead of a pinned gpt-5.4-2026-04-15 means you are on a moving target. One morning the summaries have six bullets instead of five and nobody changed the prompt.
A prompt template changes. One word in the system prompt. One example added to the few-shot set.
The retrieval index rebuilds with slightly different chunking.
A tool schema gets a new optional field.
The eval framework itself updates, and a metric’s internal implementation changes under you.

Every one of these should trigger a full eval run against a stable regression set before the change hits production. This is what Promptfoo’s CI integration was built for, and why the framework is structured as YAML files committed to the repo alongside your code (Promptfoo 2026). The prompt, the model version, the test cases, and the scoring config all diff in pull requests. When someone bumps the model version, the PR shows the config change and the test results side by side. That is the loop.
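One cheap CI check that complements the eval run is refusing unpinned model identifiers outright. This sketch assumes pinned identifiers end in a date suffix, as in the examples in this chapter. Real vendors use varying formats, so treat the pattern as an assumption and adapt it per provider.

```python
import re

# Assumption: a pinned id ends in a date, written either as -YYYY-MM-DD
# or as -YYYYMMDD. Aliases without a date suffix are treated as unpinned.
PINNED_SUFFIX = re.compile(r"-\d{4}-?\d{2}-?\d{2}$")


def is_pinned(model_id: str) -> bool:
    """True when the model id carries a trailing date suffix."""
    return bool(PINNED_SUFFIX.search(model_id))


def unpinned_models(model_ids: list[str]) -> list[str]:
    """Model ids from a config that would float with vendor rotations."""
    return [m for m in model_ids if not is_pinned(m)]
```

A CI step can fail the build when `unpinned_models` returns anything, which turns "someone forgot to pin" from a production surprise into a review comment.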

The regression set itself needs discipline most teams underprovision. A useful structure has three components:

Golden set. Curated examples with correct answers, covering the main categories your system handles. Fifty to five hundred examples, maintained by hand. This is what you gate deploys on.
Adversarial set. Known-bad inputs: prompt-injection attempts, format-violating queries, edge cases the system has failed on historically. Smaller, twenty to a hundred examples. This catches the failures you already know about and want to keep from regressing.
Distributional sample. A random sample from production traffic, re-labeled periodically. This catches the failures you did not know about, the ones observability surfaced that the golden set did not anticipate.

The golden set is the one that rots. Here is the mechanism. The more aggressively your team optimizes prompts against the fifty papers in the folder, the less those fifty papers tell you about real behavior, because you have trained the prompts to pass them. Goodhart’s law in one paragraph: when a measure becomes a target, it stops being a good measure. Rotation is mandatory. A reasonable cadence is retiring 20% of the golden set every quarter and replacing it with fresh examples drawn from the distributional sample. Teams that skip rotation end up with golden sets that everything passes, and production failures the golden set does not predict, which is worse than having no golden set at all.
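The rotation itself is mechanical. A minimal sketch, assuming the golden set is a list ordered oldest first and that retiring the oldest examples is acceptable. The text does not say which examples to retire, so that choice is an assumption here. Replacements are drawn at random from the labeled distributional sample so nobody hand-picks them.

```python
import random


def rotate_golden_set(golden: list, fresh_pool: list, fraction: float = 0.2, seed=None):
    """Retire the oldest `fraction` of the golden set and replace it with
    a random draw from freshly labeled production examples.

    Returns (new_golden, retired). The set keeps its original size."""
    n_retire = round(len(golden) * fraction)
    if n_retire > len(fresh_pool):
        raise ValueError("not enough fresh examples to replace the retired ones")
    rng = random.Random(seed)  # seedable so a rotation can be reproduced
    replacements = rng.sample(fresh_pool, n_retire)
    retired, kept = golden[:n_retire], golden[n_retire:]
    return kept + replacements, retired
```

Keeping the retired examples, rather than deleting them, lets you check later whether the pipeline still passes cases it was once tuned against.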

The Shumailov et al. model-collapse result (Shumailov et al. 2024) is the training-data analog. When training data becomes increasingly self-similar (models trained on model-generated data), the output distribution collapses in ways that look fine on the original training distribution and fail on anything real. This is the same mechanism, at a different scale. Train a model on its own outputs for long enough and the model’s world narrows to the slice of the world its earlier self liked. Tune prompts on a never-rotated golden set for long enough and your team’s world narrows to the slice of the world your golden set liked. Both machines collapse the same way. The fix is the same in both cases: keep feeding in fresh, unselected reality.

For the paper-summary pipeline, this is concrete. The golden set is fifty papers you chose on purpose, covering the shapes users actually paste: a short proof, a long review article, an empirical study with tables, a clinical trial writeup, a mathematical preprint, each with a hand-written five-bullet summary you trust. The adversarial set is the set of papers that broke the pipeline before: the 200-page thesis that blew the context window, the paper whose first page was an OCR-scrambled scan, the preprint that tried to prompt-inject its own summary into the model. The distributional sample is fifty papers real users pasted in last month, re-labeled by the team on the first Monday of each quarter.

Now picture the common Tuesday-morning moment. Somebody on your team opens a PR that swaps the summarization model from Sonnet to Haiku to save cost. In CI, all three sets run. The PR comment reports whether the cheaper model still hits the five-bullet shape on the golden set, still survives the OCR-scrambled case on the adversarial set, and still covers the distribution of actual papers users hand it. If any of those three cliffs, the PR does not merge. If all three hold, the cost savings are real. The mechanism is not complicated. It is just tedious, which is why most shops skip it.
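The merge gate reduces to a comparison per set. In the sketch below, the 0.92 golden threshold echoes the 92% agreement figure used earlier in the chapter. The other two thresholds and the score dictionary shape are assumptions for illustration.

```python
# Minimum acceptable score per regression set. Only the golden figure
# comes from the chapter's running example; the rest are placeholders.
THRESHOLDS = {"golden": 0.92, "adversarial": 0.90, "distributional": 0.85}


def failing_sets(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Names of the sets that fall below their threshold.
    A set with no reported score counts as failing."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


def gate_exit_code(scores: dict[str, float]) -> int:
    """0 when every set holds, 1 otherwise, suitable for a CI step."""
    failures = failing_sets(scores)
    for name in failures:
        print(f"regression gate failed on the {name} set")
    return 1 if failures else 0
```

Treating a missing score as a failure matters: a set that silently stopped running should block the merge, not pass it.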

Every model version rotation and every prompt change is a silent migration. Gate deploys on a rotating regression set. Pin model versions. Put prompts in source control. Run the eval in CI. Rotate 20% of the golden set per quarter from production samples. This is the discipline. Skipping any step is how shops end up running a quarter-stale eval suite that passes 100% while production quality cliffs.

16.6 Tying the two halves together

The eval-observability loop is not two independent disciplines taped together. Done well, they feed each other. The mechanism is worth naming, because most teams never close the loop and therefore get half the value of either half.

Production traces surface failure modes. Here is what that looks like on the paper-summary pipeline. On Tuesday at 14:17 UTC, observability shows that eight paper-summary requests in a row returned four bullets instead of five. You pull the traces. All eight papers start with a long table of contents before the abstract. The model is treating the TOC as the body and summarizing that. You now have a new test case: “paper whose first 2,000 tokens are a table of contents.” You add it to the adversarial set with a hand-written correct five-bullet summary. On Wednesday morning, when someone opens a PR to tweak the chunking strategy, CI runs the new test case and the PR is gated on whether the fix holds.

That is the loop closing in one direction. Production finds a surprise. The surprise becomes a test. The next deploy cannot regress on it.
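Turning a surprise into a test is mostly bookkeeping. A sketch, assuming traces are dictionaries with hypothetical `trace_id` and `input` keys and that test cases live in a plain list. Real platforms have their own dataset APIs for this step.

```python
import hashlib


def trace_to_test_case(trace: dict, expected_summary: str, reason: str) -> dict:
    """Build a regression case from a production trace plus a hand-written
    correct answer. The id is a content hash, so the same paper pasted
    twice maps to the same case."""
    paper = trace["input"]
    return {
        "id": hashlib.sha256(paper.encode("utf-8")).hexdigest()[:12],
        "input": paper,
        "expected": expected_summary,
        "reason": reason,                      # why this case exists
        "source_trace_id": trace["trace_id"],  # link back to production
    }


def add_to_set(test_set: list[dict], case: dict) -> bool:
    """Append the case unless one with the same id is already present."""
    if any(existing["id"] == case["id"] for existing in test_set):
        return False
    test_set.append(case)
    return True
```

Recording the reason and the source trace pays off months later, when someone asks why the adversarial set contains a paper that opens with a table of contents.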

The reverse direction matters equally. Eval failures, particularly failures on the distributional sample or the adversarial set, predict production risk. Say your golden set flags that the model scores poorly on papers with embedded code blocks. That finding should propagate to your dashboard. A panel filters production traffic to requests containing embedded code blocks, and the alert fires if the failure rate climbs above the eval-measured baseline. The eval told you where to look. Observability tells you when to act.
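The alert condition can be sketched as a comparison against the eval-measured baseline instead of a fixed threshold. The ratio and minimum-increase defaults are illustrative assumptions. Tune them to your traffic volume.

```python
def failure_rate(failures: int, total: int) -> float:
    """Observed failure rate for a filtered slice of production traffic."""
    return failures / total if total else 0.0


def should_alert(current: float, baseline: float,
                 max_ratio: float = 2.0, min_increase: float = 0.01) -> bool:
    """Fire when the current rate has climbed well past the baseline.

    Both conditions must hold: the rate exceeds baseline * max_ratio, and
    the absolute climb is at least min_increase. The second condition keeps
    tiny baselines from paging anyone over noise."""
    return current > baseline * max_ratio and (current - baseline) >= min_increase
```

With these defaults, a slice that sits at 2% in the eval and drifts to 6% in production fires the alert, while a slice that wobbles from 0.1% to 0.3% does not.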

The mechanism for both directions is the trace as a shared object. Braintrust built its product around this observation: a trace captured in production can be replayed as an eval input, and an eval result can be attached as metadata to the production trace (Braintrust 2026). Langfuse and Phoenix both support the same workflow with different UX. The key is that the trace format stays rich enough to capture not just the final input and output but the full sequence of model calls, tool calls, and intermediate results, and that eval scores can be attached to any point in that sequence.

Operators who run eval and observability as separate disciplines with separate dashboards and separate data stores end up doing the same work twice. The shop with the shared-trace loop closed finds the TOC surprise at 14:17 and has a regression test gating the next deploy by 17:00. The shop without the loop finds the same surprise, opens a bug, loses it in the backlog for six weeks, and ships the same regression again.

16.7 Practical defaults for a new project

One opinionated starting recipe, for a team standing up eval and observability for a new LLM feature with no existing infrastructure. The feature could be the paper-summary pipeline. The recipe does not care. Swap any component for the equivalent in another stack; the shape matters, not the brand names.

Instrument with OpenLLMetry (Traceloop 2026) or an equivalent OTel-conformant library from day one, before the first user sees the feature. Emit spans for every model call, tool call, and retrieval. Include prompt, completion, token counts, model version, feature tag, and user segment as span attributes. This is the single highest-leverage piece of work in this chapter and the one most commonly skipped.
Stand up Langfuse self-hosted (Langfuse 2026) as the observability backend, or Phoenix (Arize AI 2026) if your shop already runs OTel-heavy. Configure retention to thirty days for raw traces and indefinite for aggregated metrics. Turn on PII masking before the first production request.
Write a Promptfoo config (Promptfoo 2026) committed to the repo with twenty test cases covering the main categories the feature handles. For the paper-summary pipeline that means a short proof, a long review, an empirical study with tables, a clinical trial, a math preprint, and fifteen more. Pin model versions explicitly: claude-sonnet-4-6-2026-02-15, not claude-sonnet-4-6. Run it in CI on every PR. Gate merges on green.
Add an LLM-as-judge rubric for the two or three semantic properties that matter (faithfulness, tone, harm). Validate the judge against fifty human-labeled examples and report the agreement number in the eval output (Zheng et al. 2023). If agreement falls below a Cohen’s kappa of 0.6, your judge is not a measurement, it is vibes; fix the rubric before shipping.
Instrument behavioral metrics as first-class dashboards. Schema-violation rate, tool-call success rate, refusal rate. Alert on drift above baseline, not on absolute thresholds. A 2% schema-violation rate might be fine on Tuesday; a climb from 2% to 6% on Friday is the signal, regardless of the absolute number.
Tag every call with feature and version. Build a cost-per-feature dashboard. Review it weekly.
After a month in production, sample fifty traces from the distributional sample, label them by hand, and add them to the regression set. Retire 20% of the original golden set. Repeat quarterly.
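The judge-validation step in the recipe comes down to one number, so it is worth seeing how that number is computed. The sketch below implements the standard two-rater Cohen's kappa over categorical labels. The function name, the pass/fail labels, and the example split are assumptions made for illustration; only the 0.6 gate comes from the recipe.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters over the same items: (p_o - p_e) / (1 - p_e)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Observed agreement: the fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Fifty hand-labeled examples (hypothetical): the judge matches the human on 40 of them.
human_labels = ["pass"] * 30 + ["fail"] * 20
judge_labels = ["pass"] * 25 + ["fail"] * 5 + ["fail"] * 15 + ["pass"] * 5

kappa = cohens_kappa(human_labels, judge_labels)
judge_is_usable = kappa >= 0.6  # the gate from the recipe
```

In this made-up split the raw agreement is 80%, but chance agreement is 0.52, so kappa comes out near 0.58 and the judge fails the gate. That gap between raw agreement and kappa is the reason to report kappa rather than a percentage.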

That is the starting kit. Everything after it (custom judge models, retrieval-specific Ragas metrics, Inspect-based capability evals for agent paths, cost attribution down to the user segment) extends this frame for a specific workload. The frame itself has four parts: instrument first, self-host if you can, gate CI on pinned evals, and close the loop from production to regression set. That frame is the part to get right before any of the extensions pay off.
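The drift-over-baseline alerting default from the recipe is small enough to sketch in full. The version below compares the current rate of a behavioral metric with the mean of recent history and fires on the ratio, not on the absolute value. The function name, the use of a simple mean as the baseline, and the 2x ratio are assumptions chosen for illustration; a real deployment would tune the window and the threshold to its own noise.

```python
from statistics import mean


def drift_alert(history: list[float], current: float, max_ratio: float = 2.0) -> bool:
    """Return True when `current` exceeds the historical baseline by more than `max_ratio`.

    `history` holds recent per-period rates for one metric (for example daily
    schema-violation rates); `max_ratio` is a hypothetical default, not a standard.
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    if baseline == 0:
        return current > 0  # any violations at all are a change from a clean baseline
    return current / baseline > max_ratio


# A feature that normally sits around 2% schema violations.
usual_week = [0.02, 0.021, 0.019, 0.02]

quiet_day = drift_alert(usual_week, 0.02)   # same as baseline, no alert
bad_friday = drift_alert(usual_week, 0.06)  # roughly triple the baseline, alert

# A different feature that has always run at 8%: a higher absolute number, but no drift.
noisy_but_stable = drift_alert([0.08, 0.079, 0.081], 0.08)
```

The third case is the one an absolute threshold gets wrong in one direction or the other: a fixed 5% line would page forever on the stable feature, and a fixed 10% line would miss the climb from 2% to 6%.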

Evals and observability are the disciplines that let you tell whether the thing you built is the thing you built. Nothing else in the book matters if you cannot answer that.

Arize AI. 2026. Arize Phoenix: AI Observability and Evaluation. Official documentation. https://phoenix.arize.com/.

Braintrust. 2025. Best AI Evals Tools for CI/CD in 2025. Braintrust blog. https://www.braintrust.dev/articles/best-ai-evals-tools-cicd-2025.

Braintrust. 2026. Braintrust: The AI Observability Platform. Official documentation. https://www.braintrust.dev.

Confident AI. 2026. DeepEval: The LLM Evaluation Framework. Official documentation and GitHub (confident-ai/deepeval). https://deepeval.com/docs/getting-started.

Explodinggradients. 2026. Ragas: Supercharge Your LLM Application Evaluations. Official documentation. https://docs.ragas.io/en/stable/.

LangChain. 2026. LangSmith: AI Agent and LLM Observability Platform. Official documentation. https://docs.langchain.com/langsmith/observability.

Langfuse. 2026. Langfuse: Open Source LLM Engineering Platform. Official documentation. https://langfuse.com/docs/observability/overview.

OpenTelemetry. 2026. GenAI Semantic Conventions. OpenTelemetry project specification. https://opentelemetry.io/docs/specs/semconv/gen-ai/.

Promptfoo. 2026. Promptfoo Documentation. Official product documentation. https://www.promptfoo.dev/docs/intro/.

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. “AI Models Collapse When Trained on Recursively Generated Data.” Nature 631. https://www.nature.com/articles/s41586-024-07566-y.

Traceloop. 2026. OpenLLMetry: Open-Source Observability for GenAI/LLM, Based on OpenTelemetry. Official documentation and GitHub (traceloop/openllmetry). https://www.traceloop.com/docs/openllmetry.

UK AI Security Institute. 2024. Announcing Inspect Evals. AISI Work blog. https://www.aisi.gov.uk/blog/inspect-evals.

UK AI Security Institute. 2025. Inspect AI: A Framework for Large Language Model Evaluations. Official documentation. https://inspect.aisi.org.uk/.

Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2306.05685.