Appendices

Glossary

Every specialist term that gets a bracket gloss in the main chapters, collected here alphabetically. The chapter where each term first appears is noted in italics; if a definition is not enough, the chapter has the longer treatment.

A

ABI. Application binary interface, the agreed shape by which two separate programs can call into each other. Ch 12

activation values. The numbers flowing through the model as it processes a sentence. Ch 24

adversarial evals. Tests designed to find where a model breaks, not tests designed to flatter it. Ch 18

airgap. A machine with no network connection at all, physical or logical. Ch 21

anaphora. Any linguistic reference back to something previously named. Ch 11

API. Application programming interface, the agreed shape of a request that a service accepts. Ch 19

API call. How the interface talks to the actual model. Ch 1

array. Data structure holding the conversation. Ch 1

attacker-controlled webhooks. URLs that a service calls to deliver data to an external listener. Ch 19

B

BAA. Business Associate Agreement, the HIPAA-specific contract that lets a vendor legally process health data. Ch 21

base64. An encoding scheme that turns arbitrary bytes into printable letters, digits, and a few symbols; anyone can reverse it, so it hides nothing. Ch 19
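
A minimal sketch of the round trip, using Python's standard base64 module. The three sample bytes are arbitrary and chosen only for illustration.

```python
import base64

raw = b"hi!"                          # three arbitrary bytes
encoded = base64.b64encode(raw)       # bytes -> printable ASCII
decoded = base64.b64decode(encoded)   # and straight back, no key needed

print(encoded)         # b'aGkh'
print(decoded == raw)  # True
```

Decoding needs no secret, which is why base64 counts as an encoding and not as encryption.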

being fine-tuned. Further training on a specific task. Ch 7

BF16. A 16-bit floating-point format (bfloat16), one of the low-precision number formats that use 8 or 16 bits instead of 32, trading a little accuracy for a lot of speed. Ch 9

4-bit quantization. A compression trick that stores each of the model’s numbers in 4 bits instead of the native 16, shrinking memory use by roughly 4x, covered in 22  Quantization in Depth. Ch 18
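
The "roughly 4x" figure is plain arithmetic on bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts the weights only; real quantized files also carry scale factors and other overhead, which is why the saving is approximate.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 7_000_000_000  # hypothetical model size, for illustration only

fp16_gb = weight_memory_gb(params, 16)  # native 16-bit weights
q4_gb = weight_memory_gb(params, 4)     # the same weights at 4 bits

print(fp16_gb, q4_gb, fp16_gb / q4_gb)  # 14.0 3.5 4.0
```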

bitsandbytes. A library for running models at 8 bits per weight on consumer hardware. Ch 24

blast radius. The set of files, services, and users a single bad call can damage. Ch 13

BM25. A keyword-matching scoring algorithm from classical information retrieval, covered in Hybrid retrieval: BM25 still matters. Ch 11

borrow checker. Rust’s compile-time guard that refuses to let the same piece of data be modified from two places at once. Ch 20

C

C4. A several-hundred-billion-token slice of crawled web text, used as a shared yardstick across papers the same way WikiText-2 is. Ch 22

calibration datasets. A small collection of example text used to guide the rounding. Ch 24

carries forward. The KV cache, explained in The inference pipeline end-to-end. Ch 22

CBRN. Chemical, biological, radiological, nuclear. Ch 19

chat template. A small formatting rule that stitches all the messages into one long stream of tokens the model can read. Ch 5

ClickHouse. A fast open-source analytics database built to query billions of rows. Ch 16

Cohen’s kappa. A statistical measure of rater agreement that corrects for chance. Ch 16
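
A minimal sketch of the chance correction for two raters, assuming both label the same items in the same order. The function name and the sample labels are illustrative, and the sketch does not handle the edge case where chance agreement is exactly 1.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labeled at random with their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(a, b))  # 0.5
```

The raters agree on 3 of 4 items, but half of that agreement would be expected by chance, so kappa lands at 0.5 and not 0.75.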

compiled artifact. A prompt that gets produced by a program rather than written by hand, the way compiled code gets produced by a compiler rather than typed as machine code. Ch 5

compiler. The program that translates source code to something the machine can run. Ch 20

constrained-decoding harness. A runtime wrapper that forces the model to pick only tokens that keep the output valid JSON, covered in 8  Structured Outputs. Ch 6

context limit. How much prior text the model can see at once. Ch 1

Context Protocol. The Model Context Protocol (MCP), a standard that describes the tools a model can call and the shape of their inputs and outputs, adopted across the major frontier providers through late 2025. Ch 21

context window. How much prior text the model can see at once on a single request. Ch 11

continuous batching. A serving technique that lets new requests slot into the same GPU pass as existing ones, so a hundred users can share the same forward pass. Ch 21

cosine similarity. A number from -1 to 1 measuring how closely two embedding vectors point in the same direction, covered in Embeddings and cosine similarity. Ch 11
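
A minimal sketch of the calculation on plain Python lists: the dot product of the two vectors divided by the product of their lengths. The function name is illustrative, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # right angles: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite directions: -1.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction: close to 1
```

Only direction matters: the last pair differs in length by a factor of two and still scores as near-identical.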

cross-encoder. A reranker that scores a query and a candidate chunk together, more accurate but slower. Ch 11

cross-entropy. How surprised the model was by the correct answer. Ch 7

cuBLAS. NVIDIA’s standard GPU math library. Ch 9

CUDA. Nvidia’s GPU-compute software stack, the industry default. Ch 21

D

dataflow accelerator. A chip designed so data flows through dedicated hardware stages instead of being shuttled back and forth to general-purpose registers. Ch 4

decode ceiling. The rate at which the model can produce each new word once it is running. Ch 21

decoder. The part of the inference loop that picks which token to emit next. Ch 12

decohere. Lose its quantum-mechanical character and become classical. Ch 26

default UX. User experience, the feel of interacting with the product. Ch 20

dense FP32. 32-bit floating-point numbers, the traditional default for scientific computing. Ch 4

dense matmul. Matrix multiplication, the single operation that dominates LLM compute. Ch 21

dense model. A model where every parameter gets used on every token, as opposed to mixture-of-experts models that only activate a fraction of their weights per token. Ch 22

diffs. The model’s proposed changes to the code, shown as removed and added lines. Ch 20

dispatcher. The harness code that reads a tool-shaped output and runs the matching function. Ch 13

distinct-n. Measures what share of the word-chunks (n-grams) in the output are unique; higher means less repetition. Ch 9
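
A minimal sketch of one common formulation: unique n-grams divided by total n-grams. It assumes whitespace-split words stand in for tokens, and the function name is illustrative; published implementations differ in tokenization and normalization.

```python
def distinct_n(text, n=2):
    """Share of n-grams in the text that are unique (1.0 means no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    return len(set(ngrams)) / len(ngrams)

print(distinct_n("the cat sat the cat sat"))  # 0.6, since 3 of 5 bigrams are unique
print(distinct_n("a b c d"))                  # 1.0, no repetition
```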

document frequency. Short for inverse document frequency (IDF), a weight that punishes common words and rewards rare ones. Ch 10

DPA. Data Processing Agreement, the contract that governs what a cloud vendor can and cannot do with your data. Ch 21

DPO. Direct preference optimization, a lighter-weight alternative to RLHF that adjusts the weights directly from preference pairs without a separate reward model. Ch 5

E

elementwise operations. Operations that touch each number on its own, no cross-talk between them. Ch 4

example texts. The calibration samples, used to measure the typical range and distribution of the numbers inside the model. Ch 22

F

fast biencoder. An embedding-based retriever that scores queries and documents independently. Ch 11

fine-tuning. Further training the model on your own dataset so its weights bake the behavior in. Ch 6

floating-point number. The way modern hardware stores 16-bit real numbers: one bit says whether the number is positive or negative, five bits say roughly how big it is, ten bits say its fine detail. Ch 22
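
The one-five-ten split can be read straight out of the bits. This sketch uses the half-precision format code in Python's standard struct module; the helper name is illustrative.

```python
import struct

def fp16_fields(x):
    """Return (sign, exponent, fraction) bit fields of x stored as a 16-bit float."""
    # Pack as half precision, then reinterpret the same two bytes as an unsigned integer.
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15                # 1 bit: 0 for positive, 1 for negative
    exponent = (bits >> 10) & 0x1F   # 5 bits: roughly how big
    fraction = bits & 0x3FF          # 10 bits: the fine detail
    return sign, exponent, fraction

print(fp16_fields(1.0))   # (0, 15, 0)
print(fp16_fields(-2.5))  # (1, 16, 256)
```

The exponent field is stored with an offset of 15, so 1.0 shows up as exponent 15 and not 0.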

FLOPs. Floating-point operations, a unit for counting how much arithmetic was done during training. Ch 7

forward pass. The single pass through the model’s layers that turns an input sequence into a probability distribution over the next token. Ch 6

FP16. 16-bit floating point, the standard full-precision format for inference. Ch 23

FrugalGPT’s cascade. A pipeline that runs a cheap model first and only falls back to an expensive one when the cheap answer looks unreliable. Ch 18

G

GDDR7. The latest generation of graphics memory, a step up from GDDR6 in bandwidth and power efficiency. Ch 23

GEMM. General matrix multiply, the single operation that dominates inference compute. Ch 23

generic APM. Application performance monitoring, the category of tools that watch web services. Ch 16

NVLink. A direct high-bandwidth cable that lets two NVIDIA GPUs talk to each other at hundreds of GB/s, bypassing the motherboard. Ch 23

grammar. A rulebook for what sequences of tokens count as valid, expressed either as a finite-state machine or a context-free grammar, or a JSON schema compiled into one of those. Ch 8

green CI. Continuous integration, the automated test runner that gates code changes before they reach production. Ch 16

Groq’s LPU. Language Processing Unit, Groq’s purpose-built chip for transformer inference, heavy on bandwidth and scheduling determinism. Ch 4

grouped-query attention. An attention variant where several query heads share a single key and value head, which cuts KV cache size without much accuracy loss. Ch 4

H

H100’s HBM3. High-bandwidth memory, the stacked-die memory datacenter GPUs use for the fastest possible throughput. Ch 23

H100s. Nvidia’s 2023 datacenter GPU, roughly $30,000 per card. Ch 21

hallucination. The model saying confidently wrong things. Ch 1

has DOM. Document object model, the tree structure a browser uses to represent a webpage. Ch 11

HBM. High Bandwidth Memory, stacked DRAM chips sitting millimeters from the compute die, connected by thousands of short wires instead of one narrow trace on a circuit board. Ch 4

hidden dimension. The number of coordinates the model uses to represent each token. Ch 24

high-QPS. Queries per second. Ch 21

homoglyph substitution. Replacing a Latin letter with a visually identical character from another script. Ch 19

hundred GFLOPS. Billion floating-point operations per second. Ch 4

hyperparameter. A tunable setting that controls how the system behaves. Ch 11

I

inner products. The basic arithmetic operation attention and matrix multiplication are both built out of. Ch 24

interpretability research. The subfield that tries to figure out what is happening inside a trained model. Ch 1

it computes. The intermediate numbers the model computes as it runs, called activations. Ch 22

FP8. 8-bit floating point, half the size of FP16. Ch 23

J

JSON Schema. A plain-text way to describe what fields a JSON object should have and what types each field takes. Ch 12

K

kernel. The specific piece of GPU code that does the matrix multiply. Ch 24

known-and-patched CVE. Common vulnerabilities and exposures, the public database that tracks every disclosed software security flaw. Ch 5

KV cache. Key-value cache, a buffer that stores the attention inputs from every token the model has already seen, so the next token does not have to recompute them. Ch 4

L

latency SLA. Service-level agreement, the contractual bound on how slow a response can be before you are in breach. Ch 21

latency SLAs. P95, meaning 95% of requests must meet the latency bound; only the slowest 5% may exceed it. Ch 21

latency tracker. An independent service that measures and publishes provider-by-provider latency and price numbers, refreshed weekly. Ch 21

Latent Attention. Multi-head Latent Attention (MLA), an attention variant introduced by DeepSeek that compresses the KV cache into a low-rank latent space instead of storing every key and value outright. Ch 4

latent space. A smaller set of numbers that carry the same information, reconstructed when the attention layer needs them. Ch 23

lifecycle events. Named moments in the model call where the harness pauses and gives your code a chance to run. Ch 13

linter. A tool that scans source code for common mistakes and style violations. Ch 20

LLM-as-judge. Using a language model to score another model’s outputs, covered in 16  Evals and Observability. Ch 11

load-bearing tensors. The rectangular arrays of numbers that make up each layer’s weights. Ch 22

lost-in-the-middle. The measured tendency of long-context models to neglect information buried in the middle of their context window. Ch 13

M

many-worlds interpretation. The reading of quantum mechanics under which every possible outcome of a quantum event occurs in its own branch of reality. Ch 26

marginal cost. The incremental cost of generating one more token, once the hardware is paid for. Ch 21

masks logits. The raw scores the model assigns each possible next token before they are turned into probabilities. Ch 8

Matched temperature. The knob that controls how confident the model’s next-word sampling is; higher means more varied output, lower means more deterministic. Ch 6
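
A minimal sketch of one common way the knob is applied: divide the raw scores by the temperature before the softmax step turns them into probabilities. The function name and sample scores are illustrative, and the sketch assumes a temperature greater than zero.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # arbitrary scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(cold)  # most of the probability sits on the top token
print(hot)   # probability is spread more evenly
```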

matmul time. The moment the model multiplies two matrices together to produce the next layer’s output. Ch 24

matrix multiplication. A huge grid of numbers multiplied against another huge grid, the main arithmetic a transformer does. Ch 9

MCP. Model Context Protocol, the standard way tools and data sources plug into a model. Ch 16

MCQA. Multiple-choice question answering, where the model picks from a fixed list of candidate answers. Ch 9

means PII. Personally identifiable information, anything that can identify a specific person: names, emails, phone numbers. Ch 16

Memory bandwidth. How fast bytes move between the GPU’s memory and its compute units. Ch 21

MI300X. AMD’s 2024 datacenter counterpart to the H100. Ch 21

MLX. Apple’s array and machine-learning framework for Apple Silicon. Ch 21

model’s inference. The computation the model runs when given an input. Ch 1

MRR. Mean reciprocal rank, which rewards correct chunks for ranking higher. Ch 11
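
A minimal sketch, assuming each query has exactly one correct chunk and that a query whose correct chunk never appears contributes zero. The function name and sample data are illustrative.

```python
def mean_reciprocal_rank(ranked_results, correct):
    """Average of 1/rank of the correct item across queries."""
    total = 0.0
    for results, target in zip(ranked_results, correct):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank  # rank 1 scores 1, rank 2 scores 1/2, and so on
                break
    return total / len(ranked_results)

rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
print(mean_reciprocal_rank(rankings, targets))  # (1 + 1/2 + 1/3) / 3
```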

MTEB. Massive Text Embedding Benchmark, the community leaderboard for embedding evaluation. Ch 10

N

NDCG. Normalized discounted cumulative gain, a graded-relevance ranking metric. Ch 11
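
A minimal sketch of one standard formulation: each result's relevance grade is divided by the log of its position, and the total is normalized by the score of the ideal ordering. The function names and grades are illustrative; some variants use an exponential gain in place of the raw grade.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance counts for less the lower it ranks."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1]))  # already in ideal order: 1.0
print(ndcg([1, 2, 3]))  # best result ranked last: below 1.0
```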

neural network. A network where each step’s output feeds into the next step’s input, chaining forward one token at a time. Ch 4

nothing OOMs. Runs out of memory, which crashes the process and loses your work. Ch 23

O

OCR. Optical character recognition, the process of extracting text from images of pages. Ch 11

orchestration. The code that sits around a model call and decides what goes in, what happens, and what comes out. Ch 13

orthogonal matrix. A square grid of numbers that, when you multiply by it and then by its transpose (its mirror image across the diagonal), lands you back where you started. Ch 24
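
A small sketch of the "back where you started" property, using a 2x2 rotation as the example orthogonal matrix. The helper functions and the 30-degree angle are illustrative choices.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Flip a matrix across its diagonal."""
    return [list(row) for row in zip(*m)]

theta = math.radians(30)
q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

product = matmul(q, transpose(q))
print(product)  # approximately [[1, 0], [0, 1]], up to floating-point rounding
```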

OTel. OpenTelemetry, the open-source standard for emitting traces, metrics, and logs from any service. Ch 16

outlier problem. The problem that a handful of weights carry magnitudes far outside the bulk of the distribution, making one-size-fits-all scaling fail. Ch 22

P

p99 latency. The slowest one request in a hundred, the long-tail timing that decides how slow the product feels to its unluckiest users. Ch 16

PagedAttention. A memory-management technique that lets a serving stack pack more concurrent requests onto one GPU than naive memory allocation allows. Ch 21

autoregressive decode. The token-by-token generation loop the model runs to produce each new word. Ch 22

parser. The software that reads the model’s reply and tries to extract structured pieces from it. Ch 6

payload. The active instructions an attacker wants the model to follow. Ch 19

PCIe. Peripheral Component Interconnect Express, the bus that connects a GPU card to the rest of the computer. Ch 4

per-channel scales. Different rounding step-sizes for different columns of the weight matrix. Ch 24

policy layer. The always-included instructions that sit above your message. Ch 7

pool. Combine many vectors into one by averaging them together. Ch 11

position embeddings. Here, rotary position embeddings (RoPE): the mechanism modern transformers use to encode how far apart two tokens are, by rotating each token’s internal representation by an angle proportional to its position. Ch 15

pre-training loss. The score the model was trained to minimize while reading the internet. Ch 7

preference alignment. A final pass that nudges the model toward answers humans rate higher. Ch 7

prefill. The model reading through your whole prompt before it starts writing the reply. Ch 21

probability distribution. A ranking of how likely each possible next token is. Ch 1

Pytest. Python’s standard test runner. Ch 16

Q

Q8. 8-bit quantization, higher quality than Q4 and half the size of FP16. Ch 23

R

RAG. Retrieval-augmented generation, the pattern of looking up relevant text and slipping it into the prompt, covered in 10  RAG Without the Buzzwords. Ch 11

RAG pipelines. Retrieval-augmented generation, where the system fetches relevant text from a database and slips it into the prompt before calling the model, covered in 10  RAG Without the Buzzwords. Ch 13

FP4. 4-bit floating point, the most aggressive quantization format in current production hardware. Ch 23

time-to-first-token. How long before the model starts streaming its answer, the number users feel as “speed”. Ch 16

reranker. A smaller model that re-scores retrieved passages for relevance, covered in Reranking with cross-encoders. Ch 11

retrieval-augmented generation. Fetching relevant documents at request time and pasting them into the prompt so the model can read them. Ch 6

right angles. Orthogonal, unrelated. Ch 10

RLHF. Reinforcement learning from human feedback, the preference-alignment step that nudges the model toward answers humans rate higher. Ch 7

ROCm. AMD’s GPU-compute software stack, the open counterpart to Nvidia’s CUDA. Ch 21

ROT13. A simple cipher that rotates each letter thirteen places forward in the alphabet. Ch 19
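
A minimal sketch using the rot_13 codec in Python's standard codecs module. The sample string is arbitrary.

```python
import codecs

plain = "Hello, world"
rotated = codecs.encode(plain, "rot_13")     # each letter moves 13 places, wrapping past z
restored = codecs.encode(rotated, "rot_13")  # 13 + 13 = 26, so a second pass undoes the first

print(rotated)   # Uryyb, jbeyq
print(restored)  # Hello, world
```

Punctuation and spaces pass through untouched; only letters rotate.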

router. A small program that sits in front of the model layer and decides which backend handles each incoming request. Ch 21

routing. Sending each request to the model best suited to it, instead of sending every request to the same model. Ch 18

S

Same Hopper. The architecture generation before Blackwell, introduced in 2022. Ch 23

sampler. The part of the generation loop that picks one token from the model’s probability distribution. Ch 8

second-order information. A table of second derivatives that captures how pairs of weights interact to shape the layer’s output, called the Hessian. Ch 22

self-BLEU. Measures how similar the model’s own outputs are to each other across multiple runs; lower means more variety. Ch 9

sequence grows. The model’s scratch-memory of everything that has been said in the conversation so far, stored in a specific format attention can read. Ch 24

serving endpoint. A web API that accepts the same request shape as OpenAI’s servers, so any client written to talk to OpenAI works against it unchanged. Ch 21

serving stack. The software layer that takes incoming requests, batches them, and hands them to the GPU. Ch 24

several CVE-class. A formal vulnerability disclosure with a tracking number. Ch 12

SFT. Supervised fine-tuning: showing the model thousands of example instructions paired with good answers, so it learns the shape of following an instruction. Ch 5

SFT stage. Supervised fine-tuning, the step where the model learns from human-written example conversations, covered in Where the weights came from. Ch 6

softmax. A mathematical step that turns a list of raw scores into a proper probability distribution that sums to 1. Ch 15

specification. A plain, widely used standard for how one program asks another program to run a function. Ch 12

SQL injection. The decades-old class of bug where attacker-controlled text gets pasted into a database query and changes what the query means. Ch 19

SRAM. Static RAM, the small, fast on-chip memory that sits right next to the compute units, orders of magnitude faster than HBM. Ch 4

stateless function. A function that keeps no memory of its past use, introduced in Not a chatbot. Ch 11

stdio. Standard input and output, the plain-text pipe two local programs use to talk. Ch 12

stress coreference. When a pronoun or phrase refers back to something named earlier. Ch 11

system prompt. The always-included instructions that sit above your message. Ch 1

T

tames sycophancy. The well-documented tendency of RLHF-trained assistants to tell users what they seem to want to hear, even when the correct answer is something different. Ch 5

TCO. Total cost of ownership. Ch 21

tensor cores. Special circuitry on newer datacenter GPUs that multiplies 4-bit numbers in bulk. Ch 24

Tensor parallelism. Splitting each layer across two cards so each holds half the weights and does half of the layer’s arithmetic in parallel. Ch 23

test-driven development. The practice of writing the test before writing the code it tests. Ch 20

TFLOPS. Tera floating-point operations per second, a measure of raw compute throughput. Ch 23

TGI. Text Generation Inference, Hugging Face’s software for running models behind an API. Ch 24

toward compute-bound. Bottlenecked by the GPU’s math units rather than by the speed of its memory. Ch 23

type hints. Declarations like “this variable is a string” or “this function returns a list”. Ch 20

U

UI. User interface, the visible part you type into. Ch 1

unified memory. Memory shared between the CPU and GPU on the same chip, which matters for LLMs because it lets the GPU work with much larger models than the 24 GB you get on a consumer card. Ch 18

unified-memory. CPU and GPU share the same RAM, so there is no copy step between them. Ch 21

user tokens. Strings that prove a caller has permission to act as a specific user. Ch 19

usually WikiText-2. A roughly 2-million-word sample of Wikipedia articles, used as a shared yardstick across papers. Ch 22

V

VRAM. The memory sitting on the graphics card, shared between the weights and the running conversation. Ch 24

W

was tokenized. Chopped into chunks before the model ever saw it. Ch 7

weights. The trained numbers that make the model work. Ch 10

without regex. A pattern-matching rule used to pull fields out of text. Ch 8

wrong enum. A closed list of allowed values for a field, like "status": "open" or "status": "closed". Ch 8

Y

YAML. YAML Ain’t Markup Language, a human-readable config file format. Ch 16

your p50. The median value, the number that half your calls sit below. Ch 8

your p95. The number that 95% of your calls sit below. Ch 8
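
A minimal sketch of how p50 and p95 can be read off a batch of measurements, using the nearest-rank method. The function name and the sample latencies are made up for illustration, and monitoring tools often interpolate between neighbors instead, so their numbers can differ slightly.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * len, avoiding floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

latencies_ms = list(range(1, 101))  # made-up data: 1 ms, 2 ms, ..., 100 ms

print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 95
```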