Part I — The Reality

How We Got Here

From perceptrons to ChatGPT, a short history of the machines that guess the next word.

LLMs did not appear in 2022. They are the latest link in a chain that began in 1958. Each link in the chain was built to fix something that was broken about the link before it, and each new fix eventually revealed a new problem that the next link had to solve. Knowing the chain is the fastest way to understand why today’s models are shaped the way they are.

2.1 The trainable dot: perceptrons (1958)

In the 1950s, a researcher named Frank Rosenblatt at Cornell wanted a machine that could learn from examples instead of one you had to hand-program for each new task. He built the perceptron [a small circuit that takes several inputs, weights each one by a number, adds them up, and fires a yes-or-no based on whether the total crosses a threshold]. When you showed the perceptron an input and it guessed wrong, the rule for updating its weights was simple: nudge each weight a little bit in the direction that would have made the answer correct. Do that thousands of times, and the perceptron learns to recognize patterns.
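The update rule described above fits in a few lines. What follows is a minimal sketch, not Rosenblatt's hardware: the function names, the step size, and the toy task (learning the logical AND of two inputs) are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Fire 1 if the weighted sum crosses the threshold (here, zero), else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, step=1.0):
    """Perceptron rule: after each wrong guess, nudge the weights toward the right answer."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)  # -1, 0, or +1
            for i, x in enumerate(inputs):
                weights[i] += step * error * x
            bias += step * error
    return weights, bias


# Toy task (an assumption for illustration): logical AND of two inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
and_outputs = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
```

When the guess is right, the error is zero and nothing moves. When it is wrong, each weight shifts in proportion to the input that fed it, which is the "nudge" in the paragraph above.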

Rosenblatt’s machine, a physical device built at Cornell Aeronautical Lab with rooms of wiring and lights, could learn to tell the letter A from the letter B. It was genuinely startling for the time. Newspapers declared that machines would soon be doing human work (Rosenblatt 1958).

Then, in 1969, Marvin Minsky and Seymour Papert published a book (Perceptrons) that proved something painful: a single perceptron can only learn patterns that are separable by a straight line (or, with more inputs, a flat plane). It fails on anything more complex, including the simple “exclusive or” operation (true when exactly one of two inputs is on) that every child can do (Minsky and Papert 1969). The book was influential enough that it is widely credited with stalling neural-network research for more than a decade. This is sometimes called the first “AI winter.”

The one-layer limit was the lesson. A single perceptron cannot combine features; it can only weigh them. Real problems require combinations. The fix would take nearly twenty years to arrive, and it required both more compute and a specific mathematical insight that was not obvious in 1969.
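The difference between weighing features and combining them can be made concrete. The sketch below first searches a small grid of weights (a hypothetical range, picked only to keep the search short) for a single threshold unit that computes exclusive-or, and finds none. It then wires three units into two layers by hand, which does compute it. The weights are set by hand here, not learned.

```python
from itertools import product


def unit(weights, bias, inputs):
    """One threshold unit: fires 1 when the weighted sum is above zero."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Try every single unit whose weights and bias lie on a small grid (hypothetical range).
grid = [v / 2 for v in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
single_unit_solves_xor = any(
    all(unit((w1, w2), b, x) == y for x, y in XOR_CASES)
    for w1, w2, b in product(grid, repeat=3)
)


def two_layer_xor(inputs):
    """Hand-set weights: XOR = (a OR b) AND NOT (a AND b)."""
    either = unit((1, 1), -0.5, inputs)          # fires if at least one input is on
    both = unit((1, 1), -1.5, inputs)            # fires only if both inputs are on
    return unit((1, -1), -0.5, (either, both))   # "either" but not "both"


two_layer_outputs = [two_layer_xor(x) for x, _ in XOR_CASES]
```

The second layer never sees the raw inputs. It sees two features the first layer computed, and that is the combination a single unit cannot make.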

2.2 Stacking and the training trick: backprop (1986)

The way around the single-layer limit was to stack perceptrons into layers. The idea had been around since the 1960s. The problem was that nobody knew how to train a stack. If the network’s final answer was wrong, which layer was at fault, and by how much should each weight move?

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published the paper that popularized backpropagation [the math trick that figures out how much each layer’s weights should move to reduce the error, by tracing the error backwards through the layers] (Rumelhart et al. 1986). Suddenly you could train a multi-layer network.
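To make "tracing the error backwards" concrete, here is a minimal sketch on a network small enough to follow by hand: two inputs, two hidden units, one output. The layer sizes, the tanh activation, the squared-error loss, and every number are assumptions for illustration, not details from the 1986 paper. The last lines check one backpropagated gradient against a direct measurement.

```python
import math


def forward(params, x):
    """Two inputs -> two tanh hidden units -> one linear output."""
    w_hidden, w_out = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, output


def loss(params, x, target):
    """Half the squared distance between the network's answer and the target."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Trace the error backwards: output layer first, then the hidden layer."""
    w_hidden, w_out = params
    hidden, output = forward(params, x)
    d_output = output - target                   # how the loss changes with the output
    grad_out = [d_output * h for h in hidden]    # gradient for each output weight
    grad_hidden = []
    for j in range(len(w_hidden)):
        # Error reaching hidden unit j, scaled by the tanh slope (1 - h^2).
        d_pre = d_output * w_out[j] * (1 - hidden[j] ** 2)
        grad_hidden.append([d_pre * xi for xi in x])
    return grad_hidden, grad_out


# Arbitrary illustrative numbers (assumptions, not from any paper).
params = ([[0.5, -0.3], [0.8, 0.2]], [1.0, -1.5])
x, target = [1.0, 2.0], 0.25
grad_hidden, grad_out = backprop(params, x, target)

# Sanity check: nudge one first-layer weight and measure how the loss actually changes.
eps = 1e-6
bumped = ([[0.5 + eps, -0.3], [0.8, 0.2]], [1.0, -1.5])
numeric = (loss(bumped, x, target) - loss(params, x, target)) / eps
```

The measured value `numeric` and the backpropagated value `grad_hidden[0][0]` agree to several decimal places. Backpropagation gets every such number in one backward pass instead of one nudge per weight.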

This unlocked what people later called “deep learning” (roughly 2006 onward, once compute caught up to the theory). Networks with several layers started beating hand-engineered systems at image recognition, speech recognition, and machine translation. But one thing was still awkward: these networks processed their input in one shot, start to finish, with no memory of what they had seen before. That was fine for deciding what was in a single photograph. It was not fine for anything involving a sequence of words.

2.3 Short memory: recurrent networks (1990s)

Language is a sequence. What the word “deposit” means right now depends on whether the word “bank” appeared five words ago. A standard neural network, shown a word, has no memory of what came before.

Recurrent neural networks (RNNs) fixed this by feeding the network’s own internal state from the previous step back in as part of its next input. Step by step, the network built up a hidden summary of what it had seen (Elman 1990).

RNNs could now handle sequences. But they had a new problem: they forgot too fast. During training, the mathematical signal that told each time-step how to adjust its weights got weaker and weaker the further back you went. In practice, RNNs could remember the last handful of words well, but anything from several sentences ago washed out. This is called the vanishing gradient [the math signal that tells each layer how to update gets smaller and smaller as it travels back through the network, until effectively nothing reaches the earliest layers] problem.
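The fading signal can be watched directly in the smallest possible recurrent network, one with a single number as its memory. This is a sketch under stated assumptions: the recurrence h_t = tanh(w * h_{t-1} + x_t), the weight 0.9, and the constant input are all chosen for illustration. The code tracks how much the final state still depends on the very first one.

```python
import math


def influence_of_first_step(inputs, w=0.9):
    """Run a one-number RNN and track d(h_t)/d(h_0) at every step.

    h_t = tanh(w * h_{t-1} + x_t); by the chain rule, each step multiplies
    the running derivative by w * (1 - h_t**2).
    """
    h, derivative, history = 0.0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + x)
        derivative *= w * (1 - h ** 2)
        history.append(derivative)
    return history


# Hypothetical input sequence: fifty steps of a small constant signal.
history = influence_of_first_step([0.5] * 50)
after_5, after_50 = abs(history[4]), abs(history[49])
```

Every factor in the product is smaller than one, so the product shrinks geometrically. After five steps some influence remains; after fifty it is effectively zero, which is the "washing out" described above.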

2.4 Actually long memory: LSTMs (1997)

In 1997, Sepp Hochreiter and Jürgen Schmidhuber published Long Short-Term Memory (LSTM), an RNN variant with an explicit internal design for deciding what to remember, what to forget, and what to write out at each step. In the form that became standard, the network has three “gates” that control the flow of information through its memory over time (Hochreiter and Schmidhuber 1997); the gate dedicated to forgetting was added in a follow-up refinement a few years later.
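A one-number version of the cell shows what the gates do. This is a sketch of the now-standard three-gate form, not the exact 1997 formulation. The parameter layout and the gate settings are hypothetical, chosen so that the cell simply holds on to what it already stores.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1, so it can act as a dial."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a one-number LSTM cell.

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre("forget"))           # how much old memory to keep
    write = sigmoid(pre("input"))             # how much new content to store
    read = sigmoid(pre("output"))             # how much memory to expose
    candidate = math.tanh(pre("candidate"))   # the new content on offer
    c = forget * c_prev + write * candidate   # updated memory cell
    h = read * math.tanh(c)                   # what the next step gets to see
    return h, c


# Hypothetical settings: forget gate pinned near 1 (keep), input gate pinned near 0 (ignore).
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0  # start with the value 1.0 stored in memory
for _ in range(100):
    h, c = lstm_step(0.3, h, c, params)
```

After a hundred steps the stored value is still within one percent of where it started. In a trained LSTM the gate values are learned and change with the input, so the cell can decide case by case when to hold, overwrite, or reveal its memory.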

LSTMs worked. For the next fifteen years, they were the default choice for tasks involving language, speech, and time series. By 2014 they were doing machine translation, speech recognition, and handwriting generation well enough to ship in products. But they were still slow. Because each step depended on the previous one, training could not be parallelized the way image-network training could. A long sentence was a long chain of sequential steps.

2.5 Words as directions in space: word2vec (2013)

Meanwhile, there was a separate problem: how should a computer represent a word in the first place? The traditional approach was to assign every word an arbitrary number. “Cat” might be 1341, “dog” might be 7822. The network had no way to know that cat and dog were related.

In 2013, Tomas Mikolov and colleagues at Google published word2vec, a way to turn words into embeddings [lists of numbers, typically a few hundred, that together represent the “meaning” of a word as a direction in space] (Mikolov et al. 2013). The trick: train a small network to predict either the surrounding words given one center word, or the center word given its neighbors. The numbers it learned along the way turned out to carry meaning. Words with similar meanings ended up close together in the resulting space.

Famously, the arithmetic worked. king - man + woman ≈ queen. Take the point in space that represented “king,” subtract the direction of “maleness,” add the direction of “femaleness,” and you landed very close to the point for “queen.” Meaning, it turned out, had shape. A machine that had only ever seen text, with no access to pictures or people or kingdoms, had reconstructed a piece of how meanings relate to each other out of nothing but the company words keep.
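The arithmetic can be replayed on a toy scale. The vectors below are hand-made, not learned: three numbers per word, loosely standing for royalty, maleness, and femaleness. Real word2vec embeddings have hundreds of dimensions with no such tidy labels, so treat this strictly as an illustration of the geometry.

```python
import math


def cosine(a, b):
    """Similarity of direction: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hand-made 3-number embeddings (NOT learned): [royalty, maleness, femaleness].
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.8, 0.1],
    "apple": [0.0, 0.1, 0.1],
}

target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]
# As is usual for this kind of analogy test, the three query words are left out of the search.
candidates = {w: v for w, v in embeddings.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
```

Here `nearest` comes out as "queen". The comparison uses direction (cosine similarity), not raw distance, which matches the idea of meaning as a direction in space.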

Every modern LLM still starts by turning the input into embeddings of this general kind. The details are more sophisticated now, but the core idea, that the meaning of a word can be captured by a direction in a high-dimensional space, has not changed.

2.6 Sentences as meaning transfer: seq2seq (2014)

Ilya Sutskever, Oriol Vinyals, and Quoc Le at Google published Sequence to Sequence Learning with Neural Networks in 2014 (Sutskever et al. 2014). The target task was translation. The architecture had two halves: an encoder that read the source sentence one word at a time and squished everything it had seen into a single fixed-size vector, and a decoder that took that vector and generated the translated sentence one word at a time.

The idea of compressing a whole sentence into a single vector of fixed size was mathematically clean. In practice, it was a bottleneck. Long inputs got partially lost at the squish. Translations of short sentences were great. Translations of paragraph-long sentences degraded noticeably.

2.7 Looking back anywhere: attention (2015)

Picture writing a summary of a long research paper. You do not have the paper memorized, but it is sitting open in front of you. As you write each bullet of your summary, your eyes flick back to specific pages, pull the relevant sentences into focus, and use them to shape the next line. The bullet about methodology sends you to the methods section. The bullet about results sends you to the tables. Different bullets send you to different pages. You are looking things up as you write, not carrying the whole paper in your head.

That move was exactly what seq2seq was missing. The encoder in seq2seq squished the whole input into a single summary note. The decoder then had to write the entire output from that one note, unable to look back.

In 2015, Dzmitry Bahdanau and colleagues published a fix (Bahdanau et al. 2015). They let the decoder, at each step of generating the output, look back at every word in the input and give each one a score for how relevant it was to what the decoder was about to produce. High score: pay a lot of attention to that word. Low score: mostly ignore it. The decoder then mixed the input words in proportion to their scores and used the mixture to help decide the next output word. At the next step it scored the input words fresh, because what mattered for word 2 was not what mattered for word 3.

They called this attention [at each step of generating, the model looks back at every input word, scores how relevant each one is right now, and mixes them in by those scores to inform the next output].

To see why this matters, take the sentence “The dog I saw yesterday at the park barked loudly.” When the model is about to produce a French verb for “barked,” it has to know that “barked” refers to “the dog,” not to “the park” or “I” or “yesterday.” Attention lets the model score “the dog” much higher than the other words at exactly that moment, so the verb gets conjugated and placed correctly in the French sentence. A few words later, when the model is placing “park” in French, the attention scores will look completely different. Every output step runs its own little search over the input.
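The score-then-mix step looks like this in miniature. One swap to note plainly: the scoring here is a simple dot product, whereas the 2015 paper scored relevance with a small learned network. The two-number word vectors and the query are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, word_vectors):
    """Score every input word against the query, then mix the words by those scores."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    size = len(word_vectors[0])
    mixture = [
        sum(w * vec[i] for w, vec in zip(weights, word_vectors)) for i in range(size)
    ]
    return weights, mixture


# Made-up 2-number vectors for three input words (assumptions for illustration).
words = ["dog", "park", "yesterday"]
word_vectors = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
query = [3.0, 0.0]  # a decoder state that is "looking for" something dog-like
weights, mixture = attend(query, word_vectors)
most_attended = words[weights.index(max(weights))]
```

With this query nearly all the weight lands on "dog", and the mixture is almost exactly the dog vector. A different query at the next output step would produce a different set of weights over the same three words.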

The immediate payoff in 2015 was a jump in translation quality, especially on long sentences. The deeper payoff took a couple of years to land. Attention turned out to be a general-purpose tool for any task that needed to mix information from many different places in an input: summarization, question answering, code completion, image captioning, music generation. All of them are “look at the relevant parts, ignore the rest” problems. Every modern LLM does attention many times per layer, across many layers, for every single token it produces. The transformer architecture is built out of stacked attention layers, each paired with a plain feed-forward layer, and very little else.

Attention is where the rest of the book comes from. Before attention, a network had to compress everything it had seen into one fixed-size summary and work from that. Attention let the network keep all the input around and pick, fresh at every step, which parts mattered most for the next decision. Every ability today’s LLMs have that involves “finding something in a long context” traces back to this idea. When you paste a forty-page paper in and the model answers a question about page thirty-two, attention is what makes page thirty-two legible to the model at that moment.

2.8 Throw out recurrence entirely: the Transformer (2017)

The 2015 attention paper had bolted attention on top of recurrence. Attention did most of the real work, but the recurrent backbone was still forcing everything to process one step at a time. Training was still slow.

In 2017, Ashish Vaswani and seven colleagues at Google published Attention Is All You Need (Vaswani et al. 2017). The proposal was audacious: get rid of recurrence entirely. Use attention for everything. All positions in the input could be processed at the same time, in parallel, instead of one by one.
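Why this parallelizes can be seen in a stripped-down sketch of attention applied within a single sequence. It leaves out much of the real architecture: there are no learned projections, no multiple heads, and no position information. The vectors are made up. What it keeps is the structural point that no position's result depends on another position's result.

```python
import math


def self_attention(vectors):
    """Every position attends to every position; no step waits on another."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:  # each pass through this loop is independent of the others
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


# Made-up vectors for a three-word input (assumptions for illustration).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(vectors)
```

The loop is written one position at a time only for readability. Because each pass reads the shared input and nothing else, all passes could run at once, which a recurrent network cannot do since each of its steps needs the previous step's state.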

This is the transformer architecture. Every LLM in use today is some variant of the transformer. GPT-3, GPT-4, Claude, Gemini, Llama, Qwen, DeepSeek: all transformers. The 2017 paper is the single most important milestone in the chain. Once you could train on far more data, far faster, the scale race began.

2.9 Pretrain once, fine-tune later: GPT-1 and BERT (2018)

The obvious next move was: can you train a giant transformer once on raw, unlabeled text, and then fine-tune it for many specific tasks, rather than training a fresh model from scratch every time? 2018 was the year this clicked. Google’s BERT learned by reading sentences with some words masked out and filling them in (Devlin et al. 2019). OpenAI’s GPT-1 learned by reading one word at a time and predicting the next (Radford et al. 2018). Both were enormously effective: a big pretrained transformer, fine-tuned with a small amount of labeled data, beat specialized systems on almost every language benchmark that existed.
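The two training recipes differ only in how they turn raw text into questions with known answers. A minimal sketch on one sentence follows; the function names are hypothetical, and the masked positions are picked by hand here, whereas BERT chooses them at random.

```python
def next_word_examples(words):
    """GPT-style: given everything so far, predict the next word."""
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_examples(words, mask_positions, mask_token="[MASK]"):
    """BERT-style: hide chosen words and predict them from both sides."""
    hidden = [mask_token if i in mask_positions else w for i, w in enumerate(words)]
    targets = {i: words[i] for i in mask_positions}
    return hidden, targets


sentence = "the cat sat on the mat".split()
gpt_pairs = next_word_examples(sentence)
bert_input, bert_targets = masked_examples(sentence, mask_positions={2, 5})
```

Neither recipe needs a human to label anything, because the text supplies its own answers. That is what made training on raw, unlabeled text possible at scale.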

The modern split between base model [the raw pretrained transformer before any task-specific training] and fine-tuned model [a base model further trained on a specific task or style] comes from this era.

2.10 Scale and the surprise: GPT-3 (2020)

GPT-2 (2019) scaled GPT-1 up to 1.5 billion parameters. GPT-3 (2020) scaled it up again, to 175 billion. That is roughly as many weights inside the model as there are stars in our galaxy. The expectation was that GPT-3 would be a better version of GPT-2 at the same tasks.

The surprise was in-context learning. GPT-3 could do tasks it had never been fine-tuned for, just by seeing a few examples in its prompt. Give it three examples of English-to-French translation and ask it to translate a fourth sentence. It would. Give it any pattern, in plain language, and it would continue the pattern. The paper (Brown et al. 2020) called this few-shot learning. It was not predicted in advance. It emerged from scale.
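A few-shot prompt is ordinary text laid out as a pattern with the last slot left open. The sketch below builds one for the translation case. The "English:" and "French:" labels are one hypothetical layout among many, and the example sentences are made up.

```python
def few_shot_prompt(examples, query):
    """Lay out solved examples, then an unsolved one for the model to continue."""
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)


examples = [
    ("Good morning.", "Bonjour."),
    ("Thank you.", "Merci."),
    ("Where is the station?", "Où est la gare ?"),
]
prompt = few_shot_prompt(examples, "I like bread.")
```

Nothing in the model changes when it reads this prompt. No weights are updated. The examples exist only in the input, and the model's next-word prediction does the rest.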

Nobody really knows why in-context learning works. At 1.5 billion parameters it barely appeared. At 175 billion it was robust. Six years later, interpretability research is still working out the mechanism.

GPT-3 was a research demo for about two years. Then OpenAI figured out how to make it behave.

2.11 Teach it to answer: ChatGPT (November 2022)

The problem with GPT-3 was that it was an autocomplete engine, not a question-answering engine. Ask GPT-3 “How do I make bread?” and it might continue with another question about bread, or a recipe written in the style of a paywalled food blog, or a list of every bread-related search query it had seen. Useful in certain research setups. Not useful as an assistant.

OpenAI’s fix was RLHF [reinforcement learning from human feedback, a training process where humans rate multiple possible answers and the model gets nudged toward the answers humans rated higher] (Christiano et al. 2017; Ouyang et al. 2022). Take GPT-3.5, fine-tune it on a smaller dataset of example instruction-following conversations, then use human preference ratings over paired outputs to push the model toward answers people actually wanted. Wrap the result in a chat interface.
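In the cited papers, the human ratings are first used to train a separate scoring model, usually called a reward model, which then steers the language model. The core of that first step is a pairwise comparison, sketched below. The function name and the example scores are assumptions for illustration. The formula is the standard pairwise preference loss, the negative log of the sigmoid of the score gap.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Small when the preferred answer already scores higher, large when the ranking is wrong."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))


agrees_with_humans = preference_loss(2.0, -1.0)     # scores rank the pair the way raters did
disagrees_with_humans = preference_loss(-1.0, 2.0)  # scores rank the pair backwards
```

Training lowers this loss across many rated pairs, so the scoring model learns to prefer what people preferred. The reinforcement-learning step then pushes the language model toward outputs that the scoring model rates highly.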

November 30, 2022: ChatGPT launched. Five days later: one million users. Two months later: an estimated hundred million. At the time, it was the fastest-adopted consumer product in history.

Before ChatGPT, if you had pasted a forty-page paper into GPT-3 and asked for five bullet points, you would likely have gotten something strange: a continuation of the paper written in the same style, or a list of questions about the paper, or output that looked like a worse version of the paper itself. After ChatGPT, you got something shaped like a summary, with bullet points. That is the contribution of RLHF. The model had not become smarter. It had learned which shape of output its users wanted.

2.12 The frontier race (2023 to 2025)

ChatGPT opened the gates. Over the next thirty months, the field ran faster than anyone expected.

Anthropic (founded in 2021 by former OpenAI researchers) shipped Claude in 2023. The emphasis was safety and constitutional AI [a training approach where the model is shaped to follow a written set of principles during its alignment phase, rather than being pushed only by raw human preferences] (Bai et al. 2022). Google shipped Gemini. Meta released the Llama family with open weights [the actual trained numbers of the model are published, so anyone can download and run the model themselves, instead of only being able to call it through a paid API]. Mistral, a Paris-based lab, pushed open weights further.

GPT-4 came in March 2023. Reasoning-class models arrived in 2024: OpenAI’s o1 could think for minutes before answering, producing long internal chains of scratch work before giving a final response. DeepSeek (a Chinese lab) shipped DeepSeek-R1 in early 2025, an open-weight reasoning model that matched the state of the art.

By early 2026, the field had stabilized into a handful of conventions: multi-hundred-billion-parameter frontier models from the big labs, strong open-weight models around ten to seventy billion parameters that run on consumer hardware, and a reasoning-model class that generates long scratch work before answering. Context windows grew from 2,000 tokens (GPT-3, 2020) to 200,000 (Claude, 2023) to one million (Gemini 1.5, 2024) to multi-million in research. In human terms: from a short email, to a novel, to a small bookshelf, to the full text of an undergraduate degree.

Capabilities grew faster than understanding. The community shipped working products for years without a coherent theory of why they worked. This is part of why this book takes a pluralist posture on contested claims: the literature genuinely disagrees about what these models do, and the disagreements are often the most informative thing.

2.13 Where we are (April 2026)

Today, four things are simultaneously true:

  1. Frontier capabilities are genuinely useful. Claude Opus 4.7, GPT-5, Gemini 2.5, and their peers do real work on real problems, including your forty-page paper summary.
  2. Costs are dropping fast. The price per token for a frontier-tier model in 2026 is roughly one-tenth what it was in 2023.
  3. Local models are competitive for many tasks. Qwen 3.5 9B, Llama 4 8B, and DeepSeek 3 all run on a single consumer GPU and handle summarization, simple tool use, and everyday questions well.
  4. We still do not know, mechanistically, why any of this works as well as it does. Interpretability research is in its early years. Reasoning models do something, but nobody has a clean account of what. Chain-of-thought helps, but nobody can predict exactly when it will stop helping.

When you paste a forty-page paper into Claude today and ask for five bullet points, every link in the chain is in use. The perceptron descendants in every transformer layer are firing. Backpropagation set their weights during training. The word embeddings turn your paper into points in a high-dimensional space. Attention lets the model look back at any page while generating the summary. The transformer architecture runs all of that fast enough to ship. Pretraining filled the model with a working picture of what papers usually look like. RLHF taught it to produce bullet points instead of continuing the paper.

Each piece was invented to solve a specific problem from the step before. You are the current beneficiary of nearly seventy years of accumulated fixes.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR. https://arxiv.org/abs/1409.0473.
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073.
Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165.
Christiano, Paul F., Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03741.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” NAACL-HLT. https://arxiv.org/abs/1810.04805.
Elman, Jeffrey L. 1990. “Finding Structure in Time.” Cognitive Science 14 (2): 179–211. https://doi.org/10.1207/s15516709cog1402_1.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” ICLR Workshop. https://arxiv.org/abs/1301.3781.
Minsky, Marvin, and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press.
Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. OpenAI. https://openai.com/research/language-unsupervised.
Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386–408. https://doi.org/10.1037/h0042519.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” Advances in Neural Information Processing Systems 27 (NeurIPS 2014). https://arxiv.org/abs/1409.3215.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762.