Quality Decay in LLM-Assisted Code | Working with Large Language Models | Synoros

Let a model write code for you without reading what it writes, and after a few rounds of small edits you will have a pile of code that nobody understands anymore.

That is the plain-language seed of the chapter. There is a particular kind of code an LLM produces when you stop reading what it writes. It runs. Its tests pass. At a glance, it looks like the code a competent engineer would have shipped. After three or four rounds of “fix this small thing,” it has quietly turned into something else: a layered, over-abstracted, inconsistently-typed pile that nobody, human or model, has a clear picture of anymore.

The argument here is not that LLM-assisted coding is bad. I use it every day, and so do most of the operators this book is for. The argument is narrower: iterative LLM code generation has a measurable quality-decay mechanism, and recognizing it is the difference between a useful tool and a polished mess you have to rewrite.

The operator frame matters. Modern coding assistants are fluent enough that the cost of skipping review has never been lower per-minute, and the cost of the accumulated skipping has never been higher per-project. That asymmetry is the whole chapter.

Our running example still applies, and this chapter pulls it into a form we have not used before. Asking an LLM to summarize a forty-page research paper is a single-shot task: one prompt in, one block of text out, and either it reads well or it does not. Asking an LLM to write the tool that summarizes the paper is different. You are now building a small script that splits the PDF into sections, calls a model on each, and stitches the answers together. Every turn after the first (“fix the heading parser,” “now add retries,” “now skip figure captions”) is a step inside the decay loop this chapter is about. The script runs. It looks reasonable. A week later it is quietly doing something you would not have written by hand, and you cannot point to the turn where that happened. Keep that script in mind while we walk through the mechanism. Every example below happens inside it.

20.1 The vibecoding phenomenon

The name is Karpathy’s. On February 2, 2025, he posted on X describing a workflow he called “vibe coding”: talk to the model, accept its diffs [the model’s proposed changes to the code, shown as removed and added lines] without reading them, copy-paste the errors back when something breaks, and let the project accrete while you focus on the feel of the thing rather than the code underneath (Karpathy 2025b). The post was celebratory. It also described, with unusual honesty, the failure mode that comes with the workflow. The code exists in a state only the model and the runtime understand. You have outsourced the mental model of your own system.

The term caught. Collins picked “vibe coding” as its 2025 word of the year, and Merriam-Webster picked “slop.” That second word, originally used for low-quality AI-generated text, had spread to code along the way: a “sloppy” codebase, “slopcode,” “vibecoded slop.” For our purposes, the two words name the two halves of one thing. Vibecoding is the workflow (talk to the model, don’t read diffs, iterate by vibes). Slop is the artifact the workflow produces (code that runs but has no coherent author, human or otherwise).

Apply this to the paper-summary tool. You start with a forty-line script. You ask the model to add retries, accept the diff, ship. You ask it to also skip figure captions, accept the diff, ship. You ask it to handle a new journal’s two-column layout, accept the diff, ship. Three turns in, the script is 180 lines, uses a queue you never asked for, imports a PDF library you have never heard of, and the retry logic has quietly moved from “wrap the API call” to “wrap the whole pipeline.” Nothing is broken. Nothing is particularly right either. That is vibecoding, and the 180-line script is slop.

Karpathy himself walked the framing back during 2025. His 2025 LLM Year in Review shifted the preferred term to “agentic engineering,” meaning the same workflow with the operator in the loop as oversight rather than passenger (Karpathy 2025a). The vocabulary shift tracks a real disciplinary shift in how serious practitioners use these tools. The original vibe-coding move stays useful for throwaway code. For anything with a future, the operator has to stay in the loop.

Vibecoding and slop are workflow-and-artifact, not model-and-bug. The model’s outputs are not the problem. A frontier model can produce excellent code. The problem is an iteration loop in which no one is grading what gets checked in. Treating this as a model-quality issue leads to the wrong fixes (switch to a bigger model, prompt harder). Treating it as a review-discipline issue leads to the right ones (read diffs, keep tests human-written, impose type discipline).

One more piece of vocabulary before the mechanism. Context contamination is what happens when the LLM’s own earlier mistakes become part of the context it looks at on the next turn, and the model begins to pattern-match against them as if they were correct. Type safety is the property of a programming language that lets a compiler [the program that translates source code to something the machine can run] or static checker [a tool that reads the code without running it and flags likely bugs] catch a category of mismatches (passing a string where a number is expected, calling a method that does not exist on an object) before the code runs. Both terms show up repeatedly in the rest of the chapter.

20.2 Why iterative generation degrades

The decay is not mysterious. Four mechanisms, each measured in the 2024 to 2026 literature, combine to produce it. You can hit any one of them and fix it. The trouble is that in a long coding session all four run at once, and they reinforce.

20.2.1 Context contamination

The first mechanism comes directly from the attention properties covered in Chapter 15. When a model iterates on a piece of code over many turns, each turn’s context contains previous turns’ code, previous turns’ error messages, and previous turns’ partial fixes. That context is not cleanly labeled as “these were mistakes, ignore them.” It is just text the model attended to, and the model’s next generation is shaped by the statistics of what came before.

Here is what that actually looks like in practice. Take the paper-summary tool. On turn one you told the model “write a script that reads a PDF, splits it by section, and summarizes each section with Claude.” It produced a clean forty-line version. On turn two, the script crashed on a figure caption. You pasted the error back and said “handle this.” The model added a try/except [a Python block that catches an error instead of letting it crash the program] around the call. On turn three, you noticed the try/except was swallowing the crash silently. You said “log the error instead.” The model added a logging line inside the same try/except, leaving the swallowing behavior in place. On turn four, you asked for retries. The model wrapped the whole thing in a retry loop, so now a silent swallow happens three times in a row instead of once.

Look at what happened. At no point did the model delete the thing that was wrong. Each turn, it looked at the context (which was mostly the broken code and the patches on top of the broken code) and added another patch. If the previous turn had a subtle type confusion the model is about to repeat, nothing in the architecture prevents the repeat. The broken approach is the context; the original spec is four turns back and has lost weight.
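The four turns above can be sketched side by side with a version written from the spec. This is a minimal illustration, not the tool itself: call_model is a stand-in for the real API call, TransientError is a hypothetical marker for failures worth retrying, and both function names are invented for the comparison.

```python
import logging
from typing import Optional

logger = logging.getLogger("paper_summary")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_model(text: str) -> str:
    """Stand-in for the real API call; rejects figure captions like the turn-two crash."""
    if text.startswith("Figure"):
        raise ValueError("cannot summarize a figure caption")
    return f"summary of: {text}"


def summarize_layered(text: str, attempts: int = 3) -> Optional[str]:
    """Where turns two to four end up: each patch sits on top of the last."""
    for _ in range(attempts):                            # turn four: retry loop
        try:                                             # turn two: try/except
            return call_model(text)
        except Exception as exc:
            logger.warning("summarize failed: %s", exc)  # turn three: log, keep swallowing
    return None                                          # the caller never sees the error


def summarize_clean(text: str, attempts: int = 3) -> str:
    """Written from the spec: retry only transient failures, let real errors surface."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return call_model(text)
        except TransientError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

On a figure caption, summarize_layered logs three warnings and returns None, while summarize_clean lets the ValueError reach the caller on the first attempt. Nothing in the layered version is a new idea on its own turn; the problem only shows when you read the result whole.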

Attention dilution (Liu et al. 2024) makes this worse in long sessions. Your clean turn-one instruction (“read a PDF, split it by section, summarize each”) now competes for attention budget with three turns of error dumps, scaffolding, and half-finished fixes. The spec fades. The scaffolding stays. The model patches scaffolding.

This is the same mechanism behind the classic “it keeps trying the broken approach” frustration. The model is not stubborn. It is correctly inferring from the context that the broken approach is the current active approach, because that is what most of its context is about.

20.2.2 Security degradation under iteration

Bhatt et al. 2025 measured this quantitatively (Bhatt et al. 2025). They set up a controlled study of LLM-driven iterative refinement on realistic code, and tracked security-relevant properties (input validation, memory handling, access control, injection surfaces) across successive iterations. After five iterations of “improve this code,” critical vulnerabilities had risen 37.6% over the starting baseline. If the code began with ten serious holes, the fifth pass had roughly fourteen. Your paper-summary tool, on its fifth “please also handle…” turn, is on the same curve.

The increase was not random noise. The pattern was that each iteration tended to add new surface area (new arguments, new error paths, new abstractions), and each added surface brought new places where checks had to be placed, and the model tended to forget to place them. The first iteration’s code was reasonably tight. The fifth iteration’s code had visible scaffolding that nobody had audited.

Concretely, in the paper-summary tool: iteration one takes a PDF path as input. Iteration three takes a PDF path, a URL, a Config object, and a list of section names to skip. Each of those added inputs is a place where an attacker (or a bad PDF) can now reach in. The model remembers to validate the new ones only when you point them out, which you will not do if you are not reading the diff.
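One way to keep the added surface audited is a single boundary function that has one check per input the script accepts, so a new argument without a matching check is visible on the diff. The sketch below is an assumption about what such checks might look like; the function name, the https-only rule, and the length limit are all hypothetical choices, and the Config object is left out.

```python
from pathlib import PurePath
from typing import Optional
from urllib.parse import urlparse

MAX_SKIP_NAME_LENGTH = 200  # assumed limit, chosen only for illustration


def validate_inputs(
    pdf_path: Optional[str],
    url: Optional[str],
    skip_sections: list[str],
) -> None:
    """Reject bad input at the boundary, one check per input the script accepts."""
    # Exactly one source: a local path or a URL, never both, never neither.
    if (pdf_path is None) == (url is None):
        raise ValueError("pass exactly one of pdf_path or url")

    if pdf_path is not None and PurePath(pdf_path).suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {pdf_path!r}")

    if url is not None:
        parsed = urlparse(url)
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError(f"only https URLs are accepted: {url!r}")

    for name in skip_sections:
        if not name.strip() or len(name) > MAX_SKIP_NAME_LENGTH:
            raise ValueError(f"bad section name: {name!r}")
```

The checks themselves are shallow. The point is structural: when iteration five adds a fifth input, the reviewer has one place to look for the fifth check.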

This is a specific finding, not a general theorem, but the direction matches everything else in this section. Iteration without review does not tighten code. It loosens it.

20.2.3 The broken telephone effect

Peng et al. 2025, at ACL 2025, framed the mechanism in purely information-theoretic terms (Peng et al. 2025). They set up a chain in which an LLM generates text, another LLM (or the same one in a fresh session) processes that text and regenerates it, and the cycle repeats. They measured how much of the original information survived each hop.

The result was monotonic distortion. Every iteration loses something. The loss is not uniform (some facts survive, some don’t), but across a range of tasks and models, the curve slopes down. By several iterations in, substantial fractions of the original signal are gone, replaced by plausible-sounding content the model generated to fill the gap.

Picture the children’s game the mechanism is named after. One child whispers “meet me at the red gate by the oak tree” to the next. Five whispers later, the last child is announcing “the red cat is by the house.” No single whisper was dishonest. Each one just reconstructed what it thought it heard, and the small errors compounded. Every iteration of your paper-summary tool is a whisper. The model is never lying about the spec; it is reconstructing what it thinks the current spec is, from a context window mostly full of previous reconstructions.

This is why refactoring with an LLM is structurally a broken-telephone setup. You tell it “do X to this code.” It does its version of X. You tell it “now do Y.” It does Y with its version of X as input, which was already a summary of your original intent, not your original intent. By turn five, the code you are iterating on is a summary of a summary of a summary of the original spec, and the spec itself is no longer in the context window with full weight.

Shumailov et al. 2024 measured the same mechanism at the training-data level, where it is called model collapse (Shumailov et al. 2024). When a model is trained on text that was itself generated by earlier models, which was generated by still-earlier models, and so on, the output distribution narrows over generations. Rare events disappear first. Tails collapse. By the end, the model can only produce a small fraction of the variance of the original training distribution. It is the same phenomenon. Recursive distillation of errors at the training scale is what broken telephone is at the session scale. The same family of loss, applied to two different substances: training corpora in one case, conversation context in the other.

20.2.5 The ceiling the benchmarks reveal

One last reality check. SWE-bench Pro, a harder successor to SWE-bench aimed at realistic long-horizon tasks across enterprise repositories, was published by Scale AI in September 2025 (Scale AI 2025). The best frontier models at the time (GPT-5 and Claude Opus 4.1) scored in the 23.1 to 23.3 percent range on solved tasks. Fewer than a quarter of realistic engineering tasks were completed end-to-end by the strongest coding models on the strongest benchmark targeted at realistic engineering. Hand the model four realistic tickets off a team’s backlog and expect to take three of them back yourself.

Run Sagan’s baloney-detection moves across that number before you let it settle. Is it reproducible? The benchmark methodology is published; Scale AI is a commercial operator, which is a reason to read the paper rather than the marketing, but the 23 percent figure has held across multiple replication attempts by third parties. Does a simpler explanation fit? The simple explanation is that realistic engineering tasks require reading a repo you did not write and making changes that do not break anything else, and models are weaker at that than at producing fresh code that looks right. Who benefits from this being true, or from it being false? The labs benefit from it being higher than it is; working engineers benefit from it being accurately reported. The figure is the labs’ own number on a benchmark they did not design. That makes it credible as an upper bound.

That is the ceiling, as of late 2025, for the autonomous version of this workflow. “LLMs can code” is true. “LLMs can complete a realistic piece of work without a careful operator” is, on the measurements we have, still not true three quarters of the time. Any workflow that pretends otherwise is running on the 23 percent and getting lucky on the rest.

Four mechanisms, one failure mode. Context contamination (the model sees its own errors and repeats them), security degradation under iteration (critical vulnerabilities rise ~38% after five refinement turns), broken-telephone distortion (monotonic information loss across iterative generation), and pattern-matching on the modal training example (fluency defaults to the most-common enterprise shape in the corpus). A single session can trigger all four. Review discipline is the only thing that catches the composite.

20.3 What vibecoded output looks like in the wild

The mechanisms above predict specific surface symptoms. If you know what to look for, you can spot a vibecoded file in about thirty seconds. Walk through them with the paper-summary tool in your head; each symptom below is something you will eventually see in that script if you never read its diffs.

Inconsistent naming across files. The model referred to a concept as UserSettings in one file, user_settings in another, user_prefs in a third, PreferencesDTO [Data Transfer Object, an object used only to pass data between layers of an application] in a fourth. Each turn of iteration it picked whatever name matched the local pattern it was looking at, and nothing reconciled them. In the paper-summary tool: the thing you call a “section” is Section in the parser, chunk in the summarizer, PaperSegment in the stitcher, and block in the retry logic. Every component works. None of them agree on what to call what they are working on. A project vibecoded for a week has three names for every important thing.

Dead-code branches left behind from earlier attempts. When the model tries one approach, fails, and tries a second approach, the first approach’s code often survives as an unused function, an unreferenced class, a set of imports pointing to nothing. The build passes because nothing calls the dead code. In the paper-summary tool: the extract_by_font_size function the model wrote on turn two when you were trying to parse section headings by font is still in the file on turn seven, even though the whole script now uses regex. It is never called. It is also never deleted.

Types that compile but do not match usage. In a language with a type checker [a tool that verifies the kinds of values flowing through the code match what each function expects], the model will add type hints [declarations like “this variable is a string” or “this function returns a list”] that satisfy the checker statically but misdescribe what actually happens at runtime. An Optional[str] that is never None, a Dict[str, Any] whose keys are a small fixed set, a Union[Foo, Bar] where only Foo is ever passed. The types compile. They do not describe the program. In the paper-summary tool: summarize_section is typed to return Optional[Summary], but it has never returned None in production; callers check for None anyway, so now there is a whole code path that cannot run but still has to be maintained.
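A minimal sketch of the summarize_section case, assuming a simple Summary dataclass and a one-line body; the _loose suffix and the render helpers are hypothetical names used only to put the two annotations next to each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Summary:
    title: str
    bullets: tuple[str, ...]


def summarize_section_loose(title: str, text: str) -> Optional[Summary]:
    """Passes the type checker, but the Optional is fiction: no path returns None."""
    return Summary(title=title, bullets=(text.strip(),))


def render_loose(title: str, text: str) -> str:
    summary = summarize_section_loose(title, text)
    if summary is None:              # dead branch the annotation forces callers to keep
        return "(no summary)"
    return f"{summary.title}: {len(summary.bullets)} bullet(s)"


def summarize_section(title: str, text: str) -> Summary:
    """Same body, with the annotation narrowed to what the function actually does."""
    return Summary(title=title, bullets=(text.strip(),))


def render(title: str, text: str) -> str:
    summary = summarize_section(title, text)
    return f"{summary.title}: {len(summary.bullets)} bullet(s)"
```

Both versions behave identically. The only difference is that the narrowed return type lets every caller delete its None check, which is the maintenance cost the loose annotation was quietly imposing.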

Tests that pass because they mirror implementation bugs. The model wrote the implementation, then wrote tests. When the implementation has a subtle off-by-one error or a wrong-default-value mistake, the tests encode the same mistake and confirm it. The test suite is green. The code is wrong. If the paper-summary tool drops the last section when it splits, and the test for the splitter checks that it returns “the right sections” by passing in a known paper and asserting against the sections the splitter produces, the test will be green on the day the bug is introduced and every day after. This is the single most dangerous symptom, because it survives “does the test pass” review by a human.
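The splitter case can be reproduced in a few lines. This is a sketch under stated assumptions: split_sections is a hypothetical splitter working on plain text rather than a PDF, and the off-by-one is planted deliberately so the two tests can be compared.

```python
def split_sections(text: str) -> list[str]:
    """Hypothetical splitter with a planted off-by-one: it drops the last section."""
    parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    return parts[:-1]          # bug: the final section is lost


PAPER = "Introduction\n\nMethods\n\nResults"


def mirrored_test() -> bool:
    """Expected value was copied from the splitter's own output, so it encodes the bug."""
    expected = ["Introduction", "Methods"]
    return split_sections(PAPER) == expected


def spec_test() -> bool:
    """Expected value was written from the paper itself."""
    expected = ["Introduction", "Methods", "Results"]
    return split_sections(PAPER) == expected


print(mirrored_test())  # True: the suite is green
print(spec_test())      # False: the code is wrong
```

The two tests differ only in where the expected list came from. That provenance is invisible in a green test report, which is why this symptom survives a "does the test pass" review.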

Import lists longer than they need to be. Imports accrete. The model added requests when it was considering an HTTP call it never made. It imported dataclasses when it was prototyping a structure that got replaced by a dict. Each unused import is tiny. Across a codebase, the imports become an archaeological record of every idea the model tried on the way to the current state. Reading the top of your paper-summary tool tells you, in order, every direction the model briefly considered going.
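Unused imports are mechanical enough to find with the standard library's ast module. The function below is a rough heuristic written for illustration, not a replacement for a real linter: it ignores re-exports, __all__, and names that appear only inside strings, and the SCRIPT sample is invented.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced as a bare name in the module."""
    tree = ast.parse(source)
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the name `a`; `import a.b as c` binds `c`
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)

    # `json.load(...)` contains a Name node for `json`, so attribute use counts as use.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(imported - used)


SCRIPT = '''
import json
import requests
from dataclasses import dataclass

def load(path):
    with open(path) as f:
        return json.load(f)
'''

print(unused_imports(SCRIPT))  # ['dataclass', 'requests']
```

Treat the hits as candidates for review. Each one is a small question worth asking on the diff: what was this for, and is the thing it was for still here?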

Over-abstracted helpers that wrap a one-liner. A function format_user_greeting(user: User) -> str that is five lines long and called in exactly one place to produce f"Hello, {user.name}". A BaseRepositoryInterface with one method and one implementation. In the paper-summary tool: a SummaryFormatter class with a format() method that is called in one place and wraps "\n".join(bullets). The abstraction had no earning purpose. It was the shape of code the model had seen.

These are not style preferences. Each is a detectable quality signal. A linter [a tool that scans source code for common mistakes and style violations] catches some. A type checker catches some. A dependency analyzer catches imports. What the tools miss, reliably, is the semantic drift: the code runs, it just does not do what the original spec asked. Your paper-summary tool produces summaries. They may be summaries of different parts of the paper than you asked for. The tests are green anyway. That residue is what careful review is for.

20.4 Counter-disciplines

The mechanisms that produce decay are well enough understood that the counter-disciplines are not guesswork. Five disciplines, in rough order of leverage, reliably keep iterative LLM code from turning into slop. None of them requires a better model. All of them require the operator to stay in the loop.

Look at them as a family. Each discipline is a place to break one of the four decay mechanisms: a type checker breaks pattern-matching on the modal example because the modal example does not type-check against your actual data shapes; grounding against the original spec breaks context contamination because the spec re-enters the context at full weight; reading diffs breaks broken-telephone because you are a fresh reader who notices what the model’s self-reconstruction smoothed over; human-written tests break the mutual-confirmation loop between implementation and test; and simplification pressure breaks the enterprise-boilerplate default by refusing to ship it. Same family of intervention, five different leak points. Applied to the paper-summary tool, the five disciplines compose into a workflow where the tool stays roughly forty lines long even after twenty turns of iteration.

20.4.1 Type-driven development

Start with the loudest structural defense. If the programming language has a strong static type system (Rust, TypeScript in strict mode, Haskell, OCaml, and to a slightly lesser degree modern Python with full mypy --strict), a large fraction of the semantic drift the model would otherwise introduce gets refused by the compiler. The wrong-optional becomes a type error. The Union with an unhandled variant becomes a type error. The helper that returns None in a path the caller treats as present becomes a type error.

In the paper-summary tool, this means typing Section as a real dataclass with title: str, text: str, page_range: tuple[int, int]. Now the model cannot quietly rename Section to chunk in one file and pass it into a function that expects the original type, because the type checker will shout. The forty-lines-vs-ninety-lines divergence we saw in the modal-example section gets caught at the boundary.
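A minimal sketch of that dataclass, using the three fields named above. The inclusive page-range convention, the validation in __post_init__, and the word_count helper are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    title: str
    text: str
    page_range: tuple[int, int]   # (first_page, last_page), inclusive; an assumed convention

    def __post_init__(self) -> None:
        first, last = self.page_range
        if first < 1 or last < first:
            raise ValueError(f"invalid page_range: {self.page_range}")


def word_count(section: Section) -> int:
    """Every stage takes a Section, so a stray `chunk: dict` fails the type check."""
    return len(section.text.split())


intro = Section(title="Introduction", text="We study X.", page_range=(1, 2))
print(word_count(intro))  # 3
```

Python does not enforce the annotation at runtime, so the protection comes from running a checker such as mypy over the project: a call like word_count({"text": "..."}) is then reported as an incompatible argument before the script ever runs.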

This is not a new observation about type systems. It predates LLMs by four decades. What is new is that the asymmetry between “easy to write wrong code” and “easy to write right code” has shifted. In pure human engineering, writing untyped Python is fast because it saves keystrokes; writing Rust is slow because the borrow checker [Rust’s compile-time guard that refuses to let the same piece of data be modified from two places at once] fights you. With LLM assistance, the keystroke savings collapse (the model writes them either way), and the “borrow checker fights you” cost becomes “borrow checker fights the model for you.” That flips the ergonomics. Strong types are now a net productivity win in almost every workflow where an LLM is doing the typing.

The evidence on LLM-specific type-directed coding is still emerging, so I am not going to overclaim here. The broader software-engineering literature is clear that strong static types catch a significant fraction of real bugs in real codebases. What I will commit to is this: when an operator adds a type checker to a vibecoded Python project, they find real problems. Every time. The checker is a free second reviewer.

20.4.2 Smaller prompts, frequent grounding

Context contamination (Section 20.2.1) is the mechanism that does the most damage in long iterative sessions. The direct counter is to keep sessions short and re-ground them against the original spec frequently.

Operational form: at each significant stage, start a fresh session with a clean context window containing only (a) the current code, (b) the original specification, and (c) the immediate task. Do not let a session carry forward ten turns of partial fixes. Treat context continuity as a cost, not a benefit.

In the paper-summary tool, this looks concrete. After the retry logic goes in on turn four, close the chat. Open a new one. Paste in the current script, the one-paragraph spec you started with (“script that reads a PDF, splits by section, summarizes each section, writes the summaries to stdout”), and the new request (“skip figure captions”). The model’s first response is now shaped by the spec, not by the accumulated mistakes from the previous session. The spec is back in context at full weight. The scaffolding the model was patching on top of has been thrown out along with the old conversation.
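The fresh-session habit is easier to keep when the opening message is assembled the same way every time. The helper below is a hypothetical sketch; the function name and the three headings are my own choices, and nothing here depends on a particular assistant or API.

```python
SPEC = (
    "Script that reads a PDF, splits by section, summarizes each section, "
    "writes the summaries to stdout."
)


def build_grounded_prompt(spec: str, current_code: str, task: str) -> str:
    """Assemble the only three things a fresh session should see, spec first."""
    parts = [
        ("ORIGINAL SPEC", spec),
        ("CURRENT CODE", current_code),
        ("TASK FOR THIS SESSION", task),
    ]
    for heading, body in parts:
        if not body.strip():
            raise ValueError(f"{heading.lower()} must not be empty")
    return "\n\n".join(f"{heading}\n{body.strip()}" for heading, body in parts) + "\n"


prompt = build_grounded_prompt(SPEC, "print('placeholder script')", "skip figure captions")
```

The function deliberately has no parameter for conversation history. If carrying something forward from the old session feels necessary, that is a sign it belongs in the spec.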

This contradicts the default UX [user experience, the feel of interacting with the product] of most coding assistants, which lean hard on long-running conversations to feel “smart.” The default is tuned for demo-friendly stickiness, not for code quality. Overriding it takes deliberate operator practice. The payoff is large.

20.4.3 Human reading of every diff

The vibecoding anti-pattern that Karpathy named is exactly “don’t read diffs.” Reading them, even quickly, catches most drift before it accumulates. This is the most-dismissed discipline of the five and the one with the highest return on time.

The operator practice is simple: run the model, receive a diff, read it. Not a deep code review, not a line-by-line audit, just a first-pass read that notices “wait, why did it add that class” or “why are there now three files for this.” In the paper-summary tool, the thirty-second read catches the moment when a config-file request turned into a ConfigLoader abstract class hierarchy, right there on the diff, before the class hierarchy gets written out and imported in three places. The drift points are almost always visible on the surface if you look. The reason vibecoded code accretes is not that the drift is subtle. It is that nobody looked.

The thirty-second diff read has a specific failure mode: once a diff is big enough that nobody reads it, everything after that diff is tainted context. The practical discipline is to refuse to accept diffs large enough that you skip reading. If the model produced a 600-line change and you are not going to read 600 lines, split the change.
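A size gate can be as simple as counting changed lines before accepting a change. The sketch below uses the standard library's difflib; the 150-line limit is an arbitrary assumption, and the right number is whatever you will actually read.

```python
import difflib


def changed_lines(old: str, new: str) -> int:
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    count = 0
    in_hunk = False
    for line in diff:
        if line.startswith("@@"):
            in_hunk = True            # skips the ---/+++ header before the first hunk
        elif in_hunk and line[:1] in ("+", "-"):
            count += 1
    return count


def small_enough_to_read(old: str, new: str, limit: int = 150) -> bool:
    """True when the change fits an assumed per-diff reading budget."""
    return changed_lines(old, new) <= limit


print(changed_lines("a\nb\nc", "a\nB\nc\nd"))  # 3: one line removed, two added
```

When the gate fails, the response is not to read faster. It is to ask the model for the same change in smaller steps.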

20.4.4 Tests written by humans, before implementation

LLMs write tests that mirror implementation bugs because they are generating both from the same statistical process on the same context. The test is not an independent check. It is a second draft of the same guess.

The counter is test-driven development [the practice of writing the test before writing the code it tests], in its oldest and plainest form. A human writes the test against the spec, before the implementation exists. The test is explicit about what the output should be for a given input. The model’s job, then, is to produce an implementation that passes the independently-written test. This breaks the feedback loop that produces mutually-confirming slop, because the test and the implementation no longer share a generation process.

Applied to the paper-summary tool: before asking the model for the section splitter, you write a test that says “given this specific three-section PDF, the splitter returns exactly these three strings.” You hand the test to the model along with the spec and ask for the implementation. Now if the splitter drops the last section, the test goes red, and the model has to fix it. The bug cannot hide in a test that the same model wrote alongside the implementation.
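In code, the order of work looks roughly like this. The fixture is plain text standing in for the three-section PDF, and check_splitter and split_sections are hypothetical names; the candidate implementation at the bottom represents whatever the model hands back.

```python
from typing import Callable

# Step 1: written by a human from the spec, before any splitter exists.
THREE_SECTION_PAPER = (
    "1 Introduction\nWe study X.\n\n"
    "2 Methods\nWe did Y.\n\n"
    "3 Results\nZ improved."
)
EXPECTED_SECTIONS = [
    "1 Introduction\nWe study X.",
    "2 Methods\nWe did Y.",
    "3 Results\nZ improved.",
]


def check_splitter(split: Callable[[str], list[str]]) -> None:
    """The acceptance test: any implementation must return exactly these strings."""
    actual = split(THREE_SECTION_PAPER)
    assert actual == EXPECTED_SECTIONS, f"expected {EXPECTED_SECTIONS!r}, got {actual!r}"


# Step 2: a candidate implementation, written only after the test is fixed.
def split_sections(text: str) -> list[str]:
    return [part.strip() for part in text.split("\n\n") if part.strip()]


check_splitter(split_sections)  # passes silently; a splitter that drops a section raises
```

The expected strings were typed out by someone looking at the paper, not at the splitter's output. That is the whole mechanism: the test and the implementation no longer share a source.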

This is unpopular because it is slow on the happy path. For short-lived throwaway code, do not bother. For anything with a future, the cost of writing the test first is less than the cost of later discovering that your test suite has been encoding the bug you were trying to catch.

20.4.5 Aggressive simplification pressure

The pattern-matching-on-modal-training-example mechanism (Section 20.2) makes LLMs reach for over-abstraction. The direct counter is to refuse it, on review.

Operational form: at every review pass, ask “is this abstraction earning its weight?” For helpers called in one place, collapse. For interfaces with one implementation, collapse. For layers of indirection whose only purpose is to look enterprise-y, collapse. The model will not feel offended. In the paper-summary tool, the SummaryFormatter class gets replaced by a single "\n".join(bullets). The ConfigLoader gets replaced by json.load(open("config.json")). The three layers of abstraction between the PDF reader and the Claude API call get inlined into four lines. The script goes from 180 lines back to 60. Nothing that mattered is lost.
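The SummaryFormatter collapse, sketched as a before and after. The class body is an assumption about what such a helper typically looks like; only the class name, the format() method, and the join come from the example above.

```python
class SummaryFormatter:
    """Before: a class, a configurable separator, and exactly one call site."""

    def __init__(self, separator: str = "\n") -> None:
        self.separator = separator

    def format(self, bullets: list[str]) -> str:
        return self.separator.join(bullets)


def render_before(bullets: list[str]) -> str:
    return SummaryFormatter().format(bullets)


def render_after(bullets: list[str]) -> str:
    """After: the one line the abstraction was wrapping."""
    return "\n".join(bullets)


bullets = ["Splits the paper by section.", "Summarizes each section."]
assert render_before(bullets) == render_after(bullets)
```

The separator parameter is the tell. Nobody asked for a configurable separator, and no caller passes one, so it is complexity with no user.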

This is the discipline that most benefits from operator taste, because the question of whether an abstraction earns its weight is genuinely judgment. An operator who has not yet developed that judgment can approximate it with two rules. First, Kernighan’s: “Debugging is twice as hard as writing the code in the first place, so if you write the code as cleverly as possible, you are by definition not smart enough to debug it.” Second, a harder version aimed at LLM output: anything the model added that you did not ask for should be justified or removed.

None of these disciplines requires a better model. They require the operator. Modern coding assistants are fluent enough that skipping them is nearly frictionless, which is exactly why they matter more than they used to. The discipline is cheap. The lack of it is what produces the slop.

20.5 When vibecoding is appropriate

I have argued for discipline hard enough that the chapter would be unbalanced without the other half. Vibecoding is a real tool with real uses. The operator’s job is to classify the project correctly, not to apply rigor uniformly.

The classification axis I find most useful is lifetime × consequence of failure. Carry the paper-summary tool through each box and notice which version of the tool you are actually building.

Short lifetime, low consequence. Vibecoding is fine, often the right tool. A one-off data-cleaning script, a UI prototype you will show once and throw away, an internal tool only three people will ever run, a research spike to see if an approach is even worth pursuing. If the paper-summary tool is a ten-minute experiment to see whether splitting a paper by section even improves summaries over feeding it the whole thing, vibecode it. Read enough of the diff to confirm it is not doing something obviously stupid, and ship. The cost of the slop in this class is close to zero because nobody will read the code after you are done with it.

Long lifetime, low consequence. A hobby project, a blog platform, an automation that runs for yourself. If the paper-summary tool is something you are going to run on your own inbox of PDFs every week, and the worst outcome is that one week’s summaries are bad, this is the class you are in. The consequence of failure is embarrassment or a small amount of re-work. Vibecoding is still viable, but the discipline ladder starts kicking in. Type the thing. Keep tests. Do not let a session accumulate for more than a few turns without grounding. This is the class where the AirPods helper lives, a Rust experiment where the library use is interesting and the consequences of a bug are a crashed audio daemon.

Short lifetime, high consequence. A one-time database migration, a script that emails customers, a financial reconciliation run. If the paper-summary tool is being used tonight, once, to summarize the papers your team is going to cite in a clinical-trial protocol being filed tomorrow, do not vibecode this. The code is short-lived, so the accumulation-of-slop argument is weak, but the consequence of a bug is large enough that every one of the disciplines applies on the initial write. Short-lived high-consequence code wants all five counter-disciplines compressed into the one session in which the code exists.

Long lifetime, high consequence. Production systems, customer-facing infrastructure, anything on the critical path. If the paper-summary tool becomes a feature inside a product that paying customers rely on to summarize legal filings, all five disciplines, always. This is the Synoros-platform class, the Costa-router class, the CostaBox class. The cost of a vibecoding session leaking slop into these codebases is months of later rework. The asymmetry is too large to trade for session-level speed.
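The four boxes can be written down as a small lookup, which is sometimes enough to force the classification to happen before the first prompt. This is a sketch of the grid as described above; the enum names, the discipline labels, and the pick_workflow function are all hypothetical.

```python
from enum import Enum


class Lifetime(Enum):
    SHORT = "short"
    LONG = "long"


class Consequence(Enum):
    LOW = "low"
    HIGH = "high"


ALL_FIVE = (
    "type-driven development",
    "small prompts, frequent grounding",
    "read every diff",
    "human-written tests before implementation",
    "aggressive simplification pressure",
)


def pick_workflow(lifetime: Lifetime, consequence: Consequence) -> tuple[str, ...]:
    """Map a project's box on the lifetime x consequence grid to its disciplines."""
    if consequence is Consequence.HIGH:
        return ALL_FIVE   # short or long lifetime: all five apply
    if lifetime is Lifetime.LONG:
        # Long-lived, low-consequence: the three rungs named for this box.
        return ("type the code", "keep tests", "re-ground the session every few turns")
    # Short-lived, low-consequence: vibecode, with a sanity check on the diff.
    return ("skim the diff for anything obviously wrong",)


print(len(pick_workflow(Lifetime.SHORT, Consequence.HIGH)))  # 5
```

Consequence is checked first on purpose: a high-consequence project gets all five disciplines regardless of how long the code will live.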

The simplest honest position is that vibecoding is a speed tool for low-stakes code, and the only mistake is applying it to code that turned out to be higher-stakes than you realized. The second-simplest is that most code an operator writes turns out to be higher-stakes than they realized, so the prior should favor discipline, and the permission to vibecode should be an explicit decision for explicitly-throwaway work.

Classify the project. Then pick the workflow. Short life, low consequence: vibecode, read enough diffs to avoid the obvious, ship. Everything else: type it, test it, read every diff, ground the session against the original spec, and cut every abstraction that does not earn its weight. The model is a strong collaborator in both modes; the operator’s judgment is the part that does not generalize between them.
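The classification can be written down as a small lookup. What follows is a minimal sketch, assuming Python; `Lifetime`, `Consequence`, `Workflow`, and `pick_workflow` are illustrative names, not part of any tool this chapter describes, and the discipline lists are this sketch's reading of the four class descriptions rather than a fixed rule.

```python
from dataclasses import dataclass
from enum import Enum


class Lifetime(Enum):
    SHORT = "short"
    LONG = "long"


class Consequence(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Workflow:
    vibecode_ok: bool              # is an uncritical session acceptable at all?
    disciplines: tuple[str, ...]   # what the operator commits to up front


# The five counter-disciplines named in this chapter.
ALL_FIVE = (
    "strong types",
    "smaller prompts, frequent grounding",
    "read every diff",
    "human-written tests",
    "simplification pressure",
)


def pick_workflow(lifetime: Lifetime, consequence: Consequence) -> Workflow:
    """Map the two-axis classification onto a workflow (illustrative only)."""
    if consequence is Consequence.HIGH:
        # High consequence: all five, regardless of how long the code lives.
        return Workflow(vibecode_ok=False, disciplines=ALL_FIVE)
    if lifetime is Lifetime.LONG:
        # Long-lived, low consequence: vibecoding is viable, but the ladder starts.
        return Workflow(
            vibecode_ok=True,
            disciplines=(
                "strong types",
                "human-written tests",
                "smaller prompts, frequent grounding",
            ),
        )
    # Short-lived, low consequence: read enough diffs to avoid the obvious, ship.
    return Workflow(vibecode_ok=True, disciplines=("skim diffs for the obvious",))


if __name__ == "__main__":
    print(pick_workflow(Lifetime.SHORT, Consequence.HIGH))
```

Writing the decision as a function makes the one real input explicit: the operator still has to supply the two arguments honestly, and that is the judgment the code cannot make.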

20.6 The operator’s reframe

The quality decay this chapter describes is not a flaw in the models. Frontier models are strong enough to produce excellent code. The decay happens in the iteration loop, and the iteration loop is ours. We choose whether to read the diff. We choose how long the session runs before grounding. We choose whether the test suite is written by a human or by the same model that wrote the implementation. We choose whether the abstraction earns its weight.

The cultural framing of “vibecoding” and “slop” in 2025 was useful because it named a workflow that had gone unnamed for two years. Named things can be discussed. Discussed things can be disciplined. The workflow is now fully visible; the discipline is what comes next.

Two practical take-aways, stated plainly. First, the mechanisms of decay (context contamination, security degradation under iteration, broken-telephone distortion, pattern-matching on the modal training example) are all real, all measured, and all compounding when they run together in a long uncritical session. Second, the counter-disciplines (strong types, smaller prompts, diff reading, human-written tests, simplification pressure) are cheap and work. The failure to apply them is the entire gap between LLM-assisted code that holds up and LLM-assisted code that has to be rewritten.

None of this is exciting. The exciting part was the model. The boring part is what an operator has to bring around it to ship code that lasts. Getting comfortable with the boring part is the difference between the workflow being a multiplier on your engineering and being a multiplier on your rework.

Bhatt, Ashutosh et al. 2025. Security Degradation in Iterative AI Code Generation: A Systematic Analysis of the Paradox. IEEE-ISTAS 2025, arXiv:2506.11022. https://arxiv.org/abs/2506.11022.

Karpathy, Andrej. 2025a. 2025 LLM Year in Review. Personal blog (karpathy.bearblog.dev). https://karpathy.bearblog.dev/year-in-review-2025/.

Karpathy, Andrej. 2025b. Vibe Coding (X Post). X (Twitter), February 2, 2025. https://x.com/karpathy/status/1886192184808149383.

Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.

Peng, Hui et al. 2025. “LLM as a Broken Telephone: Iterative Generation Distorts Information.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. https://aclanthology.org/2025.acl-long.371/.

Scale AI. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Scale AI, arXiv:2509.16941. https://arxiv.org/abs/2509.16941.

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. “AI Models Collapse When Trained on Recursively Generated Data.” Nature 631. https://www.nature.com/articles/s41586-024-07566-y.
