Further Reading | Working with Large Language Models | Synoros

Grouped by the chapter where the thread started. If a chapter made you want to go deeper, start here.

Each entry is a pointer, not a summary. The annotation says what the reading adds beyond what the chapter covered, so you can spend tokens on the right paper instead of the wrong one. Tier annotations (T1, T2, T3, T4) follow the same conventions as the rest of the book; see Appendix A2.

Prologue

(Karpathy 2023a) Intro to Large Language Models. If the prologue’s “scaled-up autocomplete” framing landed and you want a one-hour orientation video before going chapter-by-chapter, this is the canonical visual on-ramp.
(Hugging Face 2026) Hugging Face LLM Course. The freely available spine course; useful if you want a hands-on companion that runs alongside this book rather than after it.

Part I: The Reality

Chapter 01: What LLMs Aren’t

(Bender and Koller 2020) Climbing towards NLU. The form-without-grounded-meaning argument that predates GPT-3 and predicts most of the failure modes the chapter catalogs. Worth reading not because it has been “vindicated” but because it shows the failure was anticipated by linguists before the scaling crowd noticed.
(Schaeffer et al. 2023) Are Emergent Abilities a Mirage? If the chapter’s anti-hype framing felt strong, this NeurIPS Outstanding Paper is the technical receipt: most “emergent” capabilities disappear under continuous metrics. Read this before you let anyone tell you a benchmark jump is a phase transition.
(Liang et al. 2025) Machine Bullshit. Pushes past “hallucination” into a four-form taxonomy (empty rhetoric, paltering, weasel words, unverified claims) and shows RLHF increases the Bullshit Index while user satisfaction goes up. Read after the chapter to understand why “the model sounds great” is itself a failure signal.
(Weng 2024) Extrinsic Hallucinations in LLMs. Lilian Weng’s literature pre-digest. If you want to traverse the hallucination corpus without reading every paper end to end, start here, then dive into (Kalai et al. 2025) and (Ravichander et al. 2025).

Chapter 02: How We Got Here

(Vaswani et al. 2017) Attention Is All You Need. The 2017 paper itself. Read it after the chapter, not before; the chapter gives you the conceptual scaffold that makes the paper readable in an hour instead of a week.
(Radford et al. 2019) Language Models are Unsupervised Multitask Learners (GPT-2). The pivot point where “scale plus next-token prediction” got taken seriously as a research program. Short, readable, and important context for why GPT-3 felt inevitable in retrospect.
(Stanford University 2025) CS336: Language Modeling from Scratch. Stanford’s 2025 refresh. If the chapter made you want to actually build the thing rather than read about it, this is the course.
(Karpathy 2023b) Let’s build GPT: from scratch, in code, spelled out. Two hundred lines of code, one video. The fastest way to feel the architecture in your hands after reading the historical narrative.

Chapter 03: The Cognitive Model

(Lambert 2026) Reinforcement Learning from Human Feedback. Nathan Lambert’s book-length treatment. The single best contemporary reference for the alignment layer the chapter sketches; go here when you want the actual loss functions and not just the conceptual story.
(Bender and Koller 2020) Climbing towards NLU. Counterweight to any chapter that treats “understanding” or “meaning” as something the model does. Read it if the cognitive-model framing tempted you into thinking the model is doing something semantically richer than it is.
(Templeton et al. 2024) Scaling Monosemanticity. Anthropic’s mechanistic-interpretability work on Claude 3 Sonnet. The honest counterpoint to “facts are not keyed storage”: some features really do localize. Useful if you want to push back on the chapter’s framing in good faith.
(Karpathy 2023b) Let’s build GPT. If the cognitive model still feels abstract after the chapter, watch one engineer build it from nothing. The mystery dissolves.

Chapter 04: The Physical Model

(Williams et al. 2009) Roofline: An Insightful Visual Performance Model. The chapter uses the roofline language; this is the original 2009 CACM paper. Short, textbook-quality, and the right reference if you want the model’s history rather than its restatement.
(Stanford University 2025) CS336. The systems-side lectures cover the same ground the chapter walks but with HBM bandwidth tables and kernel-level walkthroughs. Use it when you want numbers next to the prose.
(Kwon 2025) vLLM: An Efficient Inference Engine (2025 tech report). The full-system writeup that goes beyond the PagedAttention paper. The chapter cites the SOSP paper; this report is the modern follow-up with the engineering details.
(Lienhart 2024) LLM Inference Series 5: Dissecting model performance. Pierre Lienhart’s arithmetic-intensity walkthrough at AWS. Pair with the chapter when you want to internalize the numbers rather than just the framing.
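The roofline language these readings share reduces to one line of arithmetic: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch follows, assuming a hypothetical accelerator with round numbers (100 TFLOP/s, 1 TB/s); none of the figures come from the chapter or from a real part.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline ceiling in FLOP/s for a kernel with the given arithmetic intensity.

    intensity is FLOPs performed per byte moved to or from memory.
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOPs/byte) at which a kernel stops being bandwidth-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator: round numbers for illustration, not a real part.
PEAK_FLOPS = 100e12     # 100 TFLOP/s
MEM_BANDWIDTH = 1e12    # 1 TB/s

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
low_intensity = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 2.0)
high_intensity = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 500.0)

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"intensity 2:   {low_intensity / 1e12:.0f} TFLOP/s (bandwidth-bound)")
print(f"intensity 500: {high_intensity / 1e12:.0f} TFLOP/s (compute-bound)")
```

Below the ridge point, adding compute does nothing; only moving fewer bytes per FLOP helps. That is the reading to carry into the papers above.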

Part II: The Interface

Chapter 05: Prompting

(Wei et al. 2022) Chain-of-Thought Prompting Elicits Reasoning. The chapter mentions CoT in passing; this is the founding paper, and reading it shows how thin “elicits reasoning” actually is once you look at the tasks.
(Dziri et al. 2023) Faith and Fate. The complement to chain-of-thought boosterism: prompting cannot get the model past compositional limits. Read this before you over-invest in elaborate prompt scaffolds.

Chapter 06: System Prompts

(Ouyang et al. 2022) Training Language Models to Follow Instructions (InstructGPT). The chapter takes “the model follows instructions” as table stakes; this is the paper that made it true. Worth reading to understand exactly what the system prompt is leveraging.
(Bai et al. 2022) Constitutional AI. Anthropic’s RLAIF approach. Useful context for why Claude’s system prompts feel different from GPT’s: the alignment substrate underneath them is trained differently.

Chapter 07: Few-Shot Priming

(Brown et al. 2020) Language Models are Few-Shot Learners (GPT-3). The paper that named “in-context learning” and made it a product feature. Read it for the framing, not for state-of-the-art results.
(Schaeffer et al. 2023) Are Emergent Abilities a Mirage? Directly relevant: many “few-shot” capabilities scale smoothly and only look like phase transitions because the metric is discontinuous.
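The metric argument in (Schaeffer et al. 2023) can be felt with a toy calculation. The sketch below assumes each token of a 10-token answer is correct independently with the same probability, a simplification chosen for illustration rather than the paper's exact setup. Per-token accuracy rises smoothly; exact-match accuracy on the whole answer looks like a sudden jump.

```python
def exact_match(per_token_acc: float, length: int) -> float:
    """Probability that all `length` tokens are right, assuming independent errors."""
    return per_token_acc ** length


ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# A smooth, steady improvement in per-token accuracy (illustrative values).
per_token_curve = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
exact_curve = [exact_match(p, ANSWER_LENGTH) for p in per_token_curve]

print("per-token  exact-match")
for p, em in zip(per_token_curve, exact_curve):
    print(f"{p:9.2f}  {em:11.4f}")
```

The same underlying improvement reads as "nothing, nothing, nothing, breakthrough" under the all-or-nothing metric, which is the pattern to check for before calling a benchmark jump a phase transition.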

Chapter 08: Structured Outputs

The chapter’s references (XGrammar, Outlines, JSON Schema benchmarks, Instructor) are the live frontier; nothing in this book’s bibliography extends them. Stay with the chapter’s citations.

Chapter 09: Sampling and Temperature

(Holtzman et al. 2020) The Curious Case of Neural Text Degeneration. The chapter cites it; if you skipped the original, read it now. Nucleus sampling, repetition collapse, and the empirical case for why greedy decoding is not free.
(Sennrich et al. 2016) Neural Machine Translation of Rare Words with Subword Units (BPE). Sampling decisions interact with tokenization in ways the chapter sketches; this is the reference for what the tokens actually are.
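For readers who want the nucleus-sampling idea from (Holtzman et al. 2020) in concrete form before opening the paper, here is a minimal sketch. The four-token vocabulary and its probabilities are invented for illustration, and the function names are hypothetical rather than any library's API.

```python
import random


def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-probable tokens whose mass reaches top_p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample_top_p(probs: dict[str, float], top_p: float, rng: random.Random) -> str:
    """Draw one token from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


# Toy next-token distribution (illustrative values only).
next_token_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}

nucleus = nucleus_filter(next_token_probs, top_p=0.9)
print(sorted(nucleus))  # the low-probability tail token is cut
print(sample_top_p(next_token_probs, top_p=0.9, rng=random.Random(0)))
```

Setting `top_p` close to 0 collapses the nucleus to the single most likely token, which is greedy decoding; the paper's argument is about what is lost at that extreme.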

Part III: Agency and Augmentation

Chapter 10: RAG

(Kalai et al. 2025) Why Language Models Hallucinate. The chapter cites it; the statistical-classification framing is the cleanest argument for why retrieval helps and where it stops helping. Re-read after the chapter, not before.
(Templeton et al. 2024) Scaling Monosemanticity. The mechanistic counterpoint to “the model has no internal facts to retrieve from”: some features genuinely do localize. Honest reading after the RAG chapter sharpens what retrieval is and is not for.
(Weng 2024) Extrinsic Hallucinations. Orientation document for the broader hallucination literature the chapter touches; useful before diving into individual papers.

Chapter 11: Chunking Strategy

(N. F. Liu et al. 2024) Lost in the Middle. The chapter cites it. If the chunking discussion left you wanting evidence that “advertised context” and “usable context” are different things, this is the empirical case. Read the curves, not just the abstract.

Chapter 12: Tool Use and MCP

(Kalai et al. 2025) Why Language Models Hallucinate. Tool use is partly an answer to the question this paper poses. After the chapter, this gives you the theoretical motivation rather than the engineering surface.
(Willison, n.d.) Simon Willison’s prompt-injection tag corpus. Tool use is the attack surface; this is the running archive of how it actually breaks in the wild. Subscribe rather than read once.

Chapter 13: Skills, Hooks, and Orchestration

(Kalai et al. 2025) Why Language Models Hallucinate. Same recommendation as the tool-use chapter; orchestration adds layers, and each layer inherits the underlying statistical-classification problem.
(Willison, n.d.) Willison’s tag corpus. Orchestration multiplies surfaces; the corpus is the field log of what gets exploited and how.

Chapter 14: Memory Systems

(Kwon et al. 2023) PagedAttention / vLLM (SOSP 2023). The chapter treats KV cache as the substrate of short-term “memory”; this is the systems paper that defined how KV cache is actually managed. Read when you want to understand the layer beneath the memory abstractions.
(N. F. Liu et al. 2024) Lost in the Middle. Long-context “memory” is not uniform across positions. Worth re-reading specifically as a memory paper, not a retrieval paper.

Part IV: The Crucible

Chapter 15: Failure Modes

(Dziri et al. 2023) Faith and Fate. The compositionality wall the chapter points at, with the actual evidence. Required reading if you ever have to argue with someone who thinks the next scale-up will fix the failure modes.
(Mirzadeh et al. 2025) GSM-Symbolic. Accuracy drops of up to 65% from a single irrelevant clause added to a grade-school math problem. The strongest single piece of brittleness evidence to hand someone who is over-trusting a reasoning model.
(Shojaee et al. 2025) The Illusion of Thinking. Apple’s complexity-collapse result on o3-mini and DeepSeek-R1. Pair with (Lawsen 2025) for the rebuttal-and-walkback arc; the controversy itself is informative.
(Liang et al. 2025) Machine Bullshit. The non-hallucination failure mode. RLHF nearly doubles the Bullshit Index. Belongs in this chapter’s reading list because the failure is structural, not random.

Chapter 16: Evals and Observability

(Chiang et al. 2024) Chatbot Arena. The industry-standard human-preference harness. The chapter focuses on internal eval infrastructure; this is the external benchmark you cannot ignore even when you should.
(Artificial Analysis 2025) AA-Omniscience. Cross-domain knowledge reliability over 6,000 questions and 42 topics. Cite with snapshot date. Useful as a reality check on any single benchmark’s claims.
(Vectara 2026) Vectara’s hallucination leaderboard. Continuously updated; the right place to look when a vendor claims hallucination-free output.

Chapter 17: Multi-Model Routing

(Chiang et al. 2024) Chatbot Arena. Routing decisions implicitly assume some preference ordering; Arena gives you a defensible one. The chapter touches this; the paper is the methodology.
(Artificial Analysis 2025) AA-Omniscience. Cross-domain knowledge reliability across 42 topics. Routing on per-topic strength is only as good as the per-topic measurement; this is the current best public one.

Chapter 18: Cost-Tiered Deployment

(Rajput et al. 2025) Production-Grade Local LLM Inference on Apple Silicon. The chapter cites it for the local-viable claim; the paper’s per-engine numbers (MLX ~230 tok/s, llama.cpp ~150 tok/s on M2 Ultra 192GB) are the right anchor when you are sizing local-vs-cloud break-even.
(Kwon 2025) vLLM tech report. If the chapter pushed you toward self-hosted inference, this is the production engine. Read for serving-side cost characteristics (batching, KV-cache memory, prefix caching) that change the cost math.

Chapter 19: Security and Prompt Injection

(Willison, n.d.) Willison’s tag corpus. The chapter draws on it; subscribe for the running archive. Prompt injection is a moving field and a static reading list ages in months.
(Kalai et al. 2025) Why Language Models Hallucinate. Some prompt-injection failures are not adversarial inputs but the model’s underlying statistical behavior surfacing. Useful framing when you triage incidents.

Chapter 20: Quality Decay

(N. F. Liu et al. 2024) Lost in the Middle. Quality decay over context length is the cleanest empirical case the chapter relies on. Read the original curves, not the summaries.
(Kalai et al. 2025) Why Language Models Hallucinate. Statistical-classification framing is also the cleanest framing for “why does the model degrade as the prompt grows.” The chapter’s empirical observations land harder once this is in your head.

Part V: Bare Metal

Chapter 21: Local vs Cloud

(Rajput et al. 2025) Apple Silicon Comparative Study. The chapter cites it as the local-viability anchor; the paper’s per-engine numbers are the work product, not the abstract.
(Kwon 2025) vLLM tech report. The serving-side engine on the cloud side of the comparison. Reading both (Rajput et al. 2025) and this report side-by-side is the most efficient way to see what each side actually costs.
(NVIDIA 2024) Mastering LLM Techniques: Inference Optimization. NVIDIA’s own measurements; useful as a vendor counterpoint when you are pricing cloud inference.
(Baseten 2024) A guide to LLM inference and performance. Engineering blog from a hosting vendor. Not load-bearing but a clean cost-side companion read.

Chapter 22: Quantization Depth

(Dettmers et al. 2022) LLM.int8(). The chapter cites it; the original paper is the right read for understanding why outlier features matter, which is the theme that runs through every quantization paper since.
(Frantar et al. 2023) GPTQ. Same recommendation: the chapter cites it, but the paper’s mechanics are worth a careful read once you are deep in the quant literature.
(Zandieh et al. 2026) TurboQuant. The chapter covers it at depth; the paper itself is the right primary source for the random-rotation-plus-optimal-scalar argument and the 3-bit KV-cache result.
(Chen et al. 2025) EfficientQAT. 2-bit Llama-70B on a single A100 in 41 hours. Read after the chapter when you want to understand QAT’s practical envelope rather than just PTQ’s.
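Why outliers matter is easy to see in a few lines. The sketch below uses plain absmax int8 quantization on invented values; it is not the LLM.int8() method itself, only the problem that method is built to avoid. One large value stretches the quantization scale and wipes out the resolution available to everything else.

```python
def absmax_roundtrip(values: list[float], bits: int = 8) -> list[float]:
    """Quantize to signed integers with one shared absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for int8
    scale = max(abs(v) for v in values) / qmax   # one scale for the whole vector
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    """Average absolute round-trip error over the entries of `original`."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Illustrative activations: eight small values, then the same with one outlier.
small = [0.1 * i for i in range(1, 9)]           # 0.1 .. 0.8
with_outlier = small + [60.0]

err_clean = mean_abs_error(small, absmax_roundtrip(small))
# Measure error on the small values only, after the outlier has set the scale.
err_outlier = mean_abs_error(small, absmax_roundtrip(with_outlier)[: len(small)])

print(f"error on small values, no outlier:   {err_clean:.5f}")
print(f"error on small values, with outlier: {err_outlier:.5f}")
```

Every method in this chapter's list is, in one way or another, a strategy for keeping that one large value from setting the scale for its neighbors.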

Chapter 23: VRAM Math and GPUs

(Williams et al. 2009) Roofline. The chapter uses bandwidth-bound vs compute-bound framings constantly; the original roofline paper is the cleanest treatment.
(Anonymous 2025c) Rethinking LLM Inference Bottlenecks. MLA shifts attention toward compute-bound; MHA and GQA remain bandwidth-bound. The chapter touches this; the paper is the empirical case.
(Anonymous 2025a) Bandwidth, Compute, Synchronization, and Capacity are all you need. Four-constraint limit study. A useful reframe once you have the per-component numbers from the chapter.
(Anonymous 2025b) Mind the Memory Gap. 2025 empirical measurement of DRAM-bandwidth dominance in large-batch inference. Pair with the chapter’s VRAM math when you want the bandwidth side, not just the capacity side.

Chapter 24: TurboQuant and Newer Methods

(Chen et al. 2025) EfficientQAT. The chapter is PTQ-heavy; this is the QAT companion at production scale. 2-bit, 70B, single A100, 41 hours, under three accuracy points lost.
(Z. Liu et al. 2024) SpinQuant. Learned rotations via Cayley optimization. The chapter introduces the rotation-based quantization family; this paper is the Meta FAIR variant worth comparing against QuaRot and TurboQuant.

Chapter 25: Foundry Cluster

The chapter is an operational case study; the right “further reading” is the systems papers cited under chapters 4, 21, and 23, namely (Kwon 2025), (Williams et al. 2009), and (Rajput et al. 2025), re-read against your own hardware budget. No additional bibliography entries usefully extend it.

Epilogue

(Bender and Koller 2020) Climbing towards NLU. The epilogue’s “tool, not intelligence” thesis is the same argument made by linguists in 2020. Worth re-reading after the epilogue as a check on how much of the recent debate was actually new.
(Liang et al. 2025) Machine Bullshit. If the epilogue’s worry about “plausible paragraphs short-circuiting the formation of a mind” felt like rhetoric, this paper is the empirical case that the plausibility itself is the failure.
(Schaeffer et al. 2023) Are Emergent Abilities a Mirage? The epilogue takes a strong anti-hype line; this is the strongest single technical paper supporting it.

References

Anonymous. 2025a. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity Are All You Need. arXiv:2507.14397. https://arxiv.org/abs/2507.14397.

Anonymous. 2025b. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. arXiv:2503.08311. https://arxiv.org/abs/2503.08311.

Anonymous. 2025c. Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts. arXiv:2507.15465. https://arxiv.org/abs/2507.15465.

Artificial Analysis. 2025. AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. https://artificialanalysis.ai/evaluations/omniscience.

Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073.

Baseten. 2024. A Guide to LLM Inference and Performance. Baseten engineering blog. https://www.baseten.co/blog/llm-transformer-inference-guide/.

Bender, Emily M., and Alexander Koller. 2020. “Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). https://aclanthology.org/2020.acl-main.463/.

Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165.

Chen, Mengzhao, et al. 2025. “EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). https://arxiv.org/abs/2407.11062.

Chiang, Wei-Lin, Lianmin Zheng, et al. 2024. “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.” International Conference on Machine Learning (ICML 2024). https://arxiv.org/abs/2403.04132.

Dettmers, Tim, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. “LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2208.07339.

Dziri, Nouha, Ximing Lu, Melanie Sclar, et al. 2023. “Faith and Fate: Limits of Transformers on Compositionality.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2305.18654.

Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.17323.

Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” International Conference on Learning Representations (ICLR 2020). https://arxiv.org/abs/1904.09751.

Hugging Face. 2026. LLM Course. https://huggingface.co/learn/llm-course.

Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.

Karpathy, Andrej. 2023a. Intro to Large Language Models. Video lecture. https://www.youtube.com/watch?v=zjkBMFhNj_g.

Karpathy, Andrej. 2023b. Let’s Build GPT: From Scratch, in Code, Spelled Out. Video lecture. https://www.youtube.com/watch?v=kCc8FmEb1nY.

Kwon, Woosuk. 2025. vLLM: An Efficient Inference Engine for Large Language Models. EECS-2025-192. UC Berkeley EECS. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.pdf.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.

Lambert, Nathan. 2026. Reinforcement Learning from Human Feedback. Allen Institute for AI. https://rlhfbook.com/book.pdf.

Lawsen, A. 2025. The Illusion of the Illusion of Thinking. arXiv:2506.09250. https://arxiv.org/abs/2506.09250.

Liang, Kaiqu, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, and Jaime Fernández Fisac. 2025. Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models. Princeton University and UC Berkeley, arXiv:2507.07484. https://arxiv.org/abs/2507.07484.

Lienhart, Pierre. 2024. LLM Inference Series: 5. Dissecting Model Performance. Personal engineering blog (AWS). https://medium.com/@plienhar/llm-inference-series-5-dissecting-model-performance-6144aa93168f.

Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.

Liu, Zechun, Changsheng Zhao, et al. 2024. SpinQuant: LLM Quantization with Learned Rotations. Meta FAIR, arXiv:2405.16406. https://arxiv.org/abs/2405.16406.

Mirzadeh, Iman, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” International Conference on Learning Representations (ICLR 2025). https://arxiv.org/abs/2410.05229.

NVIDIA. 2024. Mastering LLM Techniques: Inference Optimization. NVIDIA Developer Blog. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/.

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155.

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models Are Unsupervised Multitask Learners. OpenAI technical report. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Rajput, Subhash, et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, Llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.

Ravichander, Abhilasha, Ellen Vitercik, Aditya Gupta, et al. 2025. “HalluLens: LLM Hallucination Benchmark.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. https://arxiv.org/abs/2504.17550.

Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. 2023. “Are Emergent Abilities of Large Language Models a Mirage?” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2304.15004.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). https://aclanthology.org/P16-1162/.

Shojaee, Parshin, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. 2025. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML Research, arXiv:2506.06941. https://arxiv.org/abs/2506.06941.

Stanford University. 2025. CS336: Language Modeling from Scratch (Spring 2025). https://cs336.stanford.edu/.

Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762.

Vectara. 2026. Hallucination Leaderboard (HHEM-2.3 + FaithJudge). GitHub, continuously updated. https://github.com/vectara/hallucination-leaderboard.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2201.11903.

Weng, Lilian. 2024. Extrinsic Hallucinations in LLMs. Lil’Log. https://lilianweng.github.io/posts/2024-07-07-hallucination/.

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4). https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf.

Willison, Simon. n.d. Prompt Injection Tag Corpus. Ongoing blog archive. https://simonwillison.net/tags/prompt-injection/.

Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2026. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” International Conference on Learning Representations (ICLR 2026, Accepted). https://arxiv.org/abs/2504.19874.