Appendices

Citations by Tier

The full bibliography is auto-generated at the back of the book, alphabetized by citation key. This appendix groups the same citations by source tier, so you can see at a glance what level of evidence underlies each claim.

T1: Peer-reviewed, canonical, top-university material

[NeurIPS, ICML, ICLR, ACL, EMNLP, FAccT, USENIX Security, arXiv from established authors, top-university course material.]

Foundation architecture

  • (Vaswani et al. 2017) Vaswani, Shazeer, Parmar et al. (2017). Attention Is All You Need. NeurIPS 2017.

In-context learning

  • (Brown et al. 2020) Brown, Mann, Ryder et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. Note: GPT-3 paper; canonical reference for in-context learning.

Alignment / RLHF

  • (Christiano et al. 2017) Christiano, Leike, Brown et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS 2017. Note: origin of RLHF.
  • (Ouyang et al. 2022) Ouyang, Wu, Jiang et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. Note: InstructGPT; primary RLHF citation for LLM textbook context.
  • (Bai et al. 2022) Bai, Kadavath, Kundu et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. Note: Anthropic; RLAIF / constitutional method.
  • (Rafailov et al. 2023) Rafailov, Sharma, Mitchell et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. Note: DPO; widely adopted alternative to PPO-based RLHF.

Scaling laws

  • (Kaplan et al. 2020) Kaplan, McCandlish, Henighan et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
  • (Hoffmann et al. 2022) Hoffmann, Borgeaud, Mensch et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. Note: Chinchilla; 20-tokens-per-parameter rule.

Reasoning critiques

  • (Dziri et al. 2023) Dziri, Lu, Sclar et al. (2023). Faith and Fate: Limits of Transformers on Compositionality. NeurIPS 2023 Spotlight. Note: must-cite for pattern-matching-vs-reasoning.
  • (Mirzadeh et al. 2025) Mirzadeh, Alizadeh, Shahrokhi et al. (2025). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. ICLR 2025. Note: up to 65% accuracy drop from a single irrelevant clause; strongest brittleness evidence.
  • (Shojaee et al. 2025) Shojaee, Mirzadeh, Alizadeh et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML Research, arXiv:2506.06941. Note: accuracy-collapse-with-complexity on o3-mini, DeepSeek-R1.
  • (Wei et al. 2022) Wei, Wang, Schuurmans et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. Note: foundational CoT paper.
  • (Schaeffer et al. 2023) Schaeffer, Miranda, Koyejo (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023 Outstanding Paper.
  • (Bender and Koller 2020) Bender, Koller (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. ACL 2020. Note: form-without-grounded-meaning framing.

Hallucination

  • (Kalai et al. 2025) Kalai, Nachum, Vempala, Zhang (2025). Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. Note: must-cite 2025 paper; statistical-classification framing.
  • (Ravichander et al. 2025) Ravichander, Vitercik, Gupta et al. (2025). HalluLens: LLM Hallucination Benchmark. ACL 2025 (Long Papers). Note: Meta FAIR + HKUST; extrinsic/intrinsic taxonomy.
  • (Liang et al. 2025) Liang, Hu, Zhao, Song, Griffiths, Fernandez Fisac (2025). Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models. Princeton / UC Berkeley, arXiv:2507.07484. Note: Frankfurt “bullshit” concept applied to LLMs; introduces Bullshit Index plus four-form taxonomy (empty rhetoric, paltering, weasel words, unverified claims). Finds RLHF nearly doubles the index (0.38 to ~1.0) while user satisfaction rises 48%. CoT amplifies certain forms. Distinct from both hallucination and sycophancy.

Tokenization and sampling

  • (Sennrich et al. 2016) Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. Note: BPE paper; still the reference tokenization method.
  • (Holtzman et al. 2020) Holtzman, Buys, Du, Forbes, Choi (2020). The Curious Case of Neural Text Degeneration. ICLR 2020. Note: canonical nucleus-sampling paper.

Long context and attention systems

  • (N. F. Liu et al. 2024) Liu, Lin, Hewitt et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL. Note: advertised context is not usable context.
  • (Dao et al. 2022) Dao, Fu, Ermon, Rudra, Ré (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. Note: IO-awareness reframe; attention cost is memory-movement cost.
  • (Dao 2023) Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.
  • (Shah et al. 2024) Shah, Bikshandi, Zhang, Thakkar et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv:2407.08608.

Inference systems / serving

  • (Kwon et al. 2023) Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. Note: must-cite for KV-cache memory management.
  • (Kwon 2025) Kwon (2025). vLLM: An Efficient Inference Engine for Large Language Models. UC Berkeley EECS Tech Report EECS-2025-192. Note: 2025 full-system writeup.
  • (Sheng et al. 2023) Sheng, Zheng, Yuan, Li, Ryabinin, Fu, Xie, Chen et al. (2023). FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. ICML 2023. Note: offloading-aware scheduling; why offloading pays bandwidth cost.

Hardware / bandwidth

  • (Williams et al. 2009) Williams, Waterman, Patterson (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM 52(4). Note: textbook-quality roofline reference.
  • (Anonymous 2025b) Anonymous (2025). Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. arXiv:2503.08311. Note: 2025 empirical measurement of DRAM-bandwidth dominance.
  • (Anonymous 2025a) Anonymous (2025). Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need. arXiv:2507.14397. Note: four-constraint limit study.
  • (Anonymous 2025c) Anonymous (2025). Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts. arXiv:2507.15465. Note: MLA shifts attention toward compute-bound; MHA/GQA remain bandwidth-bound.

Quantization

  • (Dettmers et al. 2022) Dettmers, Lewis, Belkada, Zettlemoyer (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. Note: 8-bit-without-accuracy-loss landmark.
  • (Frantar et al. 2023) Frantar, Ashkboos, Hoefler, Alistarh (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023. Note: 3-4 bit PTQ baseline.
  • (Zandieh et al. 2026) Zandieh, Daliri, Hadian, Mirrokni (2026). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026 (accepted). Note: Google Research; random rotation plus optimal scalar quantizer; 3-bit KV-cache with zero accuracy loss.
  • (Ashkboos et al. 2024) Ashkboos, Mohtashami, Croci, Li, Jaggi, Alistarh, Tizi, Hensman (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. arXiv:2404.00456. Note: fixed Hadamard rotation; end-to-end 4-bit matmul.
  • (Z. Liu et al. 2024) Liu, Zhao et al. (2024). SpinQuant: LLM Quantization with Learned Rotations. Meta FAIR, arXiv:2405.16406. Note: learned rotations via Cayley optimization.
  • (Chen et al. 2025) Chen et al. (2025). EfficientQAT: Efficient Quantization-Aware Training for Large Language Models. ACL 2025. Note: 2-bit Llama-70B on single A100 in 41h, under 3 accuracy points loss.

Benchmarks

  • (Chiang et al. 2024) Chiang, Zheng et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML 2024. Note: industry-standard harness.

Apple Silicon / local inference

  • (Rajput et al. 2025) Rajput et al. (2025). Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS. arXiv:2511.05502. Note: M2 Ultra 192GB; MLX ~230 tok/s, llama.cpp ~150 tok/s; must-cite for local-inference-is-viable claim.

Course material

  • (Stanford University 2025) Stanford University (2025). CS336: Language Modeling from Scratch (Spring 2025). Note: 2025 refresh of Stanford language-modeling pedagogy.

T2: Named-author vendor research, official docs, standards bodies

[Model-vendor research blogs with named authors, vendor documentation, standards bodies, HuggingFace LLM Course.]

  • (Radford et al. 2019) Radford, Wu, Child, Luan, Amodei, Sutskever (2019). Language Models are Unsupervised Multitask Learners. OpenAI technical report. Note: vendor research.
  • (OpenAI 2023) OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
  • (Anthropic 2026) Anthropic (2026). Messages API reference. Current product documentation. Note: first-party evidence of stateless chat wrapper.
  • (Hugging Face 2026) Hugging Face (2026). LLM Course. Note: best freely-available canonical walkthrough.
  • (Lambert 2026) Lambert (2026). Reinforcement Learning from Human Feedback. Allen Institute for AI. Note: single best contemporary RLHF reference.
  • (NVIDIA 2024) NVIDIA (2024). Mastering LLM Techniques: Inference Optimization. NVIDIA Developer Blog. Note: first-party vendor measurement.
  • (Baseten 2024) Baseten (2024). A guide to LLM inference and performance. Baseten engineering blog.
  • (Weng 2024) Weng (2024). Extrinsic Hallucinations in LLMs. Lil’Log. Note: Weng cited as named senior researcher; orient before diving into papers.
  • (Templeton et al. 2024) Templeton, Conerly, Marcus, Lindsey, Bricken, Chen et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic. Note: for mechanistic-interpretability counterpoint to “facts are not keyed storage.”

T3: Published leaderboards with methodology

[LMArena, SWE-bench, AA-Omniscience, Vectara Hallucination Leaderboard.]

  • (Artificial Analysis 2025) Artificial Analysis (2025). AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. Note: 6000 questions, 42 topics; cite with snapshot date.
  • (Vectara 2026) Vectara (2026). Hallucination Leaderboard (HHEM-2.3 + FaithJudge). GitHub, continuously updated. Note: cite with snapshot date.

T4: Edu-tier individuals, cited as context

[Karpathy, Chris Olah, Lilian Weng, Simon Willison (for prompt injection specifically).]

  • (Karpathy 2023a) Karpathy (2023). Intro to Large Language Models. Video lecture.
  • (Karpathy 2023b) Karpathy (2023). Let’s build GPT: from scratch, in code, spelled out. Video lecture. Note: best artifact for seeing tokens-to-tokens in 200 lines.
  • (Karpathy 2024) Karpathy (2024). Let’s build the GPT Tokenizer. Video lecture. Note: BPE mechanics hands-on.
  • (Willison, n.d.) Willison. Prompt injection tag corpus. Ongoing blog archive. Note: archivist for prompt-injection field.
  • (Lawsen 2025) Lawsen (2025). The Illusion of the Illusion of Thinking. arXiv:2506.09250. Note: cited as context; rebuttal walked back by author.
  • (Lienhart 2024) Lienhart (2024). LLM Inference Series: 5. Dissecting model performance. Personal engineering blog (AWS). Note: rigorous arithmetic-intensity walkthrough.

Tier annotations also appear inline in references.bib’s note field.
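
As a rough illustration of how a note-field annotation could drive the grouping in this appendix, the sketch below reads a BibTeX-style string and buckets citation keys by tier. The citation keys, the entry layout, and the bare T1-style tag inside the note field are all assumptions made for the example; the actual conventions in references.bib may differ.

```python
import re
from collections import defaultdict

# Hypothetical excerpt: the keys, field layout, and "T1"/"T2"/"T4" tag
# spelling are assumptions for illustration, not the real references.bib.
SAMPLE_BIB = """
@inproceedings{vaswani2017attention,
  title = {Attention Is All You Need},
  year = {2017},
  note = {T1}
}
@misc{openai2023gpt4,
  title = {GPT-4 Technical Report},
  year = {2023},
  note = {T2}
}
@misc{karpathy2023intro,
  title = {Intro to Large Language Models},
  year = {2023},
  note = {T4}
}
"""

# One entry: "@type{key," followed by fields, closed by "}" on its own line.
ENTRY_RE = re.compile(r"@\w+\{([^,\s]+),(.*?)\n\}", re.DOTALL)
# First tier tag (T1..T4) found inside the note field.
NOTE_TIER_RE = re.compile(r"note\s*=\s*\{[^}]*\b(T[1-4])\b", re.IGNORECASE)


def group_by_tier(bib_text):
    """Map each tier tag to the sorted citation keys carrying that tag."""
    tiers = defaultdict(list)
    for key, body in ENTRY_RE.findall(bib_text):
        match = NOTE_TIER_RE.search(body)
        # Entries without a recognizable tag are kept, not silently dropped.
        tier = match.group(1).upper() if match else "untiered"
        tiers[tier].append(key)
    return {tier: sorted(keys) for tier, keys in sorted(tiers.items())}


if __name__ == "__main__":
    for tier, keys in group_by_tier(SAMPLE_BIB).items():
        print(tier, keys)
```

Entries with no recognizable tag land in an "untiered" bucket, which makes a missing annotation easy to spot before the appendix is regenerated.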

Anonymous. 2025a. Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity Are All You Need. arXiv:2507.14397. https://arxiv.org/abs/2507.14397.
Anonymous. 2025b. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. arXiv:2503.08311. https://arxiv.org/abs/2503.08311.
Anonymous. 2025c. Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts. arXiv:2507.15465. https://arxiv.org/abs/2507.15465.
Anthropic. 2026. Messages API Reference. Current product documentation. https://docs.anthropic.com/en/api/messages.
Artificial Analysis. 2025. AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models. arXiv:2511.13029. https://artificialanalysis.ai/evaluations/omniscience.
Ashkboos, Saleh, Amirkeivan Mohtashami, Maximilian L. Croci, et al. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. arXiv:2404.00456. https://arxiv.org/abs/2404.00456.
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073.
Baseten. 2024. A Guide to LLM Inference and Performance. Baseten engineering blog. https://www.baseten.co/blog/llm-transformer-inference-guide/.
Bender, Emily M., and Alexander Koller. 2020. “Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). https://aclanthology.org/2020.acl-main.463/.
Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165.
Chen, Mengzhao et al. 2025. “EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). https://arxiv.org/abs/2407.11062.
Chiang, Wei-Lin, Lianmin Zheng, et al. 2024. “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.” International Conference on Machine Learning (ICML 2024). https://arxiv.org/abs/2403.04132.
Christiano, Paul F., Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03741.
Dao, Tri. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691. https://arxiv.org/abs/2307.08691.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2205.14135.
Dettmers, Tim, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. “LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2208.07339.
Dziri, Nouha, Ximing Lu, Melanie Sclar, et al. 2023. “Faith and Fate: Limits of Transformers on Compositionality.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2305.18654.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.17323.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.15556.
Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” International Conference on Learning Representations (ICLR 2020). https://arxiv.org/abs/1904.09751.
Hugging Face. 2026. LLM Course. https://huggingface.co/learn/llm-course.
Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Ying Zhang. 2025. Why Language Models Hallucinate. OpenAI / Georgia Tech, arXiv:2509.04664. https://arxiv.org/abs/2509.04664.
Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361.
Karpathy, Andrej. 2023a. Intro to Large Language Models. Video lecture. https://www.youtube.com/watch?v=zjkBMFhNj_g.
Karpathy, Andrej. 2023b. Let’s Build GPT: From Scratch, in Code, Spelled Out. Video lecture. https://www.youtube.com/watch?v=kCc8FmEb1nY.
Karpathy, Andrej. 2024. Let’s Build the GPT Tokenizer. Video lecture. https://www.youtube.com/watch?v=zduSFxRajkE.
Kwon, Woosuk. 2025. vLLM: An Efficient Inference Engine for Large Language Models. EECS-2025-192. UC Berkeley EECS. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.pdf.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM Symposium on Operating Systems Principles (SOSP 2023). https://arxiv.org/abs/2309.06180.
Lambert, Nathan. 2026. Reinforcement Learning from Human Feedback. Allen Institute for AI. https://rlhfbook.com/book.pdf.
Lawsen, A. 2025. The Illusion of the Illusion of Thinking. arXiv:2506.09250. https://arxiv.org/abs/2506.09250.
Liang, Kaiqu, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, and Jaime Fernández Fisac. 2025. Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models. Princeton University and UC Berkeley, arXiv:2507.07484. https://arxiv.org/abs/2507.07484.
Lienhart, Pierre. 2024. LLM Inference Series: 5. Dissecting Model Performance. Personal engineering blog (AWS). https://medium.com/@plienhar/llm-inference-series-5-dissecting-model-performance-6144aa93168f.
Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172.
Liu, Zechun, Changsheng Zhao, et al. 2024. SpinQuant: LLM Quantization with Learned Rotations. Meta FAIR, arXiv:2405.16406. https://arxiv.org/abs/2405.16406.
Mirzadeh, Iman, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” International Conference on Learning Representations (ICLR 2025). https://arxiv.org/abs/2410.05229.
NVIDIA. 2024. Mastering LLM Techniques: Inference Optimization. NVIDIA Developer Blog. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/.
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774. https://arxiv.org/abs/2303.08774.
Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models Are Unsupervised Multitask Learners. OpenAI technical report. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2305.18290.
Rajput, Subhash et al. 2025. Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS. arXiv:2511.05502. https://arxiv.org/abs/2511.05502.
Ravichander, Abhilasha, Ellen Vitercik, Aditya Gupta, et al. 2025. “HalluLens: LLM Hallucination Benchmark.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Long Papers. https://arxiv.org/abs/2504.17550.
Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. 2023. “Are Emergent Abilities of Large Language Models a Mirage?” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2304.15004.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). https://aclanthology.org/P16-1162/.
Shah, Jay, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, et al. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. arXiv:2407.08608. https://arxiv.org/abs/2407.08608.
Sheng, Ying, Lianmin Zheng, Binhang Yuan, et al. 2023. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” International Conference on Machine Learning (ICML 2023). https://arxiv.org/abs/2303.06865.
Shojaee, Parshin, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. 2025. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML Research, arXiv:2506.06941. https://arxiv.org/abs/2506.06941.
Stanford University. 2025. CS336: Language Modeling from Scratch (Spring 2025). https://cs336.stanford.edu/.
Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762.
Vectara. 2026. Hallucination Leaderboard (HHEM-2.3 + FaithJudge). GitHub, continuously updated. https://github.com/vectara/hallucination-leaderboard.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2201.11903.
Weng, Lilian. 2024. Extrinsic Hallucinations in LLMs. Lil’Log. https://lilianweng.github.io/posts/2024-07-07-hallucination/.
Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4). https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf.
Willison, Simon. n.d. Prompt Injection Tag Corpus. Ongoing blog archive. https://simonwillison.net/tags/prompt-injection/.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2026. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” International Conference on Learning Representations (ICLR 2026, Accepted). https://arxiv.org/abs/2504.19874.