# Ilya Top 30
Ilya Sutskever's curated reading list of 30 papers and resources spanning the conceptual foundations of deep learning, from architecture breakthroughs to information theory and complexity. The list circulated widely as a recommended curriculum for understanding the intellectual roots of modern AI.
## Canonical list
| # | Title | Year | Wiki page |
|---|---|---|---|
| 1 | The Annotated Transformer (Rush's annotated walkthrough of Attention Is All You Need) | 2018 | Attention Is All You Need |
| 2 | The First Law of Complexodynamics (Aaronson) | 2011 | The First Law Of Complexodynamics |
| 3 | The Unreasonable Effectiveness of Recurrent Neural Networks (Karpathy) | 2015 | The Unreasonable Effectiveness Of Recurrent Neural Networks |
| 4 | Understanding LSTM Networks (Olah) | 2015 | Understanding Lstm Networks |
| 5 | Recurrent Neural Network Regularization (Zaremba et al.) | 2014 | Recurrent Neural Network Regularization |
| 6 | Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton & van Camp) | 1993 | Keeping Neural Networks Simple By Minimizing Description Length |
| 7 | Pointer Networks (Vinyals et al.) | 2015 | Pointer Networks |
| 8 | ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) | 2012 | Imagenet Classification With Deep Convolutional Neural Networks |
| 9 | Order Matters: Sequence to Sequence for Sets (Vinyals et al.) | 2016 | Order Matters Sequence To Sequence For Sets |
| 10 | GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism (Huang et al.) | 2019 | Gpipe Efficient Training Of Giant Neural Nets |
| 11 | Deep Residual Learning for Image Recognition (ResNet) | 2015 | Deep Residual Learning For Image Recognition |
| 12 | Multi-Scale Context Aggregation by Dilated Convolutions (Yu & Koltun) | 2016 | Multi Scale Context Aggregation By Dilated Convolutions |
| 13 | Neural Message Passing for Quantum Chemistry (Gilmer et al.) | 2017 | Neural Message Passing For Quantum Chemistry |
| 14 | Attention Is All You Need (Vaswani et al.) | 2017 | Attention Is All You Need |
| 15 | Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al.) | 2014 | Neural Machine Translation By Jointly Learning To Align And Translate |
| 16 | Identity Mappings in Deep Residual Networks (He et al.) | 2016 | Identity Mappings In Deep Residual Networks |
| 17 | A Simple Neural Network Module for Relational Reasoning (Relation Networks) | 2017 | A Simple Neural Network Module For Relational Reasoning |
| 18 | Variational Lossy Autoencoder (Chen et al.) | 2017 | Variational Lossy Autoencoder |
| 19 | Relational Recurrent Neural Networks (Santoro et al.) | 2018 | Relational Recurrent Neural Networks |
| 20 | Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton (Aaronson, Carroll & Ouellette) | 2014 | Quantifying The Rise And Fall Of Complexity In Closed Systems |
| 21 | Neural Turing Machines (Graves et al.) | 2014 | Neural Turing Machines |
| 22 | Deep Speech 2 (Amodei et al.) | 2015 | Deep Speech 2 |
| 23 | Scaling Laws for Neural Language Models (Kaplan et al.) | 2020 | Scaling Laws For Neural Language Models |
| 24 | A Tutorial Introduction to the Minimum Description Length Principle (Grünwald) | 2004 | A Tutorial Introduction To The Minimum Description Length Principle |
| 25 | Machine Super Intelligence (Legg) | 2008 | Machine Super Intelligence |
| 26 | Kolmogorov Complexity and Algorithmic Randomness (Shen, Uspensky & Vereshchagin) | 2017 | Kolmogorov Complexity And Algorithmic Randomness |
| 27 | CS231n: Convolutional Neural Networks for Visual Recognition (Stanford course) | 2015 | Cs231N Convolutional Neural Networks For Visual Recognition |
| 28 | Denoising Diffusion Probabilistic Models (Ho et al.) | 2020 | Denoising Diffusion Probabilistic Models |
| 29 | An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT, Dosovitskiy et al.) | 2020 | An Image Is Worth 16X16 Words Transformers For Image Recognition At Scale |
| 30 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al.) | 2022 | Chain Of Thought Prompting Elicits Reasoning |
## Thematic clusters
- Architectures: Transformer (#1/#14), ResNet (#11/#16), AlexNet (#8), ViT (#29), GPipe (#10)
- Sequence modeling: RNNs (#3), LSTMs (#4), RNN regularization (#5), Pointer Networks (#7), Seq2Seq for sets (#9), Deep Speech 2 (#22)
- Attention and relational reasoning: Bahdanau attention (#15), Relation Networks (#17), Relational RNNs (#19); a minimal attention sketch follows this list
- Information theory and compression: MDL for neural nets (#6), Grünwald's MDL tutorial (#24), Kolmogorov complexity (#26)
- Complexity and intelligence: Complexodynamics (#2), Coffee Automaton (#20), Machine Super Intelligence (#25)
- Message passing and context aggregation: Neural Message Passing (#13), Dilated Convolutions (#12)
- Scaling and modern methods: Scaling Laws (#23), Diffusion Models (#28), Chain-of-Thought (#30)
- Memory and computation: Neural Turing Machines (#21)
- Courses: CS231n (#27)
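
As a concrete anchor for the attention cluster above, here is a minimal NumPy sketch of the scaled dot-product attention defined in Attention Is All You Need (#14), Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The shapes, function names, and toy data are illustrative assumptions, not reference code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in #14.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity logits
    weights = softmax(scores, axis=-1)   # each query's distribution over keys
    return weights @ V                   # (n_queries, d_v) weighted sum of values

# Toy usage: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

The 1/√d_k scaling keeps the logits' variance roughly independent of the key dimension, which stops the softmax from saturating as d_k grows; Bahdanau-style attention (#15) instead scores each query-key pair with a small feed-forward network.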
## Why this list matters
The list reveals Ilya's emphasis on compression, complexity, and information-theoretic foundations alongside practical architecture breakthroughs. The inclusion of the Kolmogorov complexity, MDL, and complexodynamics entries signals the view that intelligence is deeply connected to compression, a theme that runs through scaling laws and modern LLM capabilities.
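
The compression theme can be made concrete with the two-part code at the heart of MDL (#6, #24): prefer the model minimizing L(model) + L(data | model) in bits. Below is a minimal, self-contained sketch of that idea under loudly stated assumptions: a flat 32 bits per parameter, residuals encoded with a Gaussian Shannon code at a fixed quantization precision, and polynomial models standing in for neural networks. All names and constants are illustrative, not from either paper.

```python
import numpy as np

def residual_bits(residuals, precision=0.01):
    # Shannon code length for residuals under a fitted Gaussian,
    # quantized to `precision`: about -log2 p(r) - log2(precision) bits each.
    sigma = residuals.std() + 1e-12
    nats = 0.5 * np.log(2 * np.pi * sigma**2) + residuals**2 / (2 * sigma**2)
    return nats.sum() / np.log(2) - residuals.size * np.log2(precision)

def two_part_code_length(x, y, degree, bits_per_param=32):
    # MDL two-part code: L(model) + L(data | model), in bits.
    coeffs = np.polyfit(x, y, degree)
    return (degree + 1) * bits_per_param + residual_bits(y - np.polyval(coeffs, x))

# Toy usage: data from a noisy quadratic. The two-part code is shortest
# near the true degree; a degree-10 fit shrinks residuals slightly but
# pays more bits for its extra parameters than it saves.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 3 * x**2 - x + rng.normal(scale=0.1, size=x.size)
for degree in (1, 2, 10):
    print(degree, round(two_part_code_length(x, y, degree)))
```

On this toy data the degree-2 model yields the shortest total code, which is the MDL reading of overfitting: a richer model must compress the data by more than the cost of describing itself.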