Ilya Top 30

Ilya Sutskever's curated reading list of 30 papers and resources spanning the conceptual foundations of deep learning, from architecture breakthroughs to information theory and complexity. This list circulated widely as a recommended curriculum for understanding the intellectual roots of modern AI.

Canonical list

| # | Title | Year | Wiki page |
|---|-------|------|-----------|
| 1 | The Annotated Transformer (Attention Is All You Need) | 2017 | Attention Is All You Need |
| 2 | The First Law of Complexodynamics (Aaronson) | 2011 | The First Law Of Complexodynamics |
| 3 | The Unreasonable Effectiveness of Recurrent Neural Networks (Karpathy) | 2015 | The Unreasonable Effectiveness Of Recurrent Neural Networks |
| 4 | Understanding LSTM Networks (Olah) | 2015 | Understanding Lstm Networks |
| 5 | Recurrent Neural Network Regularization (Zaremba et al.) | 2014 | Recurrent Neural Network Regularization |
| 6 | Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton & van Camp) | 1993 | Keeping Neural Networks Simple By Minimizing Description Length |
| 7 | Pointer Networks (Vinyals et al.) | 2015 | Pointer Networks |
| 8 | ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) | 2012 | Imagenet Classification With Deep Convolutional Neural Networks |
| 9 | Order Matters: Sequence to Sequence for Sets (Vinyals et al.) | 2016 | Order Matters Sequence To Sequence For Sets |
| 10 | GPipe: Efficient Training of Giant Neural Nets (Huang et al.) | 2019 | Gpipe Efficient Training Of Giant Neural Nets |
| 11 | Deep Residual Learning for Image Recognition (ResNet) | 2015 | Deep Residual Learning For Image Recognition |
| 12 | Multi-Scale Context Aggregation by Dilated Convolutions (Yu & Koltun) | 2016 | Multi Scale Context Aggregation By Dilated Convolutions |
| 13 | Neural Message Passing for Quantum Chemistry (Gilmer et al.) | 2017 | Neural Message Passing For Quantum Chemistry |
| 14 | Attention Is All You Need (Vaswani et al.) | 2017 | Attention Is All You Need |
| 15 | Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al.) | 2014 | Neural Machine Translation By Jointly Learning To Align And Translate |
| 16 | Identity Mappings in Deep Residual Networks (He et al.) | 2016 | Identity Mappings In Deep Residual Networks |
| 17 | A Simple Neural Network Module for Relational Reasoning (Relation Networks) | 2017 | A Simple Neural Network Module For Relational Reasoning |
| 18 | Variational Lossy Autoencoder (Chen et al.) | 2017 | Variational Lossy Autoencoder |
| 19 | Relational Recurrent Neural Networks (Santoro et al.) | 2018 | Relational Recurrent Neural Networks |
| 20 | Quantifying the Rise and Fall of Complexity in Closed Systems (Coffee Automaton) | 2014 | Quantifying The Rise And Fall Of Complexity In Closed Systems |
| 21 | Neural Turing Machines (Graves et al.) | 2014 | Neural Turing Machines |
| 22 | Deep Speech 2 (Amodei et al.) | 2015 | Deep Speech 2 |
| 23 | Scaling Laws for Neural Language Models (Kaplan et al.) | 2020 | Scaling Laws For Neural Language Models |
| 24 | A Tutorial Introduction to the Minimum Description Length Principle (Grünwald) | 2004 | A Tutorial Introduction To The Minimum Description Length Principle |
| 25 | Machine Super Intelligence (Legg) | 2008 | Machine Super Intelligence |
| 26 | Kolmogorov Complexity and Algorithmic Randomness (Shen, Uspensky & Vereshchagin) | 2017 | Kolmogorov Complexity And Algorithmic Randomness |
| 27 | CS231n: Convolutional Neural Networks for Visual Recognition (Stanford course) | 2015 | Cs231N Convolutional Neural Networks For Visual Recognition |
| 28 | Denoising Diffusion Probabilistic Models (Ho et al.) | 2020 | Denoising Diffusion Probabilistic Models |
| 29 | An Image Is Worth 16x16 Words: Vision Transformer (ViT) | 2020 | An Image Is Worth 16X16 Words Transformers For Image Recognition At Scale |
| 30 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al.) | 2022 | Chain Of Thought Prompting Elicits Reasoning |

Thematic clusters

Architectures: Transformer (#1/#14), ResNet (#11/#16), AlexNet (#8), ViT (#29), Dilated Convolutions (#12), GPipe (#10)
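
The core mechanisms behind these entries fit in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention (#14) and a ResNet-style identity shortcut (#11/#16); the shapes and the toy residual function are illustrative choices, not taken from any paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., #14).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) alignment scores
    return softmax(scores, axis=-1) @ V       # weighted sum of value vectors

def residual_block(x, f):
    # y = x + F(x): the identity shortcut of ResNet (#11, #16).
    return x + f(x)

# Toy usage: 4 query positions, 6 key/value positions, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=s) for s in [(4, 8), (6, 8), (6, 8)])
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
out = residual_block(out, lambda h: 0.1 * h)  # trivial F, for illustration only
print(out.shape)
```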

Sequence modeling: RNNs (#3), LSTMs (#4), RNN regularization (#5), Pointer Networks (#7), Seq2Seq for sets (#9), Deep Speech 2 (#22)
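
For the recurrent entries, the single cell update that Olah's post (#4) walks through can be sketched directly. A minimal one-step LSTM, assuming the four gates' parameters are stacked into one weight matrix; the shapes and gate ordering here are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One LSTM step in the spirit of Olah's post (#4). W, U, b hold the
    # stacked parameters of the forget, input, candidate, and output gates.
    z = W @ x + U @ h + b                 # (4 * hidden,) pre-activations
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)                        # candidate cell values
    c = f * c + i * g                     # cell state: forget old, write new
    h = o * np.tanh(c)                    # hidden state read out from the cell
    return h, c

# Toy usage: input size 3, hidden size 5, a length-10 random sequence.
rng = np.random.default_rng(0)
n_in, n_h = 3, 5
W = rng.normal(size=(4 * n_h, n_in))
U = rng.normal(size=(4 * n_h, n_h))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (5,)
```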

Attention and relational reasoning: Bahdanau attention (#15), Relation Networks (#17), Relational RNNs (#19)
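
The attention in Bahdanau et al. (#15) predates the Transformer's and uses an additive score, score(s, h_j) = v^T tanh(W s + U h_j). A minimal sketch of that alignment mechanism, keeping the paper's v/W/U naming; the dimensions are illustrative.

```python
import numpy as np

def additive_attention(s, H, W, U, v):
    # Bahdanau et al. (#15): alignment scores v^T tanh(W s + U h_j),
    # softmaxed into weights that mix the encoder states H into a context.
    scores = np.tanh(W @ s + H @ U.T) @ v      # (n_src,) alignment scores
    a = np.exp(scores - scores.max())
    a /= a.sum()                               # softmax over source positions
    return a @ H                               # context vector, shape (d_h,)

# Toy usage: decoder state of size 4 attending over 6 encoder states of size 4.
rng = np.random.default_rng(0)
d_h, n_src, d_a = 4, 6, 8
s = rng.normal(size=d_h)                       # current decoder state
H = rng.normal(size=(n_src, d_h))              # encoder hidden states
W, U, v = (rng.normal(size=sz) for sz in [(d_a, d_h), (d_a, d_h), (d_a,)])
print(additive_attention(s, H, W, U, v).shape)
```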

Information theory and compression: MDL/Hinton (#6), Variational Lossy Autoencoder (#18), Grünwald MDL tutorial (#24), Kolmogorov complexity (#26)
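
The common thread of these entries is the two-part code: a model is good exactly to the extent that it compresses the data. In the standard MDL formulation (as in Grünwald's tutorial, #24):

```latex
% Two-part MDL: the best hypothesis H for data D minimizes the bits needed
% to describe the model plus the bits to describe the data given the model.
\[
  H^{*} = \arg\min_{H \in \mathcal{H}} \, \bigl[\, L(H) + L(D \mid H) \,\bigr]
\]
```

Hinton & van Camp (#6) apply the same trade-off to neural networks, balancing the description length of the weights against the misfit of the data.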

Complexity and intelligence: Complexodynamics (#2), Coffee Automaton (#20), Machine Super Intelligence (#25)
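
The "coffee automaton" argument (#20) can be reproduced in miniature: entropy rises monotonically as cream mixes into coffee, but the apparent complexity of a coarse-grained snapshot rises and then falls. A toy sketch using gzip-compressed size as a crude stand-in for Kolmogorov complexity (#26); the grid size, step count, swap dynamics, and coarse-graining are all arbitrary choices here, and how visible the rise-and-fall is depends on them, which is part of what the paper examines.

```python
import zlib
import numpy as np

# Toy "coffee automaton" in the spirit of #20: cream (1s) floating on coffee
# (0s), mixed by random adjacent swaps. The gzip size of a coarse-grained,
# quantized snapshot stands in (crudely) for its Kolmogorov complexity.
rng = np.random.default_rng(0)
n = 64
grid = np.zeros((n, n), dtype=np.uint8)
grid[: n // 2] = 1                            # cream on top at t = 0

def coarse_complexity(g, block=8, levels=4):
    # Average over block x block cells, quantize to a few levels, compress.
    coarse = g.reshape(n // block, block, n // block, block).mean(axis=(1, 3))
    q = np.round(coarse * levels).astype(np.uint8)
    return len(zlib.compress(q.tobytes()))

dirs = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)])
for t in range(200_001):
    if t % 40_000 == 0:
        print(f"step {t:>6}: coarse complexity ~ {coarse_complexity(grid)} bytes")
    x, y = rng.integers(n, size=2)            # pick a random cell...
    dx, dy = dirs[rng.integers(4)]            # ...and a random neighbor
    x2, y2 = (x + dx) % n, (y + dy) % n       # periodic boundary
    grid[x, y], grid[x2, y2] = grid[x2, y2], grid[x, y]
```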

Graph and message passing: Neural Message Passing (#13)
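
Gilmer et al. (#13) reduce a family of graph networks to one update: each node aggregates messages from its neighbors, then updates its state. A minimal sketch with linear message and update functions standing in for the learned ones; the graph, dimensions, and weight names are illustrative.

```python
import numpy as np

def mpnn_round(H, A, W_msg, W_upd):
    # One message-passing round in the shape of Gilmer et al. (#13):
    #   m_v  = sum over neighbors w of M(h_w)   (M: a single linear map here)
    #   h_v' = U(h_v, m_v)                      (U: linear over [h_v; m_v] here)
    M = A @ (H @ W_msg.T)                       # aggregate neighbor messages
    return np.tanh(np.concatenate([H, M], axis=1) @ W_upd.T)

# Toy usage: a 4-node path graph with 3-dimensional node states.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)      # adjacency matrix
rng = np.random.default_rng(0)
d = 3
H = rng.normal(size=(4, d))
W_msg = rng.normal(size=(d, d))
W_upd = rng.normal(size=(d, 2 * d))
for _ in range(2):                              # each round widens the receptive field
    H = mpnn_round(H, A, W_msg, W_upd)
print(H.shape)
```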

Scaling and modern methods: Scaling Laws (#23), Diffusion Models (#28), Chain-of-Thought (#30)
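
The headline of Kaplan et al. (#23) is that test loss falls as a smooth power law in parameters N, dataset size D (tokens), and compute C. For reference (the exponents quoted in the comment are approximate recollections of the paper's fitted values; treat them as indicative, not exact):

```latex
% Power-law fits from Kaplan et al. (#23), each with the other
% resources unconstrained:
\[
  L(N) = \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
  L(D) = \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad
  L(C_{\min}) = \left(\tfrac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
\]
% Reported exponents, approximately: \alpha_N \approx 0.076,
% \alpha_D \approx 0.095, \alpha_C^{\min} \approx 0.050.
```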

Memory and computation: Neural Turing Machines (#21)
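
The mechanism that makes an NTM's external memory differentiable is content-based addressing: a key is compared to every memory row by cosine similarity, sharpened by a scalar beta, and softmaxed into read weights. A minimal NumPy sketch; the memory size and key construction in the usage example are illustrative.

```python
import numpy as np

def content_addressing(M, k, beta):
    # NTM read weights (Graves et al., #21): w_i = softmax(beta * cos(k, M_i)).
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    z = beta * sim
    e = np.exp(z - z.max())
    return e / e.sum()                         # weights over memory rows

# Toy usage: memory of 8 slots of width 4; query with a key near row 3.
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))                    # memory matrix
k = M[3] + 0.1 * rng.normal(size=4)            # noisy copy of row 3
w = content_addressing(M, k, beta=5.0)
print(np.argmax(w), (w @ M).shape)             # -> 3 (likely), (4,)
```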

Courses: CS231n (#27)

Why this list matters

The list reveals Ilya's emphasis on compression, complexity, and information-theoretic foundations alongside practical architecture breakthroughs. The inclusion of the Kolmogorov complexity (#26), MDL (#6, #24), and complexodynamics (#2, #20) entries signals the view that intelligence is deeply connected to compression, a theme that resurfaces in the scaling laws (#23) and in the capabilities of modern LLMs.