Papers
29 paper summaries tagged ilya-30
📄 **[Read on arXiv](https://arxiv.org/abs/2201.11903)** Wei et al., NeurIPS, 2022. Chain-of-thought (CoT) prompting demonstrates that including interme…
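As a quick illustration of the technique (paraphrasing the paper's Figure 1 exemplar; the model call itself is omitted), the demonstration includes intermediate reasoning before the final answer, and the model imitates that format on the new question:

```python
# Chain-of-thought few-shot prompt, paraphrasing the paper's Figure 1 example.
# The exemplar shows its reasoning before the answer; a standard prompt would
# jump straight to "The answer is 11."
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
print(cot_prompt)  # fed to a sufficiently large LM, the expected answer is 9
```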
📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR, 2021. The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…
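A minimal sketch of the input pipeline the summary refers to, with random arrays standing in for learned parameters:

```python
import numpy as np

# Split the image into 16x16 patches, flatten each, and linearly project
# into the embedding space (dimensions illustrative; E, pos, cls would be
# learned in a real model).
img = np.random.rand(224, 224, 3)
P, D = 16, 768
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)     # (196, 768): one row per patch
E = np.random.randn(P * P * 3, D) * 0.02     # patch projection
pos = np.random.randn(197, D) * 0.02         # position embeddings
cls = np.zeros((1, D))                       # prepended [CLS] token
seq = np.concatenate([cls, patches @ E]) + pos
print(seq.shape)  # (197, 768): what a standard Transformer encoder consumes
```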
📄 **[Read on arXiv](https://arxiv.org/abs/2001.08361)** This is the canonical early scaling-law paper for language models, authored by Kaplan et al. at OpenAI. It demonstrated that neural language model cross-entropy lo…
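The headline fit is a power law in non-embedding parameter count N; a tiny sketch using the paper's reported constants (α_N ≈ 0.076, N_c ≈ 8.8e13), valid when data and compute are not bottlenecks:

```python
# Test loss (nats per token) as a power law in parameter count N,
# with the paper's fitted constants.
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

for n in (1e6, 1e9, 1e12):
    print(f"N={n:.0e}: L≈{loss_from_params(n):.3f}")
# Each 1000x increase in N multiplies the loss by 1000**-0.076 ≈ 0.59.
```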
📄 **[Read on arXiv](https://arxiv.org/abs/2006.11239)** Ho, Jain, and Abbeel, NeurIPS, 2020. Denoising Diffusion Probabilistic Models (DDPM) demonstrate that high-quality ima…
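A minimal sketch of the training objective in the paper's notation (`eps_model` is a dummy stand-in for the learned noise predictor):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)              # the paper's linear schedule
alpha_bar = np.cumprod(1.0 - betas)             # \bar{alpha}_t

def eps_model(x_t, t):                          # placeholder network
    return np.zeros_like(x_t)

x0 = np.random.rand(32, 32)                     # a clean training image
t = np.random.randint(T)
eps = np.random.randn(*x0.shape)
# Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
loss = np.mean((eps - eps_model(x_t, t)) ** 2)  # simple L2 on the noise
```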
📄 **[Read on arXiv](https://arxiv.org/abs/1811.06965)** GPipe introduces micro-batch pipeline parallelism as a practical method for training neural networks too large to fit on a single accelerator. The core idea is to…
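A back-of-the-envelope sketch of why micro-batching helps: per the paper's analysis, with K pipeline stages and M micro-batches the idle "bubble" fraction of the schedule is (K - 1) / (M + K - 1), which raising M drives toward zero:

```python
# Idle fraction of the GPipe schedule, per the paper's bubble analysis.
def bubble_fraction(num_stages, num_microbatches):
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(8, 1))    # no micro-batching: 0.875 of time is idle
print(bubble_fraction(8, 32))   # M=32 micro-batches: ~0.18
```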
📄 **[Read on arXiv](https://arxiv.org/abs/1806.01822)** Traditional RNNs (LSTMs, GRUs) compress all sequential information into a single fixed-size hidden vector, which fundamentally limits their ability to store and re…
📄 **[Read on arXiv](https://arxiv.org/abs/1704.01212)** This paper provided the conceptual unification that the graph neural network field needed. By showing that seemingly different architectures -- GCN, GraphSAGE, Gat…
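A toy instance of the framework's message-passing step (linear maps stand in for the learned message function M and update function U):

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]            # a 3-node directed cycle
h = np.random.randn(3, 4)                   # node states, d = 4
W_msg, W_upd = np.random.randn(4, 4), np.random.randn(8, 4)

m = np.zeros_like(h)
for v, w in edges:
    m[v] += h[w] @ W_msg                    # m_v = sum over neighbours of M(h_w)
h_next = np.tanh(np.concatenate([h, m], axis=1) @ W_upd)  # h_v' = U(h_v, m_v)
```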
📄 **[AMS Book Page](https://bookstore.ams.org/surv-220)** This monograph by Shen, Uspensky, and Vereshchagin is the definitive modern reference on algorithmic information theory. The central concept is Kolmogorov comple…
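For reference, the central definition in standard notation: the complexity of a string x is the length of the shortest program producing it on a fixed universal machine U, well-defined up to an additive constant by the invariance theorem.

```latex
K_U(x) = \min \{\, \ell(p) : U(p) = x \,\}, \qquad
K_U(x) \le K_V(x) + c_{U,V} \ \text{for any machine } V
```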
📄 **[Read on arXiv](https://arxiv.org/abs/1706.03762)** Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS, 2017. - [The Annotated Transformer](htt…
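A minimal sketch of the paper's scaled dot-product attention for a single head, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V:

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

Q, K, V = (np.random.randn(5, 64) for _ in range(3))
print(attention(Q, K, V).shape)                     # (5, 64)
```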
📄 **[Read on arXiv](https://arxiv.org/abs/1706.01427)** Santoro, Raposo, Barrett, Malinowski, Pascanu, Battaglia, Lillicrap (DeepMind), NeurIPS, 2017. Relation Networks (RNs) are simple neural network modules for relat…
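A toy rendering of the paper's formula RN(O) = f_φ(Σ_{i,j} g_θ(o_i, o_j)), with one-layer maps standing in for the learned MLPs:

```python
import numpy as np

# g scores every ordered pair of objects; the pooled sum goes through f.
objects = np.random.randn(6, 8)             # 6 objects, 8-dim each
W_g = np.random.randn(16, 12)
W_f = np.random.randn(12, 10)

pair_sum = np.zeros(12)
for o_i in objects:
    for o_j in objects:
        pair_sum += np.tanh(np.concatenate([o_i, o_j]) @ W_g)  # g_theta
output = np.tanh(pair_sum @ W_f)            # f_phi -> task logits
```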
📄 **[Read on arXiv](https://arxiv.org/abs/1611.02731)** The Variational Lossy Autoencoder (VLAE) by Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Sutskever, and Abbeel (2016) addresses the fundamental tension in VAE…
📄 **[Read on arXiv](https://arxiv.org/abs/1511.06391)** This paper by Samy Bengio, Oriol Vinyals, and Manjunath Kudlur challenges a core assumption in sequence modeling: that the order of input and output data is merely…
📄 **[Read on arXiv](https://arxiv.org/abs/1603.05027)** This paper, a follow-up to the original ResNet work, provides both theoretical analysis and empirical evidence that the arrangement of operations within residual b…
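A schematic contrast of the two orderings, assuming identity placeholders for `bn` and `conv` (only the order of operations is the point):

```python
import numpy as np

bn = conv = lambda t: t                    # placeholders for the real layers
relu = lambda t: np.maximum(t, 0)

def v1_unit(x):      # original ResNet: residual + shortcut, then a final ReLU
    return relu(bn(conv(relu(bn(conv(x))))) + x)

def preact_unit(x):  # proposed: BN-ReLU-conv twice; identity path untouched
    return conv(relu(bn(conv(relu(bn(x)))))) + x

x = np.random.randn(4)
print(v1_unit(x), preact_unit(x))
```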
📄 **[Read Blog Post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)** Christopher Olah's 2015 blog post is a widely used pedagogical reference for understanding LSTM internals. The post explains why vanilla…
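For reference, the gate equations the post walks through, in standard notation (σ is the logistic sigmoid, ⊙ the elementwise product):

```latex
f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f) \qquad
i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i) \qquad
o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o)

\tilde{C}_t = \tanh(W_C\,[h_{t-1}, x_t] + b_C) \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad
h_t = o_t \odot \tanh(C_t)
```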
📄 **[Read Blog Post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)** Andrej Karpathy's 2015 blog post offers a vivid qualitative demonstration that character-level recurrent neural networks with LSTM cells c…
📄 **[Read on arXiv](https://arxiv.org/abs/1506.03134)** Pointer Networks repurpose the attention mechanism as an output distribution, replacing the fixed output vocabulary of sequence-to-sequence models with attention w…
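A minimal sketch of the pointer mechanism with random stand-in weights: additive attention scores over the encoder states are themselves the output distribution, so the "vocabulary" is exactly the set of input positions:

```python
import numpy as np

enc = np.random.randn(7, 32)                 # encoder states e_1..e_7
dec = np.random.randn(32)                    # current decoder state d_i
W1, W2 = np.random.randn(32, 32), np.random.randn(32, 32)
v = np.random.randn(32)

scores = np.tanh(enc @ W1 + dec @ W2) @ v    # u_j = v^T tanh(W1 e_j + W2 d_i)
p = np.exp(scores - scores.max()); p /= p.sum()  # softmax over the 7 inputs
print(p.argmax())                            # "pointer": index of an input token
```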
📄 **[Read on arXiv](https://arxiv.org/abs/1511.07122)** This paper introduced dilated (atrous) convolutions as a principled alternative to the downsample-then-upsample paradigm for dense prediction tasks. By inserting g…
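A 1-D sketch of the idea: spacing the kernel taps d positions apart grows the receptive field exponentially as d doubles layer by layer, with no pooling and no resolution loss:

```python
import numpy as np

def dilated_conv1d(x, kernel, d):
    k, n = len(kernel), len(x)
    span = (k - 1) * d                       # receptive field of this layer
    return np.array([sum(kernel[j] * x[i + j * d] for j in range(k))
                     for i in range(n - span)])

x = np.arange(16, dtype=float)
print(dilated_conv1d(x, np.array([1., 1., 1.]), d=1).shape)  # (14,)
print(dilated_conv1d(x, np.array([1., 1., 1.]), d=4).shape)  # (8,): span = 8
```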
📄 **[Read on arXiv](https://arxiv.org/abs/1512.02595)** Amodei et al., ICML, 2016. Deep Speech 2 is an end-to-end speech recognition system where a single RNN trained with CTC loss on spectrograms replaces the entire tr…
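A minimal sketch of the CTC objective using PyTorch's `nn.CTCLoss` (dimensions illustrative; the paper predates PyTorch and used its own implementation):

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 29, 12        # time steps, batch, characters, target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S))        # character transcripts (0 = blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((N,), T),
                           target_lengths=torch.full((N,), S))
loss.backward()  # CTC marginalizes over all alignments; end-to-end trainable
```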
📄 **[Read on arXiv](https://arxiv.org/abs/1512.03385)** He, Zhang, Ren, Sun (Microsoft Research), CVPR, 2016. Deep Residual Learning introduces skip connections that add the i…
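The core move in a few lines, with small random matrices standing in for conv-BN layers: the block outputs F(x) + x, so at initialization it sits close to an identity map:

```python
import numpy as np

W1 = np.random.randn(64, 64) * 0.01
W2 = np.random.randn(64, 64) * 0.01

def residual_block(x):
    f = np.maximum(x @ W1, 0) @ W2     # the residual F(x): weight, ReLU, weight
    return np.maximum(f + x, 0)        # add the identity shortcut, then ReLU

x = np.random.randn(64)
print(np.allclose(residual_block(x), np.maximum(x, 0), atol=0.1))  # near-identity
```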
📄 **[Course Website](https://cs231n.stanford.edu/)** Li, Karpathy, and Johnson, Stanford University, 2015 (ongoing). CS231n is a widely used Stanford deep learning for computer v…
📄 **[Read on arXiv](https://arxiv.org/abs/1409.2329)** This paper discovered that dropout can be successfully applied to LSTMs if it is restricted to non-recurrent (feedforward) connections only, preserving the LSTM's a…
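A sketch of the recipe with a toy recurrence: dropout masks hit only the vertical input→hidden and hidden→output paths, never the recurrent h→h path, so cell memory can persist across time steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5):
    return a * (rng.random(a.shape) > p) / (1 - p)

W_x, W_h, W_y = (rng.normal(size=(16, 16)) for _ in range(3))
h = np.zeros(16)
for x in rng.normal(size=(10, 16)):          # 10 time steps
    h = np.tanh(dropout(x) @ W_x + h @ W_h)  # dropout on the input path only;
    y = dropout(h) @ W_y                     # the h @ W_h recurrence is untouched
```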
📄 **[Read on arXiv](https://arxiv.org/abs/1405.6903)** This paper bridges thermodynamics and computational complexity to formalize a deep intuition: mixing cream into coffee produces increasingly complex patterns (swirl…
📄 **[Read on arXiv](https://arxiv.org/abs/1410.5401)** Neural Turing Machines (NTMs) augment neural networks with a differentiable external memory matrix and soft attention-based read/write heads, enabling them to learn…
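A minimal sketch of the content-based read (location-based addressing and the write heads are omitted): cosine-match a controller-emitted key against every memory row, sharpen with β, and read a soft mixture of rows:

```python
import numpy as np

M = np.random.randn(128, 20)                 # memory: 128 slots x 20 dims
key, beta = np.random.randn(20), 5.0         # emitted by the controller

sim = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
w = np.exp(beta * sim); w /= w.sum()         # soft attention over slots
r = w @ M                                    # differentiable read vector
```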
📄 **[Read on arXiv](https://arxiv.org/abs/1409.0473)** This paper introduced the attention mechanism to deep learning, arguably the single most influential architectural innovation leading to modern transformers and LLM…
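A minimal sketch of the additive attention step with random stand-in weights: score each encoder annotation h_j against the decoder state, softmax the scores, and form the context vector fed to the next decoding step:

```python
import numpy as np

H = np.random.randn(9, 32)                   # encoder annotations h_1..h_9
s = np.random.randn(32)                      # previous decoder state
W_a, U_a = np.random.randn(32, 32), np.random.randn(32, 32)
v_a = np.random.randn(32)

e = np.tanh(s @ W_a + H @ U_a) @ v_a         # alignment scores e_ij
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # attention weights
context = alpha @ H                          # expected annotation c_i
```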
📄 **[Read Paper](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)** AlexNet, as this paper's architecture came to be known, is a deep convolutional neural network trained on GPUs th…
📄 **[Read Blog Post](https://scottaaronson.blog/?p=762)** Scott Aaronson's blog post highlights an asymmetry between entropy and complexity as a way of thinking about structure formation in physical and computational sy…
📄 **[Read Thesis](https://www.vetta.org/documents/Machine_Super_Intelligence.pdf)** Shane Legg's 2008 PhD thesis provides perhaps the most rigorous mathematical definition of general intelligence, grounding informal int…
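The thesis's central definition, the Legg–Hutter universal intelligence measure: an agent π is scored by its expected cumulative reward V in every computable environment μ, weighted by the environment's simplicity 2^{-K(μ)}:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V_\mu^\pi
```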
📄 **[Read on arXiv](https://arxiv.org/abs/math/0406077)** Grünwald, MIT Press, 2004. The Minimum Description Length (MDL) principle formalizes Occam's r…
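In its crude two-part form, the principle picks the hypothesis that minimizes the code length of the hypothesis plus the code length of the data given that hypothesis:

```latex
H_{\mathrm{MDL}} \;=\; \operatorname*{arg\,min}_{H \in \mathcal{H}} \bigl[\, L(H) + L(D \mid H) \,\bigr]
```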
📄 **[Read Paper](https://www.cs.toronto.edu/~hinton/absps/colt93.pdf)** This paper by Hinton and van Camp bridges information theory and neural network generalization by proposing that model complexity should be measure…
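In modern notation this is the variational free energy: the expected description length decomposes into a weight cost (a KL term against the prior, via the bits-back argument) plus a data-misfit cost (stated here as a reading aid, not a quote from the paper):

```latex
\mathcal{L}(Q) \;=\; \underbrace{D_{\mathrm{KL}}\bigl(Q(w) \,\|\, P(w)\bigr)}_{\text{weight cost}}
\;+\; \underbrace{\mathbb{E}_{w \sim Q}\bigl[-\log p(D \mid w)\bigr]}_{\text{data misfit}}
```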