Papers
29 paper summaries tagged ilya-30
📄 **[Read on arXiv](https://arxiv.org/abs/2201.11903)** Wei et al., NeurIPS, 2022. Chain-of-thought (CoT) prompting demonstrates that including interme…
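As a quick illustration of the technique (paraphrasing the paper's Figure 1 exemplar; the model call itself is omitted), the demonstration includes intermediate reasoning before the final answer, and the model imitates that format on the new question:

```python
# Chain-of-thought few-shot prompt, paraphrasing the paper's Figure 1 example.
# The exemplar shows its reasoning before the answer; a standard prompt would
# jump straight to "The answer is 11."
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
print(cot_prompt)  # fed to a sufficiently large LM, the expected answer is 9
```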
📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR, 2021. The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…
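A minimal sketch of the input pipeline the summary refers to, with random arrays standing in for learned parameters:

```python
import numpy as np

# Split the image into 16x16 patches, flatten each, and linearly project
# into the embedding space (dimensions illustrative; E, pos, cls would be
# learned in a real model).
img = np.random.rand(224, 224, 3)
P, D = 16, 768
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)     # (196, 768): one row per patch
E = np.random.randn(P * P * 3, D) * 0.02     # patch projection
pos = np.random.randn(197, D) * 0.02         # position embeddings
cls = np.zeros((1, D))                       # prepended [CLS] token
seq = np.concatenate([cls, patches @ E]) + pos
print(seq.shape)  # (197, 768): what a standard Transformer encoder consumes
```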
📄 **[Read on arXiv](https://arxiv.org/abs/2001.08361)** This is the canonical early scaling-law paper for language models, authored by Kaplan et al. at OpenAI. It demonstrated that neural language model cross-entropy lo…
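The headline fit is a power law in non-embedding parameter count N; a tiny sketch using the paper's reported constants (α_N ≈ 0.076, N_c ≈ 8.8e13), valid when data and compute are not bottlenecks:

```python
# Test loss (nats per token) as a power law in parameter count N,
# with the paper's fitted constants.
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

for n in (1e6, 1e9, 1e12):
    print(f"N={n:.0e}: L≈{loss_from_params(n):.3f}")
# Each 1000x increase in N multiplies the loss by 1000**-0.076 ≈ 0.59.
```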
📄 **[Read on arXiv](https://arxiv.org/abs/2006.11239)** Ho, Jain, and Abbeel, NeurIPS, 2020. Denoising Diffusion Probabilistic Models (DDPM) demonstrate that high-quality ima…
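A minimal sketch of the training objective in the paper's notation (`eps_model` is a dummy stand-in for the learned noise predictor):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)              # the paper's linear schedule
alpha_bar = np.cumprod(1.0 - betas)             # \bar{alpha}_t

def eps_model(x_t, t):                          # placeholder network
    return np.zeros_like(x_t)

x0 = np.random.rand(32, 32)                     # a clean training image
t = np.random.randint(T)
eps = np.random.randn(*x0.shape)
# Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
loss = np.mean((eps - eps_model(x_t, t)) ** 2)  # simple L2 on the noise
```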
📄 **[Read on arXiv](https://arxiv.org/abs/1811.06965)** GPipe introduces micro-batch pipeline parallelism as a practical method for training neural networks too large to fit on a single accelerator. The core idea is to…
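A back-of-the-envelope sketch of why micro-batching helps: per the paper's analysis, with K pipeline stages and M micro-batches the idle "bubble" fraction of the schedule is (K - 1) / (M + K - 1), which raising M drives toward zero:

```python
# Idle fraction of the GPipe schedule, per the paper's bubble analysis.
def bubble_fraction(num_stages, num_microbatches):
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(8, 1))    # no micro-batching: 0.875 of time is idle
print(bubble_fraction(8, 32))   # M=32 micro-batches: ~0.18
```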
📄 **[Read on arXiv](https://arxiv.org/abs/1806.01822)** Traditional RNNs (LSTMs, GRUs) compress all sequential information into a single fixed-size hidden vector, which fundamentally limits their ability to store and re…
📄 **[Read on arXiv](https://arxiv.org/abs/1704.01212)** This paper provided the conceptual unification that the graph neural network field needed. By showing that seemingly different architectures -- GCN, GraphSAGE, Gat…
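A toy instance of the framework's message-passing step (linear maps stand in for the learned message function M and update function U):

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]            # a 3-node directed cycle
h = np.random.randn(3, 4)                   # node states, d = 4
W_msg, W_upd = np.random.randn(4, 4), np.random.randn(8, 4)

m = np.zeros_like(h)
for v, w in edges:
    m[v] += h[w] @ W_msg                    # m_v = sum over neighbours of M(h_w)
h_next = np.tanh(np.concatenate([h, m], axis=1) @ W_upd)  # h_v' = U(h_v, m_v)
```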
📄 **[AMS Book Page](https://bookstore.ams.org/surv-220)** This monograph by Shen, Uspensky, and Vereshchagin is the definitive modern reference on algorithmic information theory. The central concept is Kolmogorov comple…
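For reference, the central definition in standard notation: the complexity of a string x is the length of the shortest program producing it on a fixed universal machine U, well-defined up to an additive constant by the invariance theorem.

```latex
K_U(x) = \min \{\, \ell(p) : U(p) = x \,\}, \qquad
K_U(x) \le K_V(x) + c_{U,V} \ \text{for any machine } V
```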
📄 **[Read on arXiv](https://arxiv.org/abs/1706.03762)** Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS, 2017. - [The Annotated Transformer](htt…
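A minimal sketch of the paper's scaled dot-product attention for a single head, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V:

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

Q, K, V = (np.random.randn(5, 64) for _ in range(3))
print(attention(Q, K, V).shape)                     # (5, 64)
```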
📄 **[Read on arXiv](https://arxiv.org/abs/1706.01427)** Santoro, Raposo, Barrett, Malinowski, Pascanu, Battaglia, Lillicrap (DeepMind), NeurIPS, 2017. Relation Networks (RNs) are simple neural network modules for relat…
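A toy rendering of the paper's formula RN(O) = f_φ(Σ_{i,j} g_θ(o_i, o_j)), with one-layer maps standing in for the learned MLPs:

```python
import numpy as np

# g scores every ordered pair of objects; the pooled sum goes through f.
objects = np.random.randn(6, 8)             # 6 objects, 8-dim each
W_g = np.random.randn(16, 12)
W_f = np.random.randn(12, 10)

pair_sum = np.zeros(12)
for o_i in objects:
    for o_j in objects:
        pair_sum += np.tanh(np.concatenate([o_i, o_j]) @ W_g)  # g_theta
output = np.tanh(pair_sum @ W_f)            # f_phi -> task logits
```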
📄 **[Read on arXiv](https://arxiv.org/abs/1611.02731)** The Variational Lossy Autoencoder (VLAE) by Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Sutskever, and Abbeel (2016) addresses the fundamental tension in VAE…
📄 **[Read on arXiv](https://arxiv.org/abs/1511.06391)** This paper by Samy Bengio, Oriol Vinyals, and Manjunath Kudlur challenges a core assumption in sequence modeling: that the order of input and output data is merely…
📄 **[Read on arXiv](https://arxiv.org/abs/1603.05027)** This paper, a follow-up to the original ResNet work, provides both theoretical analysis and empirical evidence that the arrangement of operations within residual b…
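A schematic contrast of the two orderings, assuming identity placeholders for `bn` and `conv` (only the order of operations is the point):

```python
import numpy as np

bn = conv = lambda t: t                    # placeholders for the real layers
relu = lambda t: np.maximum(t, 0)

def v1_unit(x):      # original ResNet: residual + shortcut, then a final ReLU
    return relu(bn(conv(relu(bn(conv(x))))) + x)

def preact_unit(x):  # proposed: BN-ReLU-conv twice; identity path untouched
    return conv(relu(bn(conv(relu(bn(x)))))) + x

x = np.random.randn(4)
print(v1_unit(x), preact_unit(x))
```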
📄 **[Read Blog Post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)** Christopher Olah's 2015 blog post is a widely used pedagogical reference for understanding LSTM internals. The post explains why vanilla…
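For reference, the gate equations the post walks through, in standard notation (σ is the logistic sigmoid, ⊙ the elementwise product):

```latex
f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f) \qquad
i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i) \qquad
o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o)

\tilde{C}_t = \tanh(W_C\,[h_{t-1}, x_t] + b_C) \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad
h_t = o_t \odot \tanh(C_t)
```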
📄 **[Read Blog Post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)** Andrej Karpathy's 2015 blog post offers a vivid qualitative demonstration that character-level recurrent neural networks with LSTM cells c…
📄 **[Read on arXiv](https://arxiv.org/abs/1506.03134)** Pointer Networks repurpose the attention mechanism as an output distribution, replacing the fixed output vocabulary of sequence-to-sequence models with attention w…
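A minimal sketch of the pointer mechanism with random stand-in weights: additive attention scores over the encoder states are themselves the output distribution, so the "vocabulary" is exactly the set of input positions:

```python
import numpy as np

enc = np.random.randn(7, 32)                 # encoder states e_1..e_7
dec = np.random.randn(32)                    # current decoder state d_i
W1, W2 = np.random.randn(32, 32), np.random.randn(32, 32)
v = np.random.randn(32)

scores = np.tanh(enc @ W1 + dec @ W2) @ v    # u_j = v^T tanh(W1 e_j + W2 d_i)
p = np.exp(scores - scores.max()); p /= p.sum()  # softmax over the 7 inputs
print(p.argmax())                            # "pointer": index of an input token
```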
📄 **[Read on arXiv](https://arxiv.org/abs/1511.07122)** This paper introduced dilated (atrous) convolutions as a principled alternative to the downsample-then-upsample paradigm for dense prediction tasks. By inserting g…
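A 1-D sketch of the idea: spacing the kernel taps d positions apart grows the receptive field exponentially as d doubles layer by layer, with no pooling and no resolution loss:

```python
import numpy as np

def dilated_conv1d(x, kernel, d):
    k, n = len(kernel), len(x)
    span = (k - 1) * d                       # receptive field of this layer
    return np.array([sum(kernel[j] * x[i + j * d] for j in range(k))
                     for i in range(n - span)])

x = np.arange(16, dtype=float)
print(dilated_conv1d(x, np.array([1., 1., 1.]), d=1).shape)  # (14,)
print(dilated_conv1d(x, np.array([1., 1., 1.]), d=4).shape)  # (8,): span = 8
```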
📄 **[Read on arXiv](https://arxiv.org/abs/1512.02595)** Amodei et al., ICML, 2016. Deep Speech 2 is an end-to-end speech recognition system where a single RNN trained with CTC loss on spectrograms replaces the entire tr…
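A minimal sketch of the CTC objective using PyTorch's `nn.CTCLoss` (dimensions illustrative; the paper predates PyTorch and used its own implementation):

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 29, 12        # time steps, batch, characters, target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S))        # character transcripts (0 = blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((N,), T),
                           target_lengths=torch.full((N,), S))
loss.backward()  # CTC marginalizes over all alignments; end-to-end trainable
```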
📄 **[Read on arXiv](https://arxiv.org/abs/1512.03385)** He, Zhang, Ren, Sun (Microsoft Research), CVPR, 2016. Deep Residual Learning introduces skip connections that add the i…
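The core move in a few lines, with small random matrices standing in for conv-BN layers: the block outputs F(x) + x, so at initialization it sits close to an identity map:

```python
import numpy as np

W1 = np.random.randn(64, 64) * 0.01
W2 = np.random.randn(64, 64) * 0.01

def residual_block(x):
    f = np.maximum(x @ W1, 0) @ W2     # the residual F(x): weight, ReLU, weight
    return np.maximum(f + x, 0)        # add the identity shortcut, then ReLU

x = np.random.randn(64)
print(np.allclose(residual_block(x), np.maximum(x, 0), atol=0.1))  # near-identity
```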
📄 **[Course Website](https://cs231n.stanford.edu/)** Li, Karpathy, and Johnson, Stanford University, 2015 (ongoing). CS231n is a widely used Stanford deep learning for computer v…
📄 **[Read on arXiv](https://arxiv.org/abs/1409.2329)** This paper discovered that dropout can be successfully applied to LSTMs if it is restricted to non-recurrent (feedforward) connections only, preserving the LSTM's a…
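A sketch of the recipe with a toy recurrence: dropout masks hit only the vertical input→hidden and hidden→output paths, never the recurrent h→h path, so cell memory can persist across time steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5):
    return a * (rng.random(a.shape) > p) / (1 - p)

W_x, W_h, W_y = (rng.normal(size=(16, 16)) for _ in range(3))
h = np.zeros(16)
for x in rng.normal(size=(10, 16)):          # 10 time steps
    h = np.tanh(dropout(x) @ W_x + h @ W_h)  # dropout on the input path only;
    y = dropout(h) @ W_y                     # the h @ W_h recurrence is untouched
```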
📄 **[Read on arXiv](https://arxiv.org/abs/1405.6903)** This paper bridges thermodynamics and computational complexity to formalize a deep intuition: mixing cream into coffee produces increasingly complex patterns (swirl…
📄 **[Read on arXiv](https://arxiv.org/abs/1410.5401)** Neural Turing Machines (NTMs) augment neural networks with a differentiable external memory matrix and soft attention-based read/write heads, enabling them to learn…
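A minimal sketch of the content-based read (location-based addressing and the write heads are omitted): cosine-match a controller-emitted key against every memory row, sharpen with β, and read a soft mixture of rows:

```python
import numpy as np

M = np.random.randn(128, 20)                 # memory: 128 slots x 20 dims
key, beta = np.random.randn(20), 5.0         # emitted by the controller

sim = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
w = np.exp(beta * sim); w /= w.sum()         # soft attention over slots
r = w @ M                                    # differentiable read vector
```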
📄 **[Read on arXiv](https://arxiv.org/abs/1409.0473)** This paper introduced the attention mechanism to deep learning, arguably the single most influential architectural innovation leading to modern transformers and LLM…
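A minimal sketch of the additive attention step with random stand-in weights: score each encoder annotation h_j against the decoder state, softmax the scores, and form the context vector fed to the next decoding step:

```python
import numpy as np

H = np.random.randn(9, 32)                   # encoder annotations h_1..h_9
s = np.random.randn(32)                      # previous decoder state
W_a, U_a = np.random.randn(32, 32), np.random.randn(32, 32)
v_a = np.random.randn(32)

e = np.tanh(s @ W_a + H @ U_a) @ v_a         # alignment scores e_ij
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # attention weights
context = alpha @ H                          # expected annotation c_i
```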
📄 **[Read Paper](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)** AlexNet, as this paper's architecture came to be known, is a deep convolutional neural network trained on GPUs th…
📄 **[Read Blog Post](https://scottaaronson.blog/?p=762)** Scott Aaronson's blog post highlights an asymmetry between entropy and complexity as a way of thinking about structure formation in physical and computational sy…
📄 **[Read Thesis](https://www.vetta.org/documents/Machine_Super_Intelligence.pdf)** Shane Legg's 2008 PhD thesis provides perhaps the most rigorous mathematical definition of general intelligence, grounding informal int…
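The thesis's central definition, the Legg–Hutter universal intelligence measure: an agent π is scored by its expected cumulative reward V in every computable environment μ, weighted by the environment's simplicity 2^{-K(μ)}:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V_\mu^\pi
```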
📄 **[Read on arXiv](https://arxiv.org/abs/math/0406077)** Grünwald, MIT Press, 2004. The Minimum Description Length (MDL) principle formalizes Occam's r…
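In its crude two-part form, the principle picks the hypothesis that minimizes the code length of the hypothesis plus the code length of the data given that hypothesis:

```latex
H_{\mathrm{MDL}} \;=\; \operatorname*{arg\,min}_{H \in \mathcal{H}} \bigl[\, L(H) + L(D \mid H) \,\bigr]
```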
📄 **[Read Paper](https://www.cs.toronto.edu/~hinton/absps/colt93.pdf)** This paper by Hinton and van Camp bridges information theory and neural network generalization by proposing that model complexity should be measure…
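In modern notation this is the variational free energy: the expected description length decomposes into a weight cost (a KL term against the prior, via the bits-back argument) plus a data-misfit cost (stated here as a reading aid, not a quote from the paper):

```latex
\mathcal{L}(Q) \;=\; \underbrace{D_{\mathrm{KL}}\bigl(Q(w) \,\|\, P(w)\bigr)}_{\text{weight cost}}
\;+\; \underbrace{\mathbb{E}_{w \sim Q}\bigl[-\log p(D \mid w)\bigr]}_{\text{data misfit}}
```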