📄 Course Website
CS231n: Deep Learning for Computer Vision
Citation
Li, Karpathy, and Johnson, Stanford University, 2015 (ongoing).
Canonical link: https://cs231n.stanford.edu/
Overview
CS231n is a widely used Stanford course on deep learning for computer vision, taught by Fei-Fei Li, Andrej Karpathy, and Justin Johnson, covering topics from backpropagation through computational graphs to modern architectures, transfer learning, and generative models. Within this wiki it is treated as foundational background material rather than as a source of novel experimental claims.
The course's open lectures, notes, and assignments (implementing k-NN, SVMs, two-layer nets, CNNs, and RNNs from scratch in NumPy/PyTorch) provide hands-on coverage of gradient flow, weight initialization, batch normalization, and architectural design. Its practical emphasis on pretrained features and transfer learning made it an influential teaching resource for applied computer vision.
Ilya Sutskever's inclusion of this course in his reading list highlights it as useful background for understanding canonical vision architectures and training practice. Because it is a course resource rather than a research paper, some of the framing here is necessarily interpretive rather than paper-style claim extraction.
Key Contributions
- Computational graph framework for backpropagation: Presents neural networks as directed acyclic graphs (DAGs) in which each node computes a local gradient, making the chain rule intuitive and implementation-friendly
- CNN architecture design principles: Covers the progression from LeNet to AlexNet to VGG to ResNet, explaining how depth, receptive field, skip connections, and batch normalization each address specific optimization or generalization problems
- Transfer learning as default methodology: Demonstrates that ImageNet-pretrained features transfer to most vision tasks, reducing the need for task-specific architecture engineering
- Training recipes: Systematic coverage of learning rate schedules, data augmentation, dropout, weight decay, and hyperparameter search strategies that became standard practice
- Generative models: Introduces VAEs, GANs, and (in later editions) diffusion models and Vision Transformers, connecting discriminative and generative paradigms
Architecture / Method
The course follows a bottom-up pedagogical structure. It begins with simple classifiers on raw image pixels (k-NN as a baseline, then linear SVM and softmax classifiers), and introduces neural networks as stacked linear transformations interleaved with nonlinearities. Backpropagation is taught through the computational graph framework, where each operation (add, multiply, max, etc.) has a local gradient and the chain rule composes them.
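As a concrete illustration of that node-by-node gradient flow, here is a minimal sketch for f(x, y, z) = (x + y) * z; the specific input values are illustrative:

```python
# A minimal sketch of backprop through a computational graph for
# f(x, y, z) = (x + y) * z: each node computes its output on the forward
# pass, then multiplies the upstream gradient by its local gradient on
# the backward pass.

x, y, z = -2.0, 5.0, -4.0

# forward pass: build the graph node by node
q = x + y          # add node:      q = 3
f = q * z          # multiply node: f = -12

# backward pass: chain rule, one node at a time
df_df = 1.0                 # seed gradient at the output
df_dq = z * df_df           # multiply node: local grad wrt q is z
df_dz = q * df_df           # multiply node: local grad wrt z is q
df_dx = 1.0 * df_dq         # add node: local grad wrt x is 1
df_dy = 1.0 * df_dq         # add node: local grad wrt y is 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```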
Convolutional neural networks are introduced as neural networks with three key structural priors: local connectivity (each neuron connects to a small spatial region), weight sharing (the same filter is applied across all spatial positions), and spatial pooling (progressively reducing spatial resolution). The course traces the historical development from LeNet-5 through AlexNet (which popularized ReLU activations, dropout, and GPU training), VGGNet (deep stacks of 3x3 filters), GoogLeNet (Inception modules), and ResNet (skip connections).
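A minimal NumPy sketch of a single-channel convolution makes the first two priors concrete; the shapes and values here are illustrative, not taken from course assignments:

```python
import numpy as np

# Naive 2D convolution (no padding, stride 1) over one channel.
# Local connectivity: each output depends only on a k x k window.
# Weight sharing: the same filter w and bias b are reused at every position.

def conv2d_naive(x, w, b):
    """x: (H, W) input, w: (k, k) filter, b: scalar bias."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k]        # local k x k window
            out[i, j] = np.sum(patch * w) + b  # same weights everywhere
    return out

x = np.random.randn(8, 8)
w = np.random.randn(3, 3)
print(conv2d_naive(x, w, 0.0).shape)  # (6, 6)
```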
Practical training methodology is covered extensively: batch normalization for stabilizing training, data augmentation (random crops, flips, color jitter) for regularization, learning rate warmup and decay schedules, and hyperparameter search strategies (favoring random search over grid search). Transfer learning is presented as the default approach: initialize with ImageNet-pretrained weights, freeze early layers, and fine-tune the later layers and classification head on the target task.
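A minimal PyTorch sketch of that default recipe, assuming torchvision's pretrained ResNet-18 as the backbone; the model choice, layer split, and hyperparameters below are illustrative assumptions, not course-prescribed values:

```python
import torch
import torch.nn as nn
import torchvision

# Load an ImageNet-pretrained backbone (torchvision >= 0.13 weights API).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freeze all pretrained layers...
for p in model.parameters():
    p.requires_grad = False

# ...replace the classification head for a hypothetical 10-class task
# (new modules have requires_grad=True by default)...
model.fc = nn.Linear(model.fc.in_features, 10)

# ...and optionally unfreeze the last residual stage for fine-tuning.
for p in model.layer4.parameters():
    p.requires_grad = True

# Optimize only the trainable parameters, with a decay schedule.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9, weight_decay=1e-4,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
```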
Later modules cover recurrent neural networks (for sequences and captioning), attention mechanisms, generative adversarial networks, variational autoencoders, and (in recent editions) Vision Transformers and diffusion models.
Results
- Neural networks are universal function approximators: The course grounds this in the universal approximation theorem and demonstrates empirically through assignments that even 2-layer networks can fit complex decision boundaries
- Convolutional structure is a strong inductive bias for images: Weight sharing and local connectivity reduce parameters by orders of magnitude vs. fully-connected layers while improving generalization on spatial data, as demonstrated by CNN vs. MLP comparisons on CIFAR-10 (see the parameter-count sketch after this list)
- Transfer learning outperforms training from scratch on small datasets: Fine-tuning an ImageNet-pretrained ResNet on a target dataset with ~1000 images beats training a CNN from scratch, shown in assignment experiments
- Educational impact: Tens of thousands of students have taken the course or followed the open materials, making it a primary entry point for computer vision researchers worldwide
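To ground the parameter-count claim above, here is a back-of-the-envelope comparison on a CIFAR-10-sized input; the layer widths are illustrative choices, not taken from the course:

```python
# Rough parameter counts for one layer on a 32x32x3 input (CIFAR-10 size).

# Fully-connected layer: every input value connects to every hidden unit.
in_features = 32 * 32 * 3        # 3,072 inputs
hidden_units = 512
fc_params = in_features * hidden_units + hidden_units  # weights + biases
print(f"fully-connected: {fc_params:,} params")        # 1,573,376

# Convolutional layer: 64 filters of size 3x3x3, shared across positions.
filters, k, in_ch = 64, 3, 3
conv_params = filters * (k * k * in_ch) + filters      # weights + biases
print(f"conv 3x3x3 x64:  {conv_params:,} params")      # 1,792
```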
Limitations & Open Questions
- The course focuses primarily on image classification; detection, segmentation, and video understanding receive comparatively shallow coverage
- As a pedagogical resource rather than a research paper, it does not contain novel experimental results or theoretical contributions
- The rapid pace of the field means specific architecture recommendations (e.g., VGG, GoogLeNet) become outdated, though the underlying principles remain relevant