
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale


Citation

Dosovitskiy et al., ICLR, 2021.

Overview

The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences of image patches -- with no convolutions -- matches or exceeds state-of-the-art CNNs on image classification when pre-trained on large-scale data. This was the definitive demonstration that the Transformer architecture generalizes beyond language to vision.

ViT showed that the inductive biases hard-coded into CNNs (local connectivity, translation equivariance) are not necessary when sufficient data is available. By splitting images into 16x16 patches, linearly embedding them, and feeding the resulting sequence into a standard Transformer encoder, the paper achieved state-of-the-art ImageNet accuracy while using substantially less compute than the best CNNs. The critical finding was scale-dependent performance: ViT underperforms ResNets on small datasets but dominates at scale, revealing a tradeoff between built-in inductive bias and the flexibility to learn structure from data, whose balance shifts with dataset size.

This result triggered a paradigm shift. Within two years, vision transformers replaced CNNs as the dominant architecture in computer vision, and the patch-as-token idea enabled unified architectures across vision, language, and multimodal tasks (CLIP, DALL-E, PaLM-E). ViT is now the standard vision encoder in virtually all modern multimodal systems.

Key Contributions

  • Patches as tokens: Divide a 224x224 image into 196 non-overlapping 16x16 patches, flatten each into a 768-dim vector (16x16x3 for RGB), and linearly project -- creating a sequence that a standard Transformer encoder processes without modification
  • Learnable [CLS] token and position embeddings: A prepended classification token aggregates global information; learnable 1D position embeddings encode spatial layout (shown to learn 2D structure automatically)
  • Scale-dependent performance: ViT underperforms ResNets when trained only on ImageNet-1k (~1.3M images) but dominates when pre-trained on ImageNet-21k (14M) or JFT-300M (300M), revealing that Transformers need more data to compensate for weaker inductive bias
  • Compute-efficient scaling: ViT-L/16 pre-trained on JFT-300M achieves 87.76% ImageNet top-1 accuracy using approximately 15x less compute than BiT-L (0.68k vs 9.9k TPUv3-core-days) at similar accuracy
  • Attention distance analysis: Some attention heads attend broadly across the image even in early layers, while others maintain local attention patterns; attention distance generally increases with network depth, suggesting the model learns a mix of local and global representations despite having no built-in locality bias
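
The attention-distance analysis can be reproduced approximately from a trained model's attention maps. The sketch below is an illustrative helper, not the authors' analysis code; it assumes a softmax-normalized attention matrix for a single head with the [CLS] token already dropped:

```python
import numpy as np

def mean_attention_distance(attn, grid=14, patch_size=16):
    """Mean attention distance (in pixels) for one head's attention map.

    attn: (N, N) attention weights over the N = grid**2 patch tokens.
    Each query's distances to all keys are weighted by its attention
    weights, then averaged over queries."""
    # (x, y) positions of patch centers in pixel coordinates
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid)), -1).reshape(-1, 2)
    coords = coords * patch_size
    # pairwise Euclidean distances between patch centers, shape (N, N)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float((attn * dists).sum(axis=-1).mean())
```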

Architecture / Method

                      Vision Transformer (ViT)
────────────────────────────────────────────────────────────────

  Input Image (224 x 224)
          │
          ▼
  Split into 196 patches (16 x 16 each)
  [p_1] [p_2] [p_3] ... [p_196]
          │
          ▼   Flatten + Linear Projection (to dim D)

  [CLS] [e_1] [e_2] [e_3] ... [e_196]
    +     +     +     +          +        Learnable Position
  pos_0 pos_1 pos_2 pos_3 ... pos_196     Embeddings
          │
          ▼
  ┌──────────────────────────────┐
  │  Transformer Encoder Block   │   x L layers
  ├──────────────────────────────┤   (L = 12 Base,
  │   Layer Norm                 │    24 Large,
  │   Multi-Head Self-Attention  │    32 Huge)
  │   + Residual                 │
  ├──────────────────────────────┤
  │   Layer Norm                 │
  │   MLP (GELU)                 │
  │   + Residual                 │
  └──────────────────────────────┘
          │
          ▼   ([CLS] token output)
  ┌──────────────┐
  │ Linear Head  │ ──►  Class Prediction
  └──────────────┘

Vision Transformer architecture overview

ViT uses a standard Transformer encoder (identical to the original Transformer's encoder from Vaswani et al.) with minimal modifications for vision. An input image of size H x W is divided into a grid of N = HW/P^2 non-overlapping patches of size P x P (typically P=16, giving N=196 patches for 224x224 images). Each patch is flattened into a vector and linearly projected to dimension D (768 for ViT-Base, 1024 for ViT-Large, 1280 for ViT-Huge).
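
A minimal sketch of this patchify-and-project step in PyTorch. The paper describes flattening each patch and applying a shared linear projection; the strided convolution below (kernel = stride = P) is a common equivalent implementation, and the `PatchEmbed` name and defaults here are illustrative rather than the authors' code:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly project each one to dim D."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = HW / P^2 = 196
        # Conv2d with kernel = stride = P is equivalent to flattening each
        # patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)     # (B, N, D)


tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 196, 768])
```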

A learnable [CLS] token is prepended to the patch sequence, and learnable 1D position embeddings are added to all tokens (including [CLS]). The resulting sequence of N+1 tokens is processed by L Transformer encoder layers, each consisting of a multi-head self-attention (MSA) sub-layer and an MLP with GELU activation, with layer normalization applied before each sub-layer (Pre-LN) and residual connections around each. The [CLS] token's output is passed through a classification head (a single linear layer during fine-tuning).
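
A compact sketch of that forward pass, assuming the `PatchEmbed` module sketched above and using `torch.nn.MultiheadAttention` for the MSA sub-layer; class names and defaults (ViT-Base sizes) are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-LN block: LN -> MSA -> residual, then LN -> MLP(GELU) -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # MSA + residual
        return x + self.mlp(self.norm2(x))                  # MLP + residual

class ViTClassifier(nn.Module):
    """[CLS] token + learnable 1D position embeddings + L encoder blocks + linear head."""
    def __init__(self, num_patches=196, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.Sequential(*[EncoderBlock(dim, heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):                         # (B, N, D) from PatchEmbed
        cls = self.cls_token.expand(patch_tokens.size(0), -1, -1)
        x = torch.cat([cls, patch_tokens], dim=1) + self.pos_embed
        x = self.norm(self.blocks(x))
        return self.head(x[:, 0])                            # classify from [CLS] output
```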

Three model sizes are studied: ViT-Base (12 layers, 12 heads, 86M params), ViT-Large (24 layers, 16 heads, 307M params), and ViT-Huge (32 layers, 16 heads, 632M params). Pre-training uses Adam with linear warmup and cosine decay on JFT-300M or ImageNet-21k. Fine-tuning on downstream tasks uses SGD with momentum at higher resolution (384x384 or 512x512), with position embeddings interpolated to handle the increased number of patches.
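
One way to implement the 2D interpolation of position embeddings for higher-resolution fine-tuning; the paper specifies 2D interpolation of the pre-trained embeddings, while the bicubic mode and helper name below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate learned patch position embeddings to a new grid size,
    e.g. 14x14 (224px pre-training) -> 24x24 (384px fine-tuning), keeping
    the [CLS] embedding unchanged.  pos_embed: (1, 1 + old_grid**2, D)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)            # (1, 1 + new_grid**2, D)
```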

Results

Performance vs. pre-training dataset size showing ViT overtaking CNNs at scale

Scaling analysis: computational efficiency vs. performance trade-offs

Model             ImageNet Top-1   Pre-training Data   Compute (TPUv3-core-days)
ViT-H/14          88.55%           JFT-300M            2.5k
ViT-L/16          87.76%           JFT-300M            0.68k
BiT-L (ResNet)    87.54%           JFT-300M            9.9k
EfficientNet-L2   88.4%            JFT-300M            -
  • ViT matches SOTA at scale: ViT-H/14 achieves 88.55% top-1 on ImageNet and 77.63% on VTAB when pre-trained on JFT-300M, surpassing EfficientNet-L2 (88.4%) while using fewer TPU-days
  • Computational efficiency: ViT-L/16 reaches accuracy comparable to BiT-L using 0.68k TPUv3-core-days vs. 9.9k -- approximately 15x more compute-efficient
  • Pre-training dataset size is critical: On ImageNet-1k alone, ViT-B achieves only ~77% (vs. ResNet-50 at ~79%); on JFT-300M, ViT-L reaches 87.8%, demonstrating a crossover where Transformers overtake CNNs as data scales. Performance "rapidly improves" with larger datasets, eventually surpassing all competitors at JFT-300M scale
  • Position embeddings learn 2D structure: Cosine similarity between learned position embeddings reveals a clear 2D grid pattern, showing the model discovers spatial relationships from data alone despite using 1D position embeddings (a minimal probe is sketched after this list)
  • Interpretability: Initial linear embedding filters resemble plausible basis functions for patches; some attention heads attend broadly across the image even in early layers while others maintain local attention, with attention distance generally increasing through network depth
  • No saturation at scale: Performance continues to improve as model size and data increase, with no sign of diminishing returns on JFT-300M, suggesting further gains from larger models
  • Transfer to multiple benchmarks: Strong results on CIFAR-10/100, Oxford Flowers, Oxford-IIIT Pets, and VTAB, establishing ViT as a general-purpose vision backbone
  • Central thesis: The work demonstrates that "large scale training trumps inductive bias" -- generic architectures can learn representations directly from data, potentially surpassing specialized architectures with better computational efficiency
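
The 2D-structure observation above can be probed in a few lines. The helper below is illustrative and assumes access to a trained model's learned `pos_embed` tensor with the [CLS] entry first:

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed, grid=14):
    """Cosine similarity of each learned patch position embedding with all others.

    pos_embed: (1, 1 + grid**2, D) with the [CLS] embedding first.
    Returns (grid, grid, grid, grid); slice [i, j] is the similarity map of
    patch (i, j) to every patch, which visualizes the learned 2D grid structure."""
    p = F.normalize(pos_embed[0, 1:], dim=-1)    # drop [CLS], unit-normalize rows
    sim = p @ p.T                                # (N, N) cosine similarities
    return sim.reshape(grid, grid, grid, grid)
```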

Limitations & Open Questions

  • Requires very large pre-training datasets (14M+ images) to outperform CNNs; on small or medium datasets, ViT is inferior without additional regularization (addressed by DeiT, which introduces distillation and augmentation)
  • Fixed patch size creates a resolution-computation tradeoff: smaller patches yield more tokens and quadratically more attention cost (see the arithmetic sketch after this list); efficient attention variants (Swin Transformer) address this
  • The paper evaluates only classification; extending ViT to dense prediction tasks (detection, segmentation) required architectural modifications (ViTDet, Segmenter, Swin)
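
For a rough sense of the patch-size tradeoff noted above, the arithmetic below counts tokens and the N^2 pairwise-attention term as the patch size shrinks (pure back-of-the-envelope, independent of any implementation):

```python
def token_count(image_size=224, patch_size=16):
    """Number of patch tokens: N = (image_size / patch_size) ** 2."""
    return (image_size // patch_size) ** 2

for p in (32, 16, 8):
    n = token_count(patch_size=p)
    # Self-attention compares every token with every other token: O(N^2)
    print(f"P={p:2d}: {n:4d} tokens, {n * n:,} attention pairs per head per layer")
```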

Connections