Summary
ViT (Vision Transformer) demonstrates that a pure transformer architecture applied directly to sequences of image patches can match or exceed state-of-the-art CNNs on image classification, when pre-trained at sufficient scale. The key insight: divide a 224×224 image into 16×16 patches (14 × 14 = 196 of them), flatten each patch, project it to an embedding vector, and treat the resulting sequence exactly like word tokens. No convolutions, no local pooling, just standard transformer self-attention. The catch: ViT has none of a CNN's built-in vision inductive biases (locality, translation equivariance), so it needs large-scale pre-training (14M–300M images) to learn them from data. Pre-trained on JFT-300M, ViT-H/14 achieves 88.55% on ImageNet while using roughly 4× less pre-training compute than the best ResNet baselines.
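The patch-to-token step above can be sketched in a few lines of numpy. The dimensions (224×224 input, 16×16 patches, D = 768) match the paper's base setting, but the projection matrix here is a random placeholder standing in for the trained linear layer:

```python
import numpy as np

H = W = 224          # input resolution
P = 16               # patch size
C = 3                # channels
D = 768              # embedding dimension (ViT-Base)

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Slice into non-overlapping P x P patches: (14, 14, 16, 16, 3)
patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
n_patches = (H // P) * (W // P)                  # 14 * 14 = 196

# Flatten each patch to P^2 * C = 768 values, then project to D dims
# (random matrix here; a trainable linear layer in the real model).
tokens = patches.reshape(n_patches, P * P * C)   # (196, 768)
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
embeddings = tokens @ W_proj                     # (196, 768)

print(embeddings.shape)  # (196, 768)
```

That P²·C = 768 happens to equal D for ViT-Base is a coincidence of this configuration; the projection is still needed, since other model sizes use different D.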
Key Claims
- A standard Transformer applied to sequences of image patches achieves excellent results on image classification with less compute than CNNs, when pre-trained at scale
- ViT underperforms CNNs when trained only on mid-sized datasets (e.g. ImageNet-1k, 1.3M images) because it lacks their convolutional inductive biases
- “Large scale training trumps inductive bias” — JFT-300M pre-training makes ViT competitive and eventually superior
- ViT-H/14: 88.55% ImageNet, 94.55% CIFAR-100, 77.63% VTAB (19 tasks) — all SOTA at time of publication
- ViT needs 2–4× less pre-training compute than comparable ResNets to reach the same performance when pre-trained on JFT-300M
- Position embeddings do not need to be 2D-aware; 1D learnable embeddings suffice (verified empirically)
Methods
- Patch Embedding: Slice the image into N non-overlapping P×P patches; flatten each to P²·C values (C = channels); project to D dimensions via a trainable linear layer
- CLS Token: Prepend a learnable classification token to the patch sequence (from BERT); its final-layer output serves as the image representation
- Position Embeddings: Add learnable 1D position embeddings to each token (including CLS); 2D-interpolated when fine-tuning at a higher resolution
- Transformer Encoder: Standard alternating multi-head self-attention + MLP blocks with LayerNorm pre-normalization and residual connections
- Classification Head: MLP with one hidden layer during pre-training; single linear layer during fine-tuning
- Scale: ViT-Base (86M params, 12 layers), ViT-Large (307M, 24 layers), ViT-Huge (632M, 32 layers); patch sizes /32, /16, or /14 (the suffix is the patch side in pixels, e.g. ViT-L/16)
- Pre-training: ImageNet-1k (1.3M images), ImageNet-21k (14M), JFT-300M (300M); Adam optimizer, batch size 4096, high weight decay of 0.1
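The components above (CLS token, 1D position embeddings, one pre-LN encoder block) can be sketched together in numpy. All weights are random placeholders and the dimensions are toy-sized, not the paper's hyperparameters; the MLP uses ReLU for brevity where the paper uses GELU:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H_heads = 196, 64, 4        # patches, embed dim, attention heads

def layernorm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

patch_emb = rng.standard_normal((N, D))          # output of patch projection
cls = rng.standard_normal((1, D))                # learnable CLS token
x = np.concatenate([cls, patch_emb], axis=0)     # (197, D)
x = x + rng.standard_normal((N + 1, D)) * 0.02   # learnable 1D pos. embeddings

# --- one pre-LN block: x + MSA(LN(x)), then x + MLP(LN(x)) ---
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
h = layernorm(x)
q, k, v = h @ Wq, h @ Wk, h @ Wv
dh = D // H_heads
# split into heads: (heads, tokens, dh)
q, k, v = (t.reshape(N + 1, H_heads, dh).swapaxes(0, 1) for t in (q, k, v))
attn = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(dh)) @ v
attn = attn.swapaxes(0, 1).reshape(N + 1, D)     # merge heads back
x = x + attn @ Wo

W1 = rng.standard_normal((D, 4 * D)) * 0.02      # MLP with 4x hidden width
W2 = rng.standard_normal((4 * D, D)) * 0.02
x = x + np.maximum(layernorm(x) @ W1, 0) @ W2    # ReLU stand-in for GELU

cls_out = x[0]                                   # image representation
print(cls_out.shape)  # (64,)
```

The CLS token's final-layer output (`cls_out`) is what feeds the classification head; in the full model this block is simply stacked 12/24/32 times with a final LayerNorm on top.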
Connections
- attention-is-all-you-need — reuses the standard Transformer encoder essentially unmodified
- bert-pre-training-of-deep-bidirectional-transformers — CLS token design borrowed directly
- clip-learning-transferable-visual-models — CLIP’s image encoder is a ViT; ViT enabled CLIP
- vision-transformer — the architecture this paper introduces
- patch-embeddings — how images are sliced and projected into token sequences
- attention — self-attention is the core computation replacing all convolutions
- transformer — standard encoder architecture applied directly to image patches
- transfer-learning — pre-train at large scale, fine-tune on downstream tasks
- classification-token — CLS token prepended to patch sequence, borrowed from BERT
- inductive-bias — ViT deliberately removes CNN inductive biases (locality, equivariance)
Citation
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.