Summary
ViT (Vision Transformer) demonstrates that a pure transformer architecture applied directly to sequences of image patches can match or exceed state-of-the-art CNNs on image classification, when pre-trained at sufficient scale. The key insight: divide a 224×224 image into 16×16 patches (14 × 14 = 196 of them), flatten each patch, project it to an embedding vector, and treat the resulting sequence exactly like word tokens. No convolutions, no local pooling, just standard transformer self-attention. The catch: ViT has none of a CNN's built-in vision inductive biases (locality, translation equivariance), so it needs large-scale pre-training (14M–300M images) to learn them from data. Pre-trained on JFT-300M, ViT-H/14 achieves 88.55% on ImageNet while using roughly 4× less pre-training compute than the best ResNet baselines.
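The patch-to-token step above can be sketched in a few lines of numpy. The dimensions (224×224 input, 16×16 patches, D = 768) match the paper's base setting, but the projection matrix here is a random placeholder standing in for the trained linear layer:

```python
import numpy as np

H = W = 224          # input resolution
P = 16               # patch size
C = 3                # channels
D = 768              # embedding dimension (ViT-Base)

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Slice into non-overlapping P x P patches: (14, 14, 16, 16, 3)
patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
n_patches = (H // P) * (W // P)                  # 14 * 14 = 196

# Flatten each patch to P^2 * C = 768 values, then project to D dims
# (random matrix here; a trainable linear layer in the real model).
tokens = patches.reshape(n_patches, P * P * C)   # (196, 768)
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
embeddings = tokens @ W_proj                     # (196, 768)

print(embeddings.shape)  # (196, 768)
```

That P²·C = 768 happens to equal D for ViT-Base is a coincidence of this configuration; the projection is still needed, since other model sizes use different D.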
Key Claims
- A standard Transformer applied to sequences of image patches achieves excellent results on image classification with less compute than CNNs, when pre-trained at scale
- ViT underperforms CNNs when trained only on mid-sized datasets (e.g. ImageNet-1k, 1.3M images) because it lacks their convolutional inductive biases
- “Large scale training trumps inductive bias” — JFT-300M pre-training makes ViT competitive and eventually superior
- ViT-H/14: 88.55% ImageNet, 94.55% CIFAR-100, 77.63% VTAB (19 tasks) — all SOTA at time of publication
- ViT needs 2–4× less pre-training compute than comparable ResNets to reach the same performance when pre-trained on JFT-300M
- Position embeddings do not need to be 2D-aware; 1D learnable embeddings suffice (verified empirically)
Methods
- Patch Embedding: Slice the image into N non-overlapping P×P patches; flatten each to P²·C values (C = channels); project to D dimensions via a trainable linear layer
- CLS Token: Prepend a learnable classification token to the patch sequence (from BERT); its final-layer output serves as the image representation
- Position Embeddings: Add learnable 1D position embeddings to each token (including CLS); 2D-interpolated when fine-tuning at a higher resolution
- Transformer Encoder: Standard alternating multi-head self-attention + MLP blocks with LayerNorm pre-normalization and residual connections
- Classification Head: MLP with one hidden layer during pre-training; single linear layer during fine-tuning
- Scale: ViT-Base (86M params, 12 layers), ViT-Large (307M, 24 layers), ViT-Huge (632M, 32 layers); patch sizes /32, /16, or /14 (the suffix is the patch side in pixels, e.g. ViT-L/16)
- Pre-training: ImageNet-1k (1.3M images), ImageNet-21k (14M), JFT-300M (300M); Adam optimizer, batch size 4096, high weight decay of 0.1
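The components above (CLS token, 1D position embeddings, one pre-LN encoder block) can be sketched together in numpy. All weights are random placeholders and the dimensions are toy-sized, not the paper's hyperparameters; the MLP uses ReLU for brevity where the paper uses GELU:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H_heads = 196, 64, 4        # patches, embed dim, attention heads

def layernorm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

patch_emb = rng.standard_normal((N, D))          # output of patch projection
cls = rng.standard_normal((1, D))                # learnable CLS token
x = np.concatenate([cls, patch_emb], axis=0)     # (197, D)
x = x + rng.standard_normal((N + 1, D)) * 0.02   # learnable 1D pos. embeddings

# --- one pre-LN block: x + MSA(LN(x)), then x + MLP(LN(x)) ---
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
h = layernorm(x)
q, k, v = h @ Wq, h @ Wk, h @ Wv
dh = D // H_heads
# split into heads: (heads, tokens, dh)
q, k, v = (t.reshape(N + 1, H_heads, dh).swapaxes(0, 1) for t in (q, k, v))
attn = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(dh)) @ v
attn = attn.swapaxes(0, 1).reshape(N + 1, D)     # merge heads back
x = x + attn @ Wo

W1 = rng.standard_normal((D, 4 * D)) * 0.02      # MLP with 4x hidden width
W2 = rng.standard_normal((4 * D, D)) * 0.02
x = x + np.maximum(layernorm(x) @ W1, 0) @ W2    # ReLU stand-in for GELU

cls_out = x[0]                                   # image representation
print(cls_out.shape)  # (64,)
```

The CLS token's final-layer output (`cls_out`) is what feeds the classification head; in the full model this block is simply stacked 12/24/32 times with a final LayerNorm on top.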
Connections
- attention-is-all-you-need — reuses the standard Transformer encoder essentially unmodified
- bert-pre-training-of-deep-bidirectional-transformers — CLS token design borrowed directly
- clip-learning-transferable-visual-models — CLIP’s image encoder is a ViT; ViT enabled CLIP
- vision-transformer — the architecture this paper introduces
- patch-embeddings — how images are sliced and projected into token sequences
- attention — self-attention is the core computation replacing all convolutions
- transformer — standard encoder architecture applied directly to image patches
- transfer-learning — pre-train at large scale, fine-tune on downstream tasks
- classification-token — CLS token prepended to patch sequence, borrowed from BERT
- inductive-bias — ViT deliberately removes CNN inductive biases (locality, equivariance)
Citation
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.