Summary

ViT (Vision Transformer) demonstrates that a pure transformer architecture applied directly to sequences of image patches can match or exceed state-of-the-art CNNs on image classification — when pre-trained at sufficient scale. The key insight: divide a 224×224 image into 16×16 patches, flatten each patch, project to an embedding vector, and treat the resulting sequence exactly like word tokens. No convolutions, no local pooling — just standard transformer self-attention. The catch: ViT has no built-in vision inductive biases (locality, translation equivariance), so it needs large-scale pre-training (14M–300M images) to learn these from data. Pre-trained on JFT-300M, ViT-H/14 achieves 88.55% on ImageNet while using 4× less compute than the best ResNet baselines.
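The patchify-flatten-project step described above can be sketched in a few lines of NumPy. The projection matrix `E` here is random, standing in for the trained linear layer, and the shapes assume the paper's 224×224 / P=16 setting:

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into non-overlapping P x P patches,
    each flattened to a vector of P*P*C values."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    nh, nw = h // patch, w // patch
    x = img.reshape(nh, patch, nw, patch, c)      # carve the patch grid
    x = x.transpose(0, 2, 1, 3, 4)                # group the two patch axes together
    return x.reshape(nh * nw, patch * patch * c)  # (N, P*P*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
tokens = patchify(img)                       # (196, 768): 14*14 patches, 16*16*3 values each
E = rng.standard_normal((768, 768)) * 0.02   # stand-in for the trained projection
patch_embeddings = tokens @ E                # (196, 768): the "word" sequence fed to the transformer
```

From here on the sequence is treated exactly like token embeddings in NLP.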

Key Claims

  • A standard Transformer applied to sequences of image patches achieves excellent results on image classification with less compute than CNNs, when pre-trained at scale
  • ViT underperforms CNNs when trained on small datasets (e.g. ImageNet-only, 1.3M images) due to lack of inductive bias
  • “Large scale training trumps inductive bias” — JFT-300M pre-training makes ViT competitive and eventually superior
  • ViT-H/14: 88.55% ImageNet, 94.55% CIFAR-100, 77.63% VTAB (19 tasks) — all SOTA at time of publication
  • ViT uses 2–4× less compute to attain the same performance as comparable ResNets on JFT-300M
  • Position embeddings do not need to be 2D-aware; 1D learnable embeddings suffice (verified empirically)

Methods

  1. Patch Embedding: Slice image into N = HW/P² non-overlapping P×P patches (224×224 image, P=16 → N=196); flatten each to a P²·C-dimensional vector; project to D dimensions via trainable linear layer
  2. CLS Token: Prepend a learnable classification token to the patch sequence (from BERT); its final-layer output serves as the image representation
  3. Position Embeddings: Add learnable 1D position embeddings to each token (including CLS); interpolated at higher fine-tuning resolutions
  4. Transformer Encoder: Standard alternating multi-head self-attention + MLP blocks with LayerNorm pre-normalization and residual connections
  5. Classification Head: MLP with one hidden layer during pre-training; single linear layer during fine-tuning
  6. Scale: ViT-Base (86M params, 12 layers), ViT-Large (307M, 24 layers), ViT-Huge (632M, 32 layers); patch sizes /16 or /14
  7. Pre-training: ImageNet-1k (1.3M), ImageNet-21k (14M), JFT-300M (300M images); Adam optimizer, batch 4096, high weight decay 0.1
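Steps 2–4 above can be sketched end to end in NumPy. This is a toy single-head block with ReLU and random weights for brevity (the paper uses multi-head attention and GELU), but the pre-norm/residual wiring and the CLS + 1D position-embedding assembly match the method:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    m, v = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-norm transformer block (step 4): LN -> attention -> residual,
    then LN -> MLP -> residual. Single head, ReLU instead of GELU."""
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    x = x + (attn @ v) @ Wo
    h = layer_norm(x)
    return x + np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
N, D = 196, 64                           # toy width; ViT-Base uses D=768
patch_emb = rng.standard_normal((N, D))
cls = rng.standard_normal(D)             # learnable CLS token (step 2)
pos = rng.standard_normal((N + 1, D))    # learnable 1D position embeddings (step 3)
seq = np.concatenate([cls[None, :], patch_emb]) + pos   # (N+1, D)

Wq, Wk, Wv, Wo = [rng.standard_normal((D, D)) * 0.05 for _ in range(4)]
W1 = rng.standard_normal((D, 4 * D)) * 0.05
W2 = rng.standard_normal((4 * D, D)) * 0.05
out = encoder_block(seq, Wq, Wk, Wv, Wo, W1, W2)
image_rep = out[0]                       # final CLS output = image representation (step 2)
```

In the full model this block is stacked 12–32 times (step 6) and `image_rep` feeds the classification head (step 5).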

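A quick sanity check on the model sizes in step 6: counting only the large weight matrices (Q/K/V/output projections, the two MLP layers with the standard 4× hidden ratio, and the patch projection), the quoted parameter counts fall out of the architecture hyperparameters. The helper below is illustrative, not the paper's accounting; it ignores biases, LayerNorm, position embeddings, and the head:

```python
def vit_weight_params(depth, d, mlp_ratio=4, patch=16, channels=3):
    """Approximate weight count: patch projection + per-block attention and MLP matrices."""
    patch_proj = patch * patch * channels * d
    attn = 4 * d * d                   # Q, K, V, and output projections
    mlp = 2 * d * (mlp_ratio * d)      # two linear layers
    return patch_proj + depth * (attn + mlp)

base = vit_weight_params(12, 768)               # ~85M  (quoted: 86M)
large = vit_weight_params(24, 1024)             # ~303M (quoted: 307M)
huge = vit_weight_params(32, 1280, patch=14)    # ~630M (quoted: 632M)
```

The small gaps to the quoted numbers are exactly the omitted biases, norms, and embeddings.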
Connections

Citation

arXiv:2010.11929

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.