What It Is

A Vision Transformer (ViT) applies the Transformer architecture directly to images by splitting an image into fixed-size patches, embedding each patch as a token, and running self-attention over the sequence. No convolutions needed.

Why It Matters

Before ViT (2020), CNNs dominated computer vision. ViT showed that Transformers — given enough data — can match or beat CNNs on image classification. It unified the architecture for vision and language, enabling multimodal models (CLIP, Flamingo, LLaVA) to use the same backbone for both modalities.

How It Works

  1. Divide image into N non-overlapping patches (e.g. 16×16 pixels each).
  2. Linearly project each patch to a d-dimensional embedding — this is the “token.”
  3. Prepend a [CLS] token (classification summary).
  4. Add learnable position embeddings (patches have no inherent order for the Transformer).
  5. Run standard multi-head self-attention over the sequence of N+1 tokens.
  6. Use [CLS] output for classification, or all patch outputs for dense tasks.

Image (224×224)
   │
   ▼ split into 14×14 = 196 patches (16×16 px each)
   │
   ▼ linear projection → 196 token embeddings
   │
[CLS] + patch_1 + patch_2 + ... + patch_196
   │
   ▼ Transformer encoder (standard)
   │
[CLS] output → classification head
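
The six steps above can be sketched end to end in plain numpy. This is a minimal illustration, not a trained model: all weights are random, the sizes (224×224 image, 16×16 patches, d=64) follow the diagram, and a real ViT stacks many multi-head attention + MLP blocks with LayerNorm where this shows a single attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the diagram: 224x224 RGB image, 16x16 patches, embedding dim d.
H = W = 224; P = 16; d = 64
n = (H // P) * (W // P)                 # 14*14 = 196 patches

img = rng.standard_normal((H, W, 3))

# Steps 1-2: split into non-overlapping patches, flatten, project linearly.
patches = img.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(n, P * P * 3)  # (196, 768) flattened patches
W_proj = rng.standard_normal((P * P * 3, d)) * 0.02
tokens = patches @ W_proj                # (196, 64) token embeddings

# Step 3: prepend a (normally learnable) [CLS] token.
cls = rng.standard_normal((1, d)) * 0.02
x = np.concatenate([cls, tokens], axis=0)   # (197, 64)

# Step 4: add (normally learnable) position embeddings.
pos = rng.standard_normal((n + 1, d)) * 0.02
x = x + pos

# Step 5: one single-head self-attention layer as a stand-in for the encoder.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)    # softmax over all 197 tokens
out = attn @ v                               # (197, 64)

# Step 6: the [CLS] output would feed a classification head.
cls_out = out[0]
print(x.shape, out.shape, cls_out.shape)
```

Note that the "split into patches" step is just a reshape/transpose; in practice it is implemented as a single strided convolution with kernel size = stride = patch size, which computes exactly the same linear projection.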

What’s Clever

CNNs bake in locality and translation equivariance as inductive biases. ViT removes those biases — attention can relate any patch to any other patch regardless of distance. This hurts on small datasets (the CNN priors compensate for limited data), but wins at large scale, where the data itself teaches the structure CNNs assumed.

The patch embedding is essentially a learned dictionary of visual “words.” The attention mechanism then writes sentences over those words.
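
One way to see the "visual words" metaphor: the projection is deterministic, so two patches with identical pixels map to the identical token, while unrelated patches land in different directions of the embedding space. A toy demo with a random (hypothetical, untrained) projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
P, d = 16, 64

# Hypothetical untrained projection; in a real ViT this is learned.
W_proj = rng.standard_normal((P * P * 3, d)) * 0.02

patch_a = rng.standard_normal(P * P * 3)     # some patch
patch_b = patch_a.copy()                     # visually identical patch
patch_c = rng.standard_normal(P * P * 3)     # unrelated patch

tok_a, tok_b, tok_c = (p @ W_proj for p in (patch_a, patch_b, patch_c))

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(tok_a, tok_b))   # ~1.0: same pixels, same "word"
print(cos(tok_a, tok_c))   # typically much smaller: different "words"
```

Training shapes `W_proj` so that perceptually similar patches (edges, textures, colors) cluster, which is what makes attention over these tokens behave like composing sentences from a vocabulary.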

Key Sources