What It Is

A Vision Transformer (ViT) applies the Transformer architecture directly to images by splitting an image into fixed-size patches, embedding each patch as a token, and running self-attention over the sequence. No convolutions needed.

Why It Matters

Before ViT (2020), CNNs dominated computer vision. ViT showed that Transformers — given enough data — can match or beat CNNs on image classification. It unified the architecture for vision and language, enabling multimodal models (CLIP, Flamingo, LLaVA) to use the same backbone for both modalities.

How It Works

  1. Divide image into N non-overlapping patches (e.g. 16×16 pixels each).
  2. Linearly project each patch to a d-dimensional embedding — this is the “token.”
  3. Prepend a [CLS] token (classification summary).
  4. Add learnable position embeddings (patches have no inherent order for the Transformer).
  5. Run standard multi-head self-attention over the sequence of N+1 tokens.
  6. Use [CLS] output for classification, or all patch outputs for dense tasks.

Image (224×224)
   │
   ▼ split into 14×14 = 196 patches (16×16 px each)
   │
   ▼ linear projection → 196 token embeddings
   │
[CLS] + patch_1 + patch_2 + ... + patch_196
   │
   ▼ Transformer encoder (standard)
   │
[CLS] output → classification head
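
The six steps above can be sketched end to end in plain numpy. This is a minimal illustration, not a trained model: all weights are random, the sizes (224×224 image, 16×16 patches, d=64) follow the diagram, and a real ViT stacks many multi-head attention + MLP blocks with LayerNorm where this shows a single attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the diagram: 224x224 RGB image, 16x16 patches, embedding dim d.
H = W = 224; P = 16; d = 64
n = (H // P) * (W // P)                 # 14*14 = 196 patches

img = rng.standard_normal((H, W, 3))

# Steps 1-2: split into non-overlapping patches, flatten, project linearly.
patches = img.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(n, P * P * 3)  # (196, 768) flattened patches
W_proj = rng.standard_normal((P * P * 3, d)) * 0.02
tokens = patches @ W_proj                # (196, 64) token embeddings

# Step 3: prepend a (normally learnable) [CLS] token.
cls = rng.standard_normal((1, d)) * 0.02
x = np.concatenate([cls, tokens], axis=0)   # (197, 64)

# Step 4: add (normally learnable) position embeddings.
pos = rng.standard_normal((n + 1, d)) * 0.02
x = x + pos

# Step 5: one single-head self-attention layer as a stand-in for the encoder.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)    # softmax over all 197 tokens
out = attn @ v                               # (197, 64)

# Step 6: the [CLS] output would feed a classification head.
cls_out = out[0]
print(x.shape, out.shape, cls_out.shape)
```

Note that the "split into patches" step is just a reshape/transpose; in practice it is implemented as a single strided convolution with kernel size = stride = patch size, which computes exactly the same linear projection.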

What’s Clever

CNNs bake in locality and translation equivariance as inductive biases. ViT removes those biases — attention can relate any patch to any other patch regardless of distance. This hurts on small datasets (the CNN priors compensate for limited data), but wins at large scale, where the data itself teaches the structure CNNs assumed.

The patch embedding is essentially a learned dictionary of visual “words.” The attention mechanism then writes sentences over those words.
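
One way to see the "visual words" metaphor: the projection is deterministic, so two patches with identical pixels map to the identical token, while unrelated patches land in different directions of the embedding space. A toy demo with a random (hypothetical, untrained) projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
P, d = 16, 64

# Hypothetical untrained projection; in a real ViT this is learned.
W_proj = rng.standard_normal((P * P * 3, d)) * 0.02

patch_a = rng.standard_normal(P * P * 3)     # some patch
patch_b = patch_a.copy()                     # visually identical patch
patch_c = rng.standard_normal(P * P * 3)     # unrelated patch

tok_a, tok_b, tok_c = (p @ W_proj for p in (patch_a, patch_b, patch_c))

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(tok_a, tok_b))   # ~1.0: same pixels, same "word"
print(cos(tok_a, tok_c))   # typically much smaller: different "words"
```

Training shapes `W_proj` so that perceptually similar patches (edges, textures, colors) cluster, which is what makes attention over these tokens behave like composing sentences from a vocabulary.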

Key Sources