What It Is
Patch embeddings convert a 2D image into a sequence of fixed-size vector tokens that a Transformer can process exactly like word tokens. A 224×224 image is divided into non-overlapping 16×16 patches, each patch is flattened into a 768-dimensional vector, and a learned linear projection maps it to the model’s embedding dimension D. The result is a sequence of N = (224/16)² = 196 “patch tokens” — the visual equivalent of word embeddings. They were introduced by ViT (Dosovitskiy et al., 2020) as the minimal bridge between images and standard NLP Transformers.
Why It Matters
Patch embeddings are the design choice that enabled NLP Transformers to operate on visual data without modification. Every architectural innovation in Transformer-based NLP — multi-head attention, positional encodings, residual connections, LayerNorm, pre-training at scale — transfers directly to vision when the input is represented as patch tokens. Without patch embeddings, applying NLP Transformers to images would require architectural surgery; with them, the architectures are identical.
This matters practically: ViT trained on JFT-300M outperforms ResNet-based models at scale, even without convolutions. The patch embedding design is why vision and language can share encoder architectures in modern VLMs (CLIP, LLaVA, Flamingo).
How It Works
Step-by-Step
- Divide: Take a 224×224 RGB image. Divide into (224/16)² = 196 non-overlapping 16×16 pixel patches.
- Flatten: Each patch is 16 × 16 × 3 = 768 pixel values. Flatten to a vector in ℝ^768.
- Project: Multiply by a learned weight matrix E ∈ ℝ^(768 × D) to get a D-dimensional patch embedding, where D is the model’s embedding dimension (768 for ViT-Base, 1024 for ViT-Large).
- Add position: Add a learnable positional embedding to inject spatial information (each of the 196 patch positions gets a unique learned vector).
- Prepend CLS: Prepend a learnable [CLS] token whose final state is used for classification.
224×224 image
│
▼ divide into 16×16 patches
[p₁][p₂][p₃] ... [p₁₉₆] 196 patches
│
▼ flatten each patch
p₁ ∈ ℝ^768 (16×16×3 pixel values)
│
▼ linear projection: E ∈ ℝ^(768×D)
z₁ = p₁ · E ∈ ℝ^D
│
▼ add positional embedding
z₁ + pos_embed[1]
│
▼ prepend [CLS] token
[CLS, z₁, z₂, ..., z₁₉₆] → sequence of 197 tokens into Transformer
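The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the projection matrix, positional embeddings, and [CLS] token are random stand-ins for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224          # input resolution
P = 16               # patch size
C = 3                # RGB channels
D = 768              # embedding dimension (ViT-Base)
N = (H // P) ** 2    # 196 patch tokens

image = rng.standard_normal((H, W, C))

# Divide + flatten: reshape into a 14×14 grid of 16×16 patches,
# then flatten each patch to a 768-vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (196, 768)

# Project: learned linear map E ∈ ℝ^(768×D) (random here).
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                             # (196, D)

# Add position: one learned vector per patch position (random here).
pos_embed = rng.standard_normal((N, D)) * 0.02
tokens = tokens + pos_embed

# Prepend [CLS]: a learnable token whose final state is used for classification.
cls_token = rng.standard_normal((1, D)) * 0.02
sequence = np.concatenate([cls_token, tokens], axis=0)

print(sequence.shape)  # (197, 768) — 196 patch tokens + 1 CLS token
```

Note that in the reference ViT implementation the [CLS] token also receives its own positional embedding (197 position vectors total); the ordering here simply follows the steps above.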
The Projection as a Convolution
The linear projection E can equivalently be implemented as a 2D convolution with kernel size 16, stride 16, and D output channels. This is how most implementations do it — nn.Conv2d(3, D, kernel_size=16, stride=16). The two formulations are mathematically identical, but the convolution notation makes clear that the patches don’t overlap (stride equals kernel size).
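The equivalence can be checked numerically. Below is a NumPy sketch using a small embedding dimension (D = 8) and a 64×64 input for brevity; the explicit loop plays the role of the stride-16 convolution, and the flattened-kernel matrix multiply plays the role of the linear projection.

```python
import numpy as np

rng = np.random.default_rng(1)
P, C, D = 16, 3, 8                        # patch size, channels, small demo dim
img = rng.standard_normal((C, 64, 64))    # CHW layout, 4×4 = 16 patches

# Conv kernel as in nn.Conv2d(3, D, kernel_size=16, stride=16).
kernel = rng.standard_normal((D, C, P, P))

# "Convolution": slide the kernel with stride P — windows never overlap.
n = 64 // P
conv_out = np.empty((D, n, n))
for i in range(n):
    for j in range(n):
        window = img[:, i*P:(i+1)*P, j*P:(j+1)*P]
        conv_out[:, i, j] = np.tensordot(kernel, window, axes=3)

# Equivalent linear projection: flatten each patch and multiply by E.
E = kernel.reshape(D, C * P * P).T                                  # (768, D)
patches = img.reshape(C, n, P, n, P).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(n * n, C * P * P)                         # (16, 768)
linear_out = patches @ E                                            # (16, D)

# Same numbers, two notations.
assert np.allclose(conv_out.transpose(1, 2, 0).reshape(n * n, D), linear_out)
```

The key detail is that both paths must flatten the patch in the same channel-first order, which is exactly what `reshape` on the conv kernel enforces.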
Patch Size Tradeoff
| Patch size | Sequence length (224px input) | Computation | Resolution of representation |
|---|---|---|---|
| 32×32 | 49 tokens | Fast | Coarse |
| 16×16 | 196 tokens | Moderate | Medium (standard ViT) |
| 8×8 | 784 tokens | Slow | Fine-grained |
| 4×4 | 3136 tokens | Very slow | Pixel-level |
Smaller patches: more tokens, finer spatial resolution, more compute (O(n²) attention). Larger patches: fewer tokens, faster, but coarser — misses fine-grained details needed for detection, segmentation, and dense prediction tasks.
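The table’s token counts, and the quadratic attention cost behind the tradeoff, fall out of a one-line formula:

```python
# Token count for a square image split into non-overlapping square patches.
def seq_len(image_size=224, patch_size=16):
    return (image_size // patch_size) ** 2

for p in (32, 16, 8, 4):
    n = seq_len(patch_size=p)
    # Self-attention compares every token with every token: n² pairs per layer.
    print(f"patch {p:>2}×{p:<2} -> {n:>4} tokens, {n*n:>10,} attention pairs")
```

Halving the patch size quadruples the token count and multiplies the attention cost by sixteen, which is why 8×8 and 4×4 patches are rarely used at full resolution.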
No Convolution = No Locality Inductive Bias
A CNN’s convolution kernel processes nearby pixels together, baking in the assumption that local features matter. Patch embeddings throw this away: from the very first attention layer, any patch token can in principle attend to any other patch token. This is why ViT needs much more data than CNNs to work well: it must learn locality from data rather than having it built in.
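A back-of-envelope comparison makes the contrast concrete (assuming 3×3 kernels with stride 1 for the CNN, a common configuration):

```python
# Receptive field of a stack of k×k, stride-1 convolutions grows linearly:
# each layer adds (k - 1) pixels of context in each direction.
def cnn_receptive_field(num_layers, kernel=3):
    return 1 + num_layers * (kernel - 1)

print(cnn_receptive_field(1))   # one conv layer: 3-pixel receptive field
print(cnn_receptive_field(12))  # twelve layers: still only 25 pixels

# A ViT attention layer has no such constraint: each of the 196 patch tokens
# can attend to all 196 patches — global context from layer one.
```

The CNN earns data efficiency from this slow, local growth; the ViT trades it away for immediate global context.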
ViT trained on ImageNet alone (1.28M images) underperforms comparable CNNs. Trained on JFT-300M (300M images), it outperforms them. The inductive bias trade-off: CNNs generalize from less data but hit a ceiling; ViT has lower data efficiency but higher ceiling at scale.
What’s Clever
The patch embedding is deliberately minimal: no convolutions, no pooling, no local receptive fields. Just flatten and project. This choice, which might seem naive, is what makes the architecture general: the same patch embedding applied to 16×16 image patches works identically whether the downstream task is ImageNet classification, COCO detection, or multimodal image-text alignment. There’s nothing image-specific in the architecture beyond the patch splitting.
The non-obvious consequence: ViT’s position embeddings learn 2D structure despite being trained as a flat 1D sequence of vectors with no built-in notion of rows or columns. Visualizing the learned position embeddings reveals a clear grid structure — patches at nearby spatial positions have similar embeddings. The model learned 2D spatial relationships from the task, not from architectural constraints.
Key Sources
- an-image-is-worth-16x16-words — introduces patch embeddings; ablations over patch size, position embedding variants, and dataset scale requirements
Related Concepts
- vision-transformer — the full model that uses patch embeddings as input
- classification-token — the [CLS] token prepended to the patch sequence
- attention — the mechanism that processes patch token sequences
- positional-encoding — position embeddings added to patch embeddings
- transfer-learning — ViT’s patch embeddings enable transfer to diverse vision tasks
- inductive-bias — lack of local inductive bias is patch embedding’s key trade-off
Open Questions
- What is the optimal patch size for variable-resolution inputs (e.g., high-resolution medical images)?
- Can overlapping patches (smaller stride) improve performance enough to justify the compute cost?
- Do hierarchical patch embeddings (Swin Transformer’s approach: smaller patches merged progressively) outperform flat 16×16 ViT at scale?