What It Is
Patch embeddings convert a 2D image into a sequence of fixed-size vector tokens that a Transformer can process exactly like word tokens. A 224×224 image is divided into non-overlapping 16×16 patches, each patch is flattened into a 768-dimensional vector, and a learned linear projection maps it to the model’s embedding dimension D. The result is a sequence of N = (224/16)² = 196 “patch tokens” — the visual equivalent of word embeddings. They were introduced by ViT (Dosovitskiy et al., 2020) as the minimal bridge between images and standard NLP Transformers.
Why It Matters
Patch embeddings are the design choice that enabled NLP Transformers to operate on visual data without modification. Every architectural innovation in Transformer-based NLP — multi-head attention, positional encodings, residual connections, LayerNorm, pre-training at scale — transfers directly to vision when the input is represented as patch tokens. Without patch embeddings, applying NLP Transformers to images would require architectural surgery; with them, the architectures are identical.
This matters practically: ViT trained on JFT-300M outperforms ResNet-based models at scale, even without convolutions. The patch embedding design is why vision and language can share encoder architectures in modern VLMs (CLIP, LLaVA, Flamingo).
How It Works
Step-by-Step
- Divide: Take a 224×224 RGB image. Divide into (224/16)² = 196 non-overlapping 16×16 pixel patches.
- Flatten: Each patch is 16 × 16 × 3 = 768 pixel values. Flatten to a vector in ℝ^768.
- Project: Multiply by a learned weight matrix E ∈ ℝ^(768 × D) to get a D-dimensional patch embedding, where D is the model’s embedding dimension (768 for ViT-Base, 1024 for ViT-Large).
- Add position: Add a learnable positional embedding to inject spatial information (each of the 196 patch positions gets a unique learned vector).
- Prepend CLS: Prepend a learnable [CLS] token whose final state is used for classification.
224×224 image
│
▼ divide into 16×16 patches
[p₁][p₂][p₃] ... [p₁₉₆] 196 patches
│
▼ flatten each patch
p₁ ∈ ℝ^768 (16×16×3 pixel values)
│
▼ linear projection: E ∈ ℝ^(768×D)
z₁ = p₁ · E ∈ ℝ^D
│
▼ add positional embedding
z₁ + pos_embed[1]
│
▼ prepend [CLS] token
[CLS, z₁, z₂, ..., z₁₉₆] → sequence of 197 tokens into Transformer
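The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the projection matrix, positional embeddings, and [CLS] token are random stand-ins for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224          # input resolution
P = 16               # patch size
C = 3                # RGB channels
D = 768              # embedding dimension (ViT-Base)
N = (H // P) ** 2    # 196 patch tokens

image = rng.standard_normal((H, W, C))

# Divide + flatten: reshape into a 14×14 grid of 16×16 patches,
# then flatten each patch to a 768-vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (196, 768)

# Project: learned linear map E ∈ ℝ^(768×D) (random here).
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                             # (196, D)

# Add position: one learned vector per patch position (random here).
pos_embed = rng.standard_normal((N, D)) * 0.02
tokens = tokens + pos_embed

# Prepend [CLS]: a learnable token whose final state is used for classification.
cls_token = rng.standard_normal((1, D)) * 0.02
sequence = np.concatenate([cls_token, tokens], axis=0)

print(sequence.shape)  # (197, 768) — 196 patch tokens + 1 CLS token
```

Note that in the reference ViT implementation the [CLS] token also receives its own positional embedding (197 position vectors total); the ordering here simply follows the steps above.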
The Projection as a Convolution
The linear projection E can equivalently be implemented as a 2D convolution with kernel size 16, stride 16, and D output channels. This is how most implementations do it — nn.Conv2d(3, D, kernel_size=16, stride=16). The two formulations are mathematically identical, but the convolution notation makes clear that the patches don’t overlap (stride equals kernel size).
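The equivalence can be checked numerically. Below is a NumPy sketch using a small embedding dimension (D = 8) and a 64×64 input for brevity; the explicit loop plays the role of the stride-16 convolution, and the flattened-kernel matrix multiply plays the role of the linear projection.

```python
import numpy as np

rng = np.random.default_rng(1)
P, C, D = 16, 3, 8                        # patch size, channels, small demo dim
img = rng.standard_normal((C, 64, 64))    # CHW layout, 4×4 = 16 patches

# Conv kernel as in nn.Conv2d(3, D, kernel_size=16, stride=16).
kernel = rng.standard_normal((D, C, P, P))

# "Convolution": slide the kernel with stride P — windows never overlap.
n = 64 // P
conv_out = np.empty((D, n, n))
for i in range(n):
    for j in range(n):
        window = img[:, i*P:(i+1)*P, j*P:(j+1)*P]
        conv_out[:, i, j] = np.tensordot(kernel, window, axes=3)

# Equivalent linear projection: flatten each patch and multiply by E.
E = kernel.reshape(D, C * P * P).T                                  # (768, D)
patches = img.reshape(C, n, P, n, P).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(n * n, C * P * P)                         # (16, 768)
linear_out = patches @ E                                            # (16, D)

# Same numbers, two notations.
assert np.allclose(conv_out.transpose(1, 2, 0).reshape(n * n, D), linear_out)
```

The key detail is that both paths must flatten the patch in the same channel-first order, which is exactly what `reshape` on the conv kernel enforces.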
Patch Size Tradeoff
| Patch size | Sequence length (224px input) | Computation | Resolution of representation |
|---|---|---|---|
| 32×32 | 49 tokens | Fast | Coarse |
| 16×16 | 196 tokens | Moderate | Medium (standard ViT) |
| 8×8 | 784 tokens | Slow | Fine-grained |
| 4×4 | 3136 tokens | Very slow | Pixel-level |
Smaller patches: more tokens, finer spatial resolution, more compute (O(n²) attention). Larger patches: fewer tokens, faster, but coarser — misses fine-grained details needed for detection, segmentation, and dense prediction tasks.
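The table’s token counts, and the quadratic attention cost behind the tradeoff, fall out of a one-line formula:

```python
# Token count for a square image split into non-overlapping square patches.
def seq_len(image_size=224, patch_size=16):
    return (image_size // patch_size) ** 2

for p in (32, 16, 8, 4):
    n = seq_len(patch_size=p)
    # Self-attention compares every token with every token: n² pairs per layer.
    print(f"patch {p:>2}×{p:<2} -> {n:>4} tokens, {n*n:>10,} attention pairs")
```

Halving the patch size quadruples the token count and multiplies the attention cost by sixteen, which is why 8×8 and 4×4 patches are rarely used at full resolution.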
No Convolution = No Locality Inductive Bias
A CNN’s convolution kernel processes nearby pixels together, baking in the assumption that local features matter. Patch embeddings throw this away: from the very first attention layer, any patch token can in principle attend to any other patch token. This is why ViT needs much more data than CNNs to work well: it must learn locality from data rather than having it built in.
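A back-of-envelope comparison makes the contrast concrete (assuming 3×3 kernels with stride 1 for the CNN, a common configuration):

```python
# Receptive field of a stack of k×k, stride-1 convolutions grows linearly:
# each layer adds (k - 1) pixels of context in each direction.
def cnn_receptive_field(num_layers, kernel=3):
    return 1 + num_layers * (kernel - 1)

print(cnn_receptive_field(1))   # one conv layer: 3-pixel receptive field
print(cnn_receptive_field(12))  # twelve layers: still only 25 pixels

# A ViT attention layer has no such constraint: each of the 196 patch tokens
# can attend to all 196 patches — global context from layer one.
```

The CNN earns data efficiency from this slow, local growth; the ViT trades it away for immediate global context.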
ViT trained on ImageNet alone (1.28M images) underperforms comparable CNNs. Trained on JFT-300M (300M images), it outperforms them. The inductive bias trade-off: CNNs generalize from less data but hit a ceiling; ViT has lower data efficiency but higher ceiling at scale.
What’s Clever
The patch embedding is deliberately minimal: no convolutions, no pooling, no local receptive fields. Just flatten and project. This choice, which might seem naive, is what makes the architecture general: the same patch embedding applied to 16×16 image patches works identically whether the downstream task is ImageNet classification, COCO detection, or multimodal image-text alignment. There’s nothing image-specific in the architecture beyond the patch splitting.
The non-obvious consequence: ViT’s position embeddings learn 2D structure despite being trained as a flat 1D sequence of vectors with no built-in notion of rows or columns. Visualizing the learned position embeddings reveals a clear grid structure — patches at nearby spatial positions have similar embeddings. The model learned 2D spatial relationships from the task, not from architectural constraints.
Key Sources
- an-image-is-worth-16x16-words — introduces patch embeddings; ablations over patch size, position embedding variants, and dataset scale requirements
Related Concepts
- vision-transformer — the full model that uses patch embeddings as input
- classification-token — the [CLS] token prepended to the patch sequence
- attention — the mechanism that processes patch token sequences
- positional-encoding — position embeddings added to patch embeddings
- transfer-learning — ViT’s patch embeddings enable transfer to diverse vision tasks
- inductive-bias — lack of local inductive bias is patch embedding’s key trade-off
Open Questions
- What is the optimal patch size for variable-resolution inputs (e.g., high-resolution medical images)?
- Can overlapping patches (smaller stride) improve performance enough to justify the compute cost?
- Do hierarchical patch embeddings (Swin Transformer’s approach: smaller patches merged progressively) outperform flat 16×16 ViT at scale?