What It Is
Contrastive learning trains a model by pulling representations of similar pairs together and pushing dissimilar pairs apart — without requiring labels. The model learns what makes things alike by learning what makes them different.
Why It Matters
It’s how you get powerful visual (and multimodal) representations from unlabeled data. CLIP, SimCLR, and MoCo are all contrastive; DINO is a closely related self-distillation method that drops explicit negatives. The representations generalize remarkably well to downstream tasks without any task-specific fine-tuning.
How It Works
Given a batch of N image-text pairs, the contrastive loss (InfoNCE) works as follows:
For each image i, encode it to v_i. For each paired text i, encode it to t_i. The model should score (v_i, t_i) high and (v_i, t_j) for j≠i low.
Images: v_1 v_2 v_3 ... v_N
Texts: t_1 t_2 t_3 ... t_N
Similarity matrix S[i,j] = v_i · t_j / τ (embeddings L2-normalized, τ a learned temperature)
Goal: diagonal entries (matched pairs) >> off-diagonal (mismatches)
Loss = ½ · [CrossEntropy over rows + CrossEntropy over columns], with the diagonal as the target
The off-diagonal entries are the “negatives”: in a batch of N=4096, each pair is contrasted against 4095 mismatches. Larger batches supply more negatives, and therefore harder ones, which generally yields better representations. CLIP used 32,768 pairs per batch.
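The loss above can be sketched in a few lines of numpy. This is a minimal illustration, not CLIP’s actual implementation; the function name, batch size, and temperature value are placeholders, and real training would use a deep-learning framework with learned encoders.

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched (image, text) pairs.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix S[i, j] = v_i . t_j, scaled by temperature
    S = v @ t.T / temperature

    # Cross-entropy with the diagonal as the target class
    def xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        idx = np.arange(len(logits))
        return -log_probs[idx, idx].mean()

    # Average the image->text (rows) and text->image (columns) directions
    return 0.5 * (xent(S) + xent(S.T))

# Toy check: matched pairs with nearly identical embeddings should
# score a much lower loss than randomly paired embeddings.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(v, v + 0.01 * rng.normal(size=v.shape))
loss_random = info_nce_loss(v, rng.normal(size=(8, 16)))
```

With well-aligned pairs the diagonal dominates each row and column, so the loss approaches zero; with random pairings it sits near log N.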
What’s Clever
The label is implicit in the pairing. You don’t need a human to say “this is a cat” — you just need naturally co-occurring pairs (image + caption, two augmented views of the same image). The internet provides millions of these for free.
Key Sources
- clip-learning-transferable-visual-models — the defining application of contrastive learning to vision-language alignment
Related Concepts
- vision-transformer
- zero-shot-transfer
- in-context-learning
- multimodal-embeddings — contrastive training on image-text pairs produces multimodal embedding spaces like CLIP’s