What It Is

Contrastive learning trains a model by pulling representations of similar pairs together and pushing dissimilar pairs apart — without requiring labels. The model learns what makes things alike by learning what makes them different.

Why It Matters

It’s how you get powerful visual (and multimodal) representations from unlabeled data. CLIP, SimCLR, and MoCo are all contrastive; DINO, a close relative, replaces explicit negatives with self-distillation. The representations generalize remarkably well to downstream tasks — often with just a linear probe or zero-shot transfer, no task-specific fine-tuning.

How It Works

Given a batch of N image-text pairs, the contrastive loss (InfoNCE) works as follows:

Encode each image i to an embedding v_i and its paired text to an embedding t_i. The model should score the matched pair (v_i, t_i) high and every mismatched pair (v_i, t_j), j≠i, low.

Images:  v_1  v_2  v_3  ...  v_N
Texts:   t_1  t_2  t_3  ...  t_N

Similarity matrix S[i,j] = v_i · t_j / τ   (embeddings L2-normalized; τ is a temperature, learned in CLIP)

Goal: diagonal entries (matched pairs) >> off-diagonal (mismatches)

Loss = CrossEntropy(S, identity matrix)  ← applied over rows (image → text) and columns (text → image), then averaged; the target for row i is column i

The off-diagonal entries are the “negatives” — in a batch of N=4096, every pair gets 4095 of them for free. Larger batches mean more negatives (and a better chance of hard ones), which tends to yield better representations. CLIP used 32,768 pairs per batch.
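The loss above can be sketched in a few lines of NumPy. This is an illustrative toy, not CLIP's actual code: the function name `clip_loss`, the fixed temperature value, and the helper `cross_entropy` are all assumptions made for the example.

```python
import numpy as np

def clip_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N (image, text) embeddings.

    v, t: arrays of shape (N, d), assumed L2-normalized.
    """
    logits = v @ t.T / temperature     # S[i, j] = v_i · t_j / τ
    labels = np.arange(len(v))         # matched pair i sits on the diagonal

    def cross_entropy(logits, labels):
        # log-softmax over each row, then NLL of the true column
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    loss_i2t = cross_entropy(logits, labels)    # rows: image -> text
    loss_t2i = cross_entropy(logits.T, labels)  # columns: text -> image
    return (loss_i2t + loss_t2i) / 2
```

Note that the "labels" are just `arange(N)` — the batch index itself, not anything annotated by a human.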

What’s Clever

The label is implicit in the pairing. You don’t need a human to say “this is a cat” — you just need naturally co-occurring pairs (image + caption, two augmented views of the same image). The internet provides millions of these for free.
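The "two augmented views" recipe can be sketched as a toy, SimCLR-style pairing. Here `augment` is a stand-in (Gaussian noise) for real image augmentations like random crops and color jitter, and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng, noise=0.1):
    # stand-in for random crop / color jitter: perturb with Gaussian noise
    return x + noise * rng.normal(size=x.shape)

batch = rng.normal(size=(4, 8))   # 4 raw samples, no human labels anywhere
view_a = augment(batch, rng)      # first augmented view of each sample
view_b = augment(batch, rng)      # second, independently augmented view

# positives: (view_a[i], view_b[i]) — same underlying sample, same index
positive_pairs = list(zip(view_a, view_b))
# negatives: (view_a[i], view_b[j]) for j != i — supplied implicitly by the batch
```

The pairing is the label: index i in one view matches index i in the other, so the supervision signal comes entirely from how the data was constructed.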

Key Sources