What It Is
Contrastive learning trains a model by pulling representations of similar pairs together and pushing dissimilar pairs apart — without requiring labels. The model learns what makes things alike by learning what makes them different.
Why It Matters
It’s how you get powerful visual (and multimodal) representations from unlabeled data. CLIP, SimCLR, and MoCo are all contrastive; DINO is a closely related self-distillation method that drops explicit negatives. The representations generalize remarkably well to downstream tasks without any task-specific fine-tuning.
How It Works
Given a batch of N image-text pairs, the contrastive loss (InfoNCE) works as follows:
For each image i, encode it to v_i. For each paired text i, encode it to t_i. The model should score (v_i, t_i) high and (v_i, t_j) for j≠i low.
Images: v_1 v_2 v_3 ... v_N
Texts: t_1 t_2 t_3 ... t_N
Similarity matrix S[i,j] = v_i · t_j / τ (embeddings L2-normalized, τ a learned temperature)
Goal: diagonal entries (matched pairs) >> off-diagonal (mismatches)
Loss = ½ · [CrossEntropy over rows + CrossEntropy over columns], with the diagonal as the target
The off-diagonal entries are the “negatives”: in a batch of N=4096, each pair is contrasted against 4095 mismatches. Larger batches supply more negatives, and therefore harder ones, which generally yields better representations. CLIP used 32,768 pairs per batch.
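The loss above can be sketched in a few lines of numpy. This is a minimal illustration, not CLIP’s actual implementation; the function name, batch size, and temperature value are placeholders, and real training would use a deep-learning framework with learned encoders.

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched (image, text) pairs.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix S[i, j] = v_i . t_j, scaled by temperature
    S = v @ t.T / temperature

    # Cross-entropy with the diagonal as the target class
    def xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        idx = np.arange(len(logits))
        return -log_probs[idx, idx].mean()

    # Average the image->text (rows) and text->image (columns) directions
    return 0.5 * (xent(S) + xent(S.T))

# Toy check: matched pairs with nearly identical embeddings should
# score a much lower loss than randomly paired embeddings.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(v, v + 0.01 * rng.normal(size=v.shape))
loss_random = info_nce_loss(v, rng.normal(size=(8, 16)))
```

With well-aligned pairs the diagonal dominates each row and column, so the loss approaches zero; with random pairings it sits near log N.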
What’s Clever
The label is implicit in the pairing. You don’t need a human to say “this is a cat” — you just need naturally co-occurring pairs (image + caption, two augmented views of the same image). The internet provides millions of these for free.
Key Sources
- clip-learning-transferable-visual-models — the defining application of contrastive learning to vision-language alignment
Related Concepts
- vision-transformer
- zero-shot-transfer
- in-context-learning
- multimodal-embeddings — contrastive training on image-text pairs produces multimodal embedding spaces like CLIP’s