Summary

CLIP trains two encoders, one for images and one for text, to produce vectors in a shared embedding space, using a contrastive objective on 400 million (image, text) pairs collected from the internet. The training task is simply "match each image to its caption" within a large batch. The resulting model performs zero-shot image classification by encoding class names as text and picking the text embedding most similar to the image embedding, with no fine-tuning needed. Zero-shot ViT-L/14 CLIP matches supervised ResNet-50 top-1 accuracy on ImageNet (76.2%) without using any of its 1.28M labeled training examples.

Key Claims

  • Contrastive pre-training on 400M internet (image, text) pairs learns transferable visual representations
  • Zero-shot transfer: create text prompts for class names, find highest cosine similarity to image embedding — no task-specific training
  • Matches ResNet-50 top-1 accuracy on ImageNet zero-shot (76.2%) with zero labeled examples
  • Prompt engineering matters: “a photo of a {label}” outperforms bare label strings
  • Generalizes across 30+ vision benchmarks spanning OCR, action recognition, geo-localization, and many fine-grained classification tasks
  • Weaknesses: certain fine-grained tasks (FGVC Aircraft ~24% zero-shot), counting, abstract reasoning, out-of-distribution domains
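The prompt-engineering claim above is easy to make concrete. A minimal sketch (the template string follows the paper's "a photo of a {label}" example; the helper name `build_prompts` is ours):

```python
def build_prompts(labels, template="a photo of a {}."):
    """Wrap bare class names in a natural-language template.

    The paper reports that templated prompts beat raw label strings
    as zero-shot classifier weights, and that ensembling many
    templates per class helps further.
    """
    return [template.format(label) for label in labels]
```

Each resulting string is encoded once by the text encoder to form the classifier for that class.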

Methods

Architecture: Two encoders trained jointly. Image encoder: ResNet (modified) or Vision Transformer (ViT-B/32, ViT-L/14). Text encoder: Transformer with masked self-attention. Both produce L2-normalized embeddings.

Training objective: Symmetric InfoNCE / contrastive loss. For a batch of N pairs, compute the N×N matrix of scaled cosine similarities S[i,j] = cos(img_i, txt_j) · exp(τ), where τ is a learned log-temperature (logit scale). Training pushes the diagonal (correct pairs) up and the off-diagonal entries down: Loss = (1/2)[CrossEntropy over rows of S + CrossEntropy over rows of Sᵀ], with labels 0..N−1 so each pair is its own class.
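A minimal NumPy sketch of the symmetric loss, assuming the encoders have already produced (N, D) embedding matrices. The initial value exp(τ) = 1/0.07 matches the paper's logit-scale initialization; the function names are ours:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, log_tau=np.log(1 / 0.07)):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    img_emb, txt_emb: (N, D) arrays from the two encoders.
    log_tau: learned log-temperature; exp(log_tau) scales the logits.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # N x N scaled similarity matrix: S[i, j] = cos(img_i, txt_j) * exp(tau)
    logits = img @ txt.T * np.exp(log_tau)

    def xent_diag(rows):
        # cross-entropy with the diagonal entry as the correct class
        rows = rows - rows.max(axis=1, keepdims=True)  # numerical stability
        logp = rows - np.log(np.exp(rows).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average of image->text (rows) and text->image (columns) directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With perfectly aligned pairs (e.g. identical, orthogonal embeddings) the loss is near zero; mismatching the pairs drives it up, which is exactly the signal the diagonal labels provide.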

Scale: 400M (image, text) pairs (a dataset the paper calls WIT, for WebImageText), batch size 32,768, trained from scratch. No ImageNet labels or other manual annotation; supervision comes only from naturally occurring captions.

Zero-shot inference: For K-class classification, encode the K text prompts ("a photo of a {class}") once and cache them. For each test image, compute cosine similarity against all K text vectors and predict the class with the highest score. No gradient updates are needed.
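The inference step above reduces to a single argmax over cosine similarities. A sketch with placeholder embeddings (in a real pipeline both inputs would come from CLIP's encoders; the function name is ours):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Predict a class index by cosine similarity against K cached text embeddings.

    image_emb: (D,) embedding of one image.
    class_text_embs: (K, D) embeddings of prompts like "a photo of a {class}",
    computed once per task and reused for every test image.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # highest cosine similarity wins
```

Because the text side is computed once, classifying an image costs one encoder forward pass plus a K-dimensional dot product, regardless of the task.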

Connections

Citation

arXiv:2103.00020

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  booktitle={International Conference on Machine Learning},
  year={2021}
}