Summary

CLIP trains two encoders, one for images and one for text, to produce vectors in a shared embedding space, using a contrastive objective on 400 million (image, text) pairs collected from the internet. The training task is simply "match each image to its caption" within a large batch. The resulting model performs zero-shot image classification by encoding class names as text and picking the text embedding most similar to the image embedding, with no fine-tuning needed. Zero-shot ViT-L/14 CLIP matches supervised ResNet-50 top-1 accuracy on ImageNet (76.2%) without using any of its 1.28M labeled training examples.

Key Claims

  • Contrastive pre-training on 400M internet (image, text) pairs learns transferable visual representations
  • Zero-shot transfer: create text prompts for class names, find highest cosine similarity to image embedding — no task-specific training
  • Matches ResNet-50 top-1 accuracy on ImageNet zero-shot (76.2%) with zero labeled examples
  • Prompt engineering matters: “a photo of a {label}” outperforms bare label strings
  • Generalizes across 30+ vision benchmarks spanning OCR, action recognition, geo-localization, and many fine-grained classification tasks
  • Weaknesses: certain fine-grained tasks (FGVC Aircraft ~24% zero-shot), counting, abstract reasoning, out-of-distribution domains
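The prompt-engineering claim above is easy to make concrete. A minimal sketch (the template string follows the paper's "a photo of a {label}" example; the helper name `build_prompts` is ours):

```python
def build_prompts(labels, template="a photo of a {}."):
    """Wrap bare class names in a natural-language template.

    The paper reports that templated prompts beat raw label strings
    as zero-shot classifier weights, and that ensembling many
    templates per class helps further.
    """
    return [template.format(label) for label in labels]
```

Each resulting string is encoded once by the text encoder to form the classifier for that class.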

Methods

Architecture: Two encoders trained jointly. Image encoder: ResNet (modified) or Vision Transformer (ViT-B/32, ViT-L/14). Text encoder: Transformer with masked self-attention. Both produce L2-normalized embeddings.

Training objective: Symmetric InfoNCE / contrastive loss. For a batch of N pairs, compute the N×N matrix of scaled cosine similarities S[i,j] = cos(img_i, txt_j) · exp(τ), where τ is a learned log-temperature (logit scale). Training pushes the diagonal (correct pairs) up and the off-diagonal entries down: Loss = (1/2)[CrossEntropy over rows of S + CrossEntropy over rows of Sᵀ], with labels 0..N−1 so each pair is its own class.
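A minimal NumPy sketch of the symmetric loss, assuming the encoders have already produced (N, D) embedding matrices. The initial value exp(τ) = 1/0.07 matches the paper's logit-scale initialization; the function names are ours:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, log_tau=np.log(1 / 0.07)):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    img_emb, txt_emb: (N, D) arrays from the two encoders.
    log_tau: learned log-temperature; exp(log_tau) scales the logits.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # N x N scaled similarity matrix: S[i, j] = cos(img_i, txt_j) * exp(tau)
    logits = img @ txt.T * np.exp(log_tau)

    def xent_diag(rows):
        # cross-entropy with the diagonal entry as the correct class
        rows = rows - rows.max(axis=1, keepdims=True)  # numerical stability
        logp = rows - np.log(np.exp(rows).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average of image->text (rows) and text->image (columns) directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With perfectly aligned pairs (e.g. identical, orthogonal embeddings) the loss is near zero; mismatching the pairs drives it up, which is exactly the signal the diagonal labels provide.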

Scale: 400M (image, text) pairs (a dataset the paper calls WIT, for WebImageText), batch size 32,768, trained from scratch. No ImageNet labels or other manual annotation; supervision comes only from naturally occurring captions.

Zero-shot inference: For K-class classification, encode the K text prompts ("a photo of a {class}") once and cache them. For each test image, compute cosine similarity against all K text vectors and predict the class with the highest score. No gradient updates are needed.
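The inference step above reduces to a single argmax over cosine similarities. A sketch with placeholder embeddings (in a real pipeline both inputs would come from CLIP's encoders; the function name is ours):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Predict a class index by cosine similarity against K cached text embeddings.

    image_emb: (D,) embedding of one image.
    class_text_embs: (K, D) embeddings of prompts like "a photo of a {class}",
    computed once per task and reused for every test image.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # highest cosine similarity wins
```

Because the text side is computed once, classifying an image costs one encoder forward pass plus a K-dimensional dot product, regardless of the task.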

Connections

Citation

arXiv:2103.00020

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  booktitle={International Conference on Machine Learning},
  year={2021}
}