From Pixels to Understanding — Vision-Language Models

This path follows the development of vision-language understanding: from the convolutional era that defined computer vision for a decade, through the self-supervised revolution, to the transformer architectures that unified vision and language into shared representation spaces.

Step 1 — ResNet: Deep Residual Learning

deep-residual-learning-for-image-recognition

Start here, not with transformers. Before Vision Transformers, ResNet (2015) was the backbone of computer vision. Its key insight is the residual connection: each block computes y = x + F(x), so gradients have an identity path through very deep networks, which made training 50-, 101-, and 152-layer networks stable. ResNets won the ILSVRC 2015 ImageNet classification challenge and became the standard feature extractor for years of downstream work. Understanding what CNNs do well (local feature hierarchies, translation equivariance) and where they fall short (global context is reached only by stacking many layers) motivates why the field moved to transformers.
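The residual idea can be sketched in a few lines. This is a minimal illustration, not the ResNet architecture itself: it assumes dense layers in place of convolutions and omits batch normalization, and the names `residual_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(x + F(x)), where F is a small two-layer transform.

    Assumption: F uses dense layers here; ResNet uses convolutions and batch norm.
    """
    f = relu(x @ w1) @ w2      # residual branch F(x)
    return relu(x + f)         # identity shortcut added before the final nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# With a zeroed residual branch the block reduces to relu(x):
# the shortcut carries the input (and its gradient) straight through.
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
out = residual_block(x, w1, w2)
```

Because the shortcut is an identity map, a block that has learned nothing useful leaves the signal intact, which is one intuition for why stacking many such blocks stays trainable.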

Step 2 — Vision Transformer (ViT)

vision-transformer

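ViT’s claim: you don’t need convolutions. Split an image into 16×16 patches, linearly project each flattened patch to a token embedding, add position embeddings, and run a standard transformer encoder. With enough pre-training data (the paper uses ImageNet-21k and JFT-300M), this matches or outperforms strong ConvNets with no vision-specific architectural modifications; on smaller datasets, the inductive biases built into ConvNets still give them the edge. The critical insight for this path: patch embeddings are structurally identical to text token embeddings, which is what makes multimodal models possible, because images and text can be processed in the same sequence.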
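A rough sketch of the patch-to-token step, assuming a 224×224 RGB input and a hypothetical embedding width of 64 (real ViT models use larger widths). The helper name `patchify` is an assumption, and the class token and position embeddings are left out.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), in row-major patch order.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image)                       # 14 x 14 = 196 patches of 16*16*3 = 768 values

embed_dim = 64                                  # hypothetical width, for illustration only
w_proj = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ w_proj                       # one token embedding per patch
```

After this step the image is just a sequence of 196 vectors, the same kind of object a text tokenizer plus embedding table would produce.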

Step 3 — MAE: Masked Autoencoders

mae-masked-autoencoders-scalable-vision-learners

Supervised ViT training requires a lot of labeled data. MAE removes the label dependence during pre-training: mask 75% of image patches at random and train the model to reconstruct them. This is BERT’s masked language modeling applied to vision, but with a key difference. Pixels carry much less information per unit than words, and a missing patch can often be guessed from its neighbors, so masking must be aggressive (75% vs BERT’s 15%) to create a non-trivial task. MAE learns strong representations without any labels, and they transfer well once the encoder is fine-tuned on downstream tasks.
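The masking step can be sketched as follows. This is an illustrative outline under stated assumptions: `random_mask` and `masked_mse` are hypothetical helper names, the decoder output is a placeholder, and the loss is averaged over masked patches only, as described in the MAE paper.

```python
import numpy as np

def random_mask(num_patches, mask_ratio=0.75, rng=None):
    """Pick a random subset of patches to keep; the rest are masked.

    Returns (keep_idx, mask) where mask[i] is True for masked patches.
    """
    if rng is None:
        rng = np.random.default_rng()
    num_keep = int(num_patches * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
patches = rng.random((196, 768))                # e.g. flattened 16x16x3 patches
keep_idx, mask = random_mask(len(patches), 0.75, rng)
visible = patches[keep_idx]                     # only these would be fed to the encoder

pred = np.zeros_like(patches)                   # placeholder standing in for a decoder output
loss = masked_mse(pred, patches, mask)
```

Feeding the encoder only the visible quarter of the patches is also what makes MAE pre-training cheap relative to processing the full sequence.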

Step 4 — Contrastive Learning

contrastive-learning

MAE learns by reconstruction. Contrastive learning takes a different route: pull semantically similar pairs together and push dissimilar pairs apart in embedding space. The key insight is that two augmented views of the same image form a positive pair for free, with no labels needed. SimCLR showed this works at scale; CLIP extended it to (image, caption) pairs from the internet. Understanding contrastive objectives is a prerequisite to understanding CLIP.

Step 5 — SimCLR: Contrastive Learning of Visual Representations

simclr-contrastive-learning-visual-representations

SimCLR is the clean, principled implementation of contrastive learning for vision. Take an image, apply two different augmentations (crop, color jitter, blur), and train a network to embed the two views close together while pushing them away from all other images in the batch. The paper systematically ablates what matters: the projection head (critical), batch size (larger is better), and augmentation composition (random cropping combined with color distortion matters most). SimCLR’s clarity made it the reference for understanding why contrastive self-supervision works before studying CLIP.
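The objective described above is usually written as the NT-Xent loss. The version below is a minimal NumPy sketch, assuming the embeddings have already passed through the encoder and projection head; the function name `nt_xent` and the temperature value are illustrative choices, not the paper’s tuned settings.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for N positive pairs (z1[i], z2[i]).

    Each of the 2N embeddings must pick out its partner among the other 2N - 1.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                      # an embedding is never its own negative
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = sim.max(axis=1, keepdims=True)
    logsumexp = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float((logsumexp - sim[np.arange(2 * n), pos]).mean())

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.05 * rng.normal(size=(8, 16))               # stand-in for a second augmented view

loss_aligned = nt_xent(z1, z2)
loss_shuffled = nt_xent(z1, np.roll(z2, 1, axis=0))     # wrong pairings should score worse
```

Every other image in the batch acts as a negative, which is one way to see why larger batches help: more negatives make the task harder and the gradient more informative.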

Step 6 — DINO: Self-Supervised Vision Transformers

dino-self-supervised-vision-transformers

DINO combines ViT with self-supervised learning using a self-distillation objective: a student network is trained to match the outputs of a momentum-updated teacher network, with no contrastive loss and no negative pairs. The surprising result: DINO features yield sharp object segmentations without any segmentation supervision. Visualize the self-attention of the [CLS] token in the last layer and the attention maps highlight the main object. This emergent localization property is markedly stronger in ViTs than in CNN-based self-supervised models.
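The two moving parts, a cross-entropy between teacher and student distributions and an exponential-moving-average teacher, can be sketched like this. It is a simplified outline: the logits are random stand-ins for network outputs on two views, the center is a batch mean rather than the running average DINO maintains, and the temperature and momentum values are illustrative. Centering and low-temperature sharpening of the teacher are the paper’s mechanism for avoiding collapse.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy from the teacher distribution to the student distribution."""
    # Teacher targets: centered, then sharpened with a low temperature.
    # In a real training loop no gradient flows through the teacher.
    p_teacher = np.exp(log_softmax((teacher_logits - center) / t_teacher))
    log_p_student = log_softmax(student_logits / t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student."""
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * student_params[name]
        for name in teacher_params
    }

rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 16))       # stand-in: student output on one view
teacher_logits = rng.normal(size=(4, 16))       # stand-in: teacher output on another view
center = teacher_logits.mean(axis=0)            # simplification of DINO's running center
loss = dino_loss(student_logits, teacher_logits, center)

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, momentum=0.9)    # moves 10% of the way to the student
```

Only the student is trained by gradient descent; the teacher changes solely through the averaging step, so it behaves like a smoothed ensemble of past students.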

Step 7 — DINOv2: Learning Robust Visual Features

dinov2-learning-robust-visual-features

DINOv2 scales and systematizes DINO’s approach. The key additions: a large curated dataset (LVD-142M, about 142 million images automatically filtered for quality and diversity), a training objective that combines DINO’s image-level self-distillation loss with a patch-level masked-prediction loss in the style of iBOT, and distillation of the largest model into smaller ones. The result is a frozen backbone that, without any fine-tuning, gives strong results on depth estimation, semantic segmentation, and classification with only lightweight heads on top. DINOv2 features have become a common choice of visual backbone for multimodal systems.

Step 8 — CLIP

clip-learning-transferable-visual-models

CLIP applies contrastive learning at scale to (image, text) pairs, 400M of them scraped from the internet, training a vision encoder and a text encoder jointly with a symmetric InfoNCE-style loss. The result is a shared embedding space where images and their descriptions land near each other. CLIP’s representations generalize remarkably well: trained only on web image-text pairs, CLIP reaches zero-shot ImageNet accuracy comparable to a supervised ResNet-50 without training on any ImageNet examples. This shared space is the foundation of subsequent vision-language systems.
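A compact sketch of the symmetric loss, assuming a batch of already-computed image and text embeddings where row i of each array belongs to the same pair. The function name `clip_loss` is hypothetical, and the temperature is fixed here for simplicity, whereas CLIP learns it during training.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N); the diagonal holds matched pairs
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # each text picks its image
    return float((loss_i2t + loss_t2i) / 2)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))
text_emb = image_emb + 0.1 * rng.normal(size=(8, 32))          # toy "captions" near their images

loss_matched = clip_loss(image_emb, text_emb)
loss_mismatched = clip_loss(image_emb, np.roll(text_emb, 1, axis=0))
```

The loss is the same batch-wise classification idea as in single-modality contrastive learning, except that the two views now come from different encoders and different modalities.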

Step 9 — Multimodal Embeddings

multimodal-embeddings

CLIP produces a shared space — but using it well requires understanding what multimodal embeddings are and what properties they have. A multimodal embedding maps inputs from different modalities (image, text, audio) into a common vector space where similarity is meaningful across modalities. This step connects the CLIP training story to the downstream retrieval and generation applications that consume these embeddings.

Step 10 — Zero-Shot Transfer

zero-shot-transfer

The payoff of shared embeddings: zero-shot classification without any task-specific training data. Given class names as text, encode them and compare against image embeddings — the closest class name is the prediction. CLIP achieves ImageNet accuracy comparable to a supervised ResNet-50, without ever seeing ImageNet training images. This step explains why the shared embedding space is so powerful and what its limits are.
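The procedure reduces to a nearest-neighbor lookup in the shared space. In the sketch below the vectors are toy stand-ins for encoder outputs: in practice the class embeddings would come from the text encoder applied to prompts such as "a photo of a {label}", and the image embedding from the vision encoder. The function name `zero_shot_classify` is hypothetical.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose text embedding has the highest cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                            # one cosine similarity per class
    return class_names[int(np.argmax(sims))], sims

class_names = ["cat", "dog", "car"]
# Toy vectors standing in for text-encoder outputs, one row per class prompt.
class_text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy vector standing in for the vision-encoder output of one image.
image_emb = np.array([0.1, 0.9, 0.2])

label, sims = zero_shot_classify(image_emb, class_text_embs, class_names)
```

Adding a new class only requires encoding one more text prompt, which is why no task-specific training data is needed.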

Step 11 — Early Fusion

early-fusion

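Zero-shot classification treats vision and language as separate towers that meet at the end. Early fusion goes further: interleave image tokens and text tokens in the same transformer, letting attention patterns span modalities from the first layer. Modern vision-language models fuse modalities inside the language model in related ways: LLaVA projects visual features into the language model’s input sequence, while Flamingo injects them through gated cross-attention layers (GPT-4V’s architecture has not been published). This joint processing is what enables tasks requiring joint understanding, such as answering questions about images, image captioning, and visual reasoning, rather than just matching.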
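A toy illustration of what "same sequence" means, assuming image and text tokens have already been embedded to a common width. The helpers `fuse` and `self_attention` are hypothetical, and the attention uses identity projections and a single head purely to show that every token can attend across modalities.

```python
import numpy as np

def fuse(image_tokens, text_tokens):
    """Concatenate image and text tokens into one sequence, with modality ids."""
    seq = np.concatenate([image_tokens, text_tokens], axis=0)
    modality = np.array([0] * len(image_tokens) + [1] * len(text_tokens))
    return seq, modality

def self_attention(x):
    """Single-head self-attention with identity projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(4, 8))          # e.g. projected patch embeddings
text_tokens = rng.normal(size=(3, 8))           # e.g. word-piece embeddings

seq, modality = fuse(image_tokens, text_tokens)
out, weights = self_attention(seq)

# Rows for text tokens, columns for image tokens: attention that crosses modalities.
text_to_image = weights[modality == 1][:, modality == 0]
```

In a two-tower model this cross-modal block does not exist: the modalities interact only through a final similarity score.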
