What It Is
Multimodal embeddings are vector representations that place different modalities (image, text, audio, video) into a shared geometric space — where semantically similar concepts are nearby regardless of what modality they come from. “A dog” (text) and a photo of a dog (image) should map to nearby points in the shared space. The goal is a single coordinate system for meaning that transcends the raw signal type.
Why It Matters
A shared embedding space enables capabilities that modality-specific models can’t achieve: cross-modal retrieval (find images by text query), zero-shot classification on arbitrary categories, image-text matching without task-specific training, and multimodal reasoning where both text and image evidence combine in a unified representation. It’s the foundation of CLIP, ImageBind, and all modern vision-language models. Without it, every cross-modal task would require paired supervised data for the specific modality combination.
How It Works
Contrastive Alignment
The dominant approach: train separate encoders for each modality, project their outputs to the same dimension, and align them using contrastive loss. After training, dot product (or cosine similarity) between embeddings from different modalities measures semantic similarity.
Image: [photo of a dog] → Image Encoder (ViT) → projection head → v_image ∈ ℝ^512
Text: ["a dog"] → Text Encoder (BERT) → projection head → v_text ∈ ℝ^512
Loss: push (v_image, v_text) pairs close; push non-matching pairs apart
After training:
cosine_sim("a dog", photo_of_dog) ≈ 0.93 ← same concept
cosine_sim("a cat", photo_of_dog) ≈ 0.21 ← different concept
cosine_sim("a dog", photo_of_cat) ≈ 0.19 ← different concept
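These similarity checks can be reproduced mechanically. A minimal sketch with toy 4-dimensional vectors (real CLIP embeddings are 512-dimensional; all values here are invented for illustration):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of L2-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings" (illustrative only; real encoders output 512-d).
v_text_dog = np.array([0.9, 0.1, 0.0, 0.1])
v_img_dog  = np.array([0.8, 0.2, 0.1, 0.0])
v_text_cat = np.array([0.1, 0.9, 0.1, 0.0])

print(cosine_sim(v_text_dog, v_img_dog))  # high: same concept
print(cosine_sim(v_text_cat, v_img_dog))  # lower: different concept
```

After real contrastive training, the analogous cross-modal comparison is what powers retrieval and zero-shot classification.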
CLIP’s InfoNCE Loss
CLIP (Radford et al., 2021) trains on 400 million image-text pairs from the internet. For a batch of N pairs, the loss is a symmetric InfoNCE objective:

L = −(1/2N) · Σᵢ [ log( exp(s(vᵢ, tᵢ)/τ) / Σⱼ exp(s(vᵢ, tⱼ)/τ) ) + log( exp(s(vᵢ, tᵢ)/τ) / Σⱼ exp(s(vⱼ, tᵢ)/τ) ) ]

Where:
- s(v, t) — cosine similarity between image embedding v and text embedding t
- τ — temperature parameter (learned); controls how sharply the loss discriminates
- The two terms enforce alignment symmetrically: image→text and text→image
For a batch of N=32,768 (CLIP’s training batch size), each positive pair must be distinguished from 32,767 negative pairs. This forces the encoders to learn rich semantic structure — surface features aren’t sufficient to separate 32K candidates.
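The objective above is straightforward to implement. A NumPy sketch of the symmetric loss (τ is fixed here, whereas CLIP learns it jointly; `clip_loss` is an illustrative name, not CLIP's actual code):

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch: row i of each array is a matched pair."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau           # (N, N); positives on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # Average the image→text and text→image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
print(clip_loss(emb, emb))                      # matched pairs → loss near 0
print(clip_loss(emb, np.roll(emb, 1, axis=0)))  # shuffled pairs → large loss
```

Each row and column of the logits matrix is treated as a classification problem whose correct answer is the diagonal entry, which is exactly the "distinguish the positive from N−1 negatives" framing.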
The Modality Gap Problem
Even after contrastive training, image and text embeddings don’t perfectly overlap — they cluster in different cones of the shared space. The “modality gap” means that a text embedding for “dog” and an image embedding for a dog are close, but not as close as two text embeddings for similar concepts. This gap persists because the encoders start from different initializations and learn different features.
Consequence: cross-modal retrieval is harder than same-modal retrieval, especially for fine-grained distinctions where modality-specific cues matter.
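One common way to quantify the gap is the distance between the per-modality centroids of the normalized embeddings. A synthetic sketch of the cone geometry described above (the cone centers, spread, and all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def sample_cone(center: np.ndarray, n: int, spread: float = 0.15) -> np.ndarray:
    """Unit vectors scattered around a modality-specific direction."""
    x = center + spread * rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Each modality clusters in its own cone of the shared sphere.
img_center = rng.normal(size=d); img_center /= np.linalg.norm(img_center)
txt_center = rng.normal(size=d); txt_center /= np.linalg.norm(txt_center)
img = sample_cone(img_center, 200)
txt = sample_cone(txt_center, 200)

within = float((img @ img.T).mean())  # same-modal similarity
cross  = float((img @ txt.T).mean())  # cross-modal similarity
gap    = float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))
print(f"same-modal sim {within:.2f}, cross-modal sim {cross:.2f}, centroid gap {gap:.2f}")
```

In this toy geometry, same-modal similarities exceed cross-modal ones for the same reason as in trained models: the two modalities occupy displaced regions of the sphere.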
Beyond Two Modalities: ImageBind
ImageBind (Meta, 2023) aligns 6 modalities (image, text, audio, depth, thermal, IMU) by using image as a shared pivot. Train image-audio alignment, image-text alignment, image-depth alignment — and audio-text alignment emerges for free, even without ever training on audio-text pairs. Transitivity of the shared space enables zero-shot cross-modal transfer that was never explicitly trained.
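The transitivity claim can be illustrated with a toy simulation: align audio and text to a shared image anchor only, and matched audio-text pairs end up more similar than mismatched ones (dimensions, noise levels, and the `align_to` stand-in are all invented; this is not ImageBind's training procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 32

# Image embeddings act as the pivot; one per scene.
image = rng.normal(size=(n, d))
image /= np.linalg.norm(image, axis=1, keepdims=True)

def align_to(anchor: np.ndarray, noise: float = 0.2) -> np.ndarray:
    """Stand-in for training a modality's encoder toward the image anchor."""
    x = anchor + noise * rng.normal(size=anchor.shape)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

audio = align_to(image)  # trained only against image
text  = align_to(image)  # trained only against image

# Audio-text alignment was never trained, but emerges through the pivot.
matched    = float((audio * text).sum(axis=1).mean())
mismatched = float((audio * np.roll(text, 1, axis=0)).sum(axis=1).mean())
print(f"matched audio-text sim {matched:.2f} vs mismatched {mismatched:.2f}")
```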
What’s Clever
The non-obvious insight: you don’t need task-specific supervision. CLIP never trained on ImageNet classification, yet achieves 76.2% zero-shot ImageNet accuracy, matching a supervised ResNet-50. The image-text pretraining implicitly taught the visual categories because the web-crawled captions named visual concepts. The “classifier” at test time is just the text embedding of each class name.
Second insight: scale overcomes noise. 400M pairs from the internet contain enormous noise (mismatched captions, low-quality images) — but the contrastive objective is self-supervised, so it doesn’t need clean labels. The noise is averaged away at scale.
The temperature parameter τ is critical: too high (soft) → all pairs look similar, no discriminative signal. Too low (sharp) → gradients vanish for most pairs. CLIP learns τ jointly, and its final value encodes how “peaky” the alignment space is.
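The temperature effect is easy to see on a toy similarity vector: dividing by τ before the softmax controls how much probability mass the positive pair receives (the scores below are made up):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One positive pair (similarity 0.9) and three negatives.
sims = np.array([0.9, 0.3, 0.2, 0.1])

for tau in (1.0, 0.07, 0.01):
    p = softmax(sims / tau)
    print(f"tau={tau:<5} P(positive) = {p[0]:.3f}")
```

At τ=1.0 the distribution is nearly flat (weak gradient signal toward the positive); at small τ almost all mass concentrates on the highest-similarity pair.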
Applications
- Zero-shot image classification: Compute cosine similarity between image embedding and text embeddings of class names. Take the argmax. No ImageNet training required.
- Cross-modal retrieval: Given a text query, find the closest image embeddings. Scales to billions of images (nearest-neighbor in embedding space).
- Multimodal generation: DALL-E and Stable Diffusion condition image generation on CLIP text embeddings.
- VLM fusion: In vision-language models (LLaVA, Flamingo), the image encoder produces patch embeddings that are directly fed into the language model as “visual tokens” alongside text.
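The zero-shot recipe from the first bullet fits in a few lines. A runnable sketch with a mocked encoder standing in for CLIP's trained text encoder (`zero_shot_classify`, `mock_encode_text`, and the prompt template are illustrative assumptions, not CLIP's API):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """The 'classifier' is just the text embeddings of the class names."""
    txt = np.stack([encode_text(f"a photo of a {c}") for c in class_names])
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    scores = txt @ img                        # cosine similarity per class
    return class_names[int(scores.argmax())]  # take the argmax

def mock_encode_text(prompt: str, d: int = 16) -> np.ndarray:
    """Deterministic stand-in encoder: each prompt maps to a fixed unit vector."""
    seed = sum(ord(ch) * (i + 1) for i, ch in enumerate(prompt))
    v = np.random.default_rng(seed).normal(size=d)
    return v / np.linalg.norm(v)

classes = ["dog", "cat", "car"]
# Pretend the image encoder placed this photo near the "dog" text embedding.
image_emb = mock_encode_text("a photo of a dog") + \
    0.05 * np.random.default_rng(1).normal(size=16)
print(zero_shot_classify(image_emb, classes, mock_encode_text))
```

With real encoders, swapping in a new class list requires nothing but new text embeddings, which is what makes the categories arbitrary.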
Key Sources
- clip-learning-transferable-visual-models — CLIP; 400M image-text pairs, InfoNCE loss, 76.2% zero-shot ImageNet accuracy
Related Concepts
- contrastive-learning — the training objective that aligns the embedding spaces
- zero-shot-transfer — the downstream capability enabled by shared embedding spaces
- vision-transformer — the image encoder architecture used by CLIP and most modern VLMs
- patch-embeddings — how images are tokenized before being embedded
- early-fusion — alternative to shared embeddings; directly combine modalities in a unified architecture
Open Questions
- Can the modality gap be closed, or is it an inevitable consequence of different encoder architectures?
- How do multimodal embeddings handle modality-specific concepts (music with no visual correlate, text describing non-visual sensations)?
- Do larger contrastive models continue to improve on the same CLIP-style scaling curves?