What It Is
Multimodal embeddings are vector representations that place different modalities (image, text, audio, video) into a shared geometric space — where semantically similar concepts are nearby regardless of what modality they come from. “A dog” (text) and a photo of a dog (image) should map to nearby points in the shared space. The goal is a single coordinate system for meaning that transcends the raw signal type.
Why It Matters
A shared embedding space enables capabilities that modality-specific models can’t achieve: cross-modal retrieval (find images by text query), zero-shot classification on arbitrary categories, image-text matching without task-specific training, and multimodal reasoning where both text and image evidence combine in a unified representation. It’s the foundation of CLIP, ImageBind, and all modern vision-language models. Without it, every cross-modal task would require paired supervised data for the specific modality combination.
How It Works
Contrastive Alignment
The dominant approach: train separate encoders for each modality, project their outputs to the same dimension, and align them using contrastive loss. After training, dot product (or cosine similarity) between embeddings from different modalities measures semantic similarity.
Image: [photo of a dog] → Image Encoder (ViT) → projection head → v_image ∈ ℝ^512
Text: ["a dog"] → Text Encoder (BERT) → projection head → v_text ∈ ℝ^512
Loss: push (v_image, v_text) pairs close; push non-matching pairs apart
After training:
cosine_sim("a dog", photo_of_dog) ≈ 0.93 ← same concept
cosine_sim("a cat", photo_of_dog) ≈ 0.21 ← different concept
cosine_sim("a dog", photo_of_cat) ≈ 0.19 ← different concept
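These similarity checks can be reproduced mechanically. A minimal sketch with toy 4-dimensional vectors (real CLIP embeddings are 512-dimensional; all values here are invented for illustration):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of L2-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings" (illustrative only; real encoders output 512-d).
v_text_dog = np.array([0.9, 0.1, 0.0, 0.1])
v_img_dog  = np.array([0.8, 0.2, 0.1, 0.0])
v_text_cat = np.array([0.1, 0.9, 0.1, 0.0])

print(cosine_sim(v_text_dog, v_img_dog))  # high: same concept
print(cosine_sim(v_text_cat, v_img_dog))  # lower: different concept
```

After real contrastive training, the analogous cross-modal comparison is what powers retrieval and zero-shot classification.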
CLIP’s InfoNCE Loss
CLIP (Radford et al., 2021) trains on 400 million image-text pairs from the internet. For a batch of N pairs, the loss is a symmetric InfoNCE objective:

L = −(1/2N) · Σᵢ [ log( exp(s(vᵢ, tᵢ)/τ) / Σⱼ exp(s(vᵢ, tⱼ)/τ) ) + log( exp(s(vᵢ, tᵢ)/τ) / Σⱼ exp(s(vⱼ, tᵢ)/τ) ) ]

Where:
- s(v, t) — cosine similarity between image embedding v and text embedding t
- τ — temperature parameter (learned); controls how sharply the loss discriminates
- The two terms enforce alignment symmetrically: image→text and text→image
For a batch of N=32,768 (CLIP’s training batch size), each positive pair must be distinguished from 32,767 negative pairs. This forces the encoders to learn rich semantic structure — surface features aren’t sufficient to separate 32K candidates.
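The objective above is straightforward to implement. A NumPy sketch of the symmetric loss (τ is fixed here, whereas CLIP learns it jointly; `clip_loss` is an illustrative name, not CLIP's actual code):

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch: row i of each array is a matched pair."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau           # (N, N); positives on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # Average the image→text and text→image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
print(clip_loss(emb, emb))                      # matched pairs → loss near 0
print(clip_loss(emb, np.roll(emb, 1, axis=0)))  # shuffled pairs → large loss
```

Each row and column of the logits matrix is treated as a classification problem whose correct answer is the diagonal entry, which is exactly the "distinguish the positive from N−1 negatives" framing.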
The Modality Gap Problem
Even after contrastive training, image and text embeddings don’t perfectly overlap — they cluster in different cones of the shared space. The “modality gap” means that a text embedding for “dog” and an image embedding for a dog are close, but not as close as two text embeddings for similar concepts. This gap persists because the encoders start from different initializations and learn different features.
Consequence: cross-modal retrieval is harder than same-modal retrieval, especially for fine-grained distinctions where modality-specific cues matter.
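One common way to quantify the gap is the distance between the per-modality centroids of the normalized embeddings. A synthetic sketch of the cone geometry described above (the cone centers, spread, and all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def sample_cone(center: np.ndarray, n: int, spread: float = 0.15) -> np.ndarray:
    """Unit vectors scattered around a modality-specific direction."""
    x = center + spread * rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Each modality clusters in its own cone of the shared sphere.
img_center = rng.normal(size=d); img_center /= np.linalg.norm(img_center)
txt_center = rng.normal(size=d); txt_center /= np.linalg.norm(txt_center)
img = sample_cone(img_center, 200)
txt = sample_cone(txt_center, 200)

within = float((img @ img.T).mean())  # same-modal similarity
cross  = float((img @ txt.T).mean())  # cross-modal similarity
gap    = float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))
print(f"same-modal sim {within:.2f}, cross-modal sim {cross:.2f}, centroid gap {gap:.2f}")
```

In this toy geometry, same-modal similarities exceed cross-modal ones for the same reason as in trained models: the two modalities occupy displaced regions of the sphere.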
Beyond Two Modalities: ImageBind
ImageBind (Meta, 2023) aligns 6 modalities (image, text, audio, depth, thermal, IMU) by using image as a shared pivot. Train image-audio alignment, image-text alignment, image-depth alignment — and audio-text alignment emerges for free, even without ever training on audio-text pairs. Transitivity of the shared space enables zero-shot cross-modal transfer that was never explicitly trained.
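The transitivity claim can be illustrated with a toy simulation: align audio and text to a shared image anchor only, and matched audio-text pairs end up more similar than mismatched ones (dimensions, noise levels, and the `align_to` stand-in are all invented; this is not ImageBind's training procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 32

# Image embeddings act as the pivot; one per scene.
image = rng.normal(size=(n, d))
image /= np.linalg.norm(image, axis=1, keepdims=True)

def align_to(anchor: np.ndarray, noise: float = 0.2) -> np.ndarray:
    """Stand-in for training a modality's encoder toward the image anchor."""
    x = anchor + noise * rng.normal(size=anchor.shape)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

audio = align_to(image)  # trained only against image
text  = align_to(image)  # trained only against image

# Audio-text alignment was never trained, but emerges through the pivot.
matched    = float((audio * text).sum(axis=1).mean())
mismatched = float((audio * np.roll(text, 1, axis=0)).sum(axis=1).mean())
print(f"matched audio-text sim {matched:.2f} vs mismatched {mismatched:.2f}")
```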
What’s Clever
The non-obvious insight: you don’t need task-specific supervision. CLIP never trained on ImageNet classification, yet achieves 76.2% zero-shot ImageNet accuracy, matching a supervised ResNet-50. The image-text pretraining implicitly taught the visual categories because the web-crawled captions named visual concepts. The “classifier” at test time is just the text embedding of each class name.
Second insight: scale overcomes noise. 400M pairs from the internet contain enormous noise (mismatched captions, low-quality images) — but the contrastive objective is self-supervised, so it doesn’t need clean labels. The noise is averaged away at scale.
The temperature parameter τ is critical: too high (soft) → all pairs look similar, no discriminative signal. Too low (sharp) → gradients vanish for most pairs. CLIP learns τ jointly, and its final value encodes how “peaky” the alignment space is.
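The temperature effect is easy to see on a toy similarity vector: dividing by τ before the softmax controls how much probability mass the positive pair receives (the scores below are made up):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One positive pair (similarity 0.9) and three negatives.
sims = np.array([0.9, 0.3, 0.2, 0.1])

for tau in (1.0, 0.07, 0.01):
    p = softmax(sims / tau)
    print(f"tau={tau:<5} P(positive) = {p[0]:.3f}")
```

At τ=1.0 the distribution is nearly flat (weak gradient signal toward the positive); at small τ almost all mass concentrates on the highest-similarity pair.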
Applications
- Zero-shot image classification: Compute cosine similarity between image embedding and text embeddings of class names. Take the argmax. No ImageNet training required.
- Cross-modal retrieval: Given a text query, find the closest image embeddings. Scales to billions of images (nearest-neighbor in embedding space).
- Multimodal generation: DALL-E and Stable Diffusion condition image generation on CLIP text embeddings.
- VLM fusion: In vision-language models (LLaVA, Flamingo), the image encoder produces patch embeddings that are directly fed into the language model as “visual tokens” alongside text.
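The zero-shot recipe from the first bullet fits in a few lines. A runnable sketch with a mocked encoder standing in for CLIP's trained text encoder (`zero_shot_classify`, `mock_encode_text`, and the prompt template are illustrative assumptions, not CLIP's API):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """The 'classifier' is just the text embeddings of the class names."""
    txt = np.stack([encode_text(f"a photo of a {c}") for c in class_names])
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    scores = txt @ img                        # cosine similarity per class
    return class_names[int(scores.argmax())]  # take the argmax

def mock_encode_text(prompt: str, d: int = 16) -> np.ndarray:
    """Deterministic stand-in encoder: each prompt maps to a fixed unit vector."""
    seed = sum(ord(ch) * (i + 1) for i, ch in enumerate(prompt))
    v = np.random.default_rng(seed).normal(size=d)
    return v / np.linalg.norm(v)

classes = ["dog", "cat", "car"]
# Pretend the image encoder placed this photo near the "dog" text embedding.
image_emb = mock_encode_text("a photo of a dog") + \
    0.05 * np.random.default_rng(1).normal(size=16)
print(zero_shot_classify(image_emb, classes, mock_encode_text))
```

With real encoders, swapping in a new class list requires nothing but new text embeddings, which is what makes the categories arbitrary.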
Key Sources
- clip-learning-transferable-visual-models — CLIP; 400M image-text pairs, InfoNCE loss, 76.2% zero-shot ImageNet accuracy
Related Concepts
- contrastive-learning — the training objective that aligns the embedding spaces
- zero-shot-transfer — the downstream capability enabled by shared embedding spaces
- vision-transformer — the image encoder architecture used by CLIP and most modern VLMs
- patch-embeddings — how images are tokenized before being embedded
- early-fusion — alternative to shared embeddings; directly combine modalities in a unified architecture
Open Questions
- Can the modality gap be closed, or is it an inevitable consequence of different encoder architectures?
- How do multimodal embeddings handle modality-specific concepts (music with no visual correlate, text describing non-visual sensations)?
- Do larger contrastive models continue to improve on the same CLIP-style scaling curves?