Concepts: multimodal-embeddings | vision-language-models | contrastive-learning | cross-attention | vision-transformer
Builds on: clip-learning-transferable-visual-models | an-image-is-worth-16x16-words
Leads to: llava-visual-instruction-tuning

The problem

Training a vision-language model from scratch is expensive. Flamingo, DeepMind’s 2022 flagship, has 80B total parameters, and 10.2B of them had to be trained on billions of image-text pairs scraped from the web. Every time a better image encoder or a better LLM appears, you throw that investment away and start over.

The frustrating thing is that excellent frozen components already exist: CLIP’s ViT can encode images with remarkable fidelity, and GPT-class LLMs reason over text with remarkable power. The problem isn’t the parts. It’s the gap between them. Images speak one language. LLMs speak another. Nobody had found a cheap way to translate.

The core idea

Let’s find the right analogy first.

Imagine a United Nations negotiation. The French diplomat speaks only French. The Japanese ambassador speaks only Japanese. You need them to communicate, but you can’t restructure either delegation — they’re too important, too set in their ways. So you hire an interpreter: fluent in both languages, able to translate concepts across the gap. The interpreter is small, cheap to train, and entirely replaceable if a better one comes along.

BLIP-2’s Q-Former is that interpreter.

The insight is disarmingly simple: freeze both the image encoder and the LLM. Train only a small bridge — 188M parameters — to translate between them. You don’t need to teach the ViT to speak LLM, and you don’t need to teach the LLM to see. You just need a bridge that extracts visual information in a form the LLM already understands.

Here’s the mechanism, step by step:

Q-Former architecture: a compact transformer initialized from BERT-base. It holds 32 learnable query vectors — trainable embeddings that start random and learn to represent visual concepts over training. These queries attend to the frozen ViT’s patch embeddings via cross-attention (inserted every other block), and attend to each other and to text via shared self-attention layers.

“Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, where it feeds the most useful visual feature for the LLM to output the desired text.”
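To make the "queries attend to patches" step concrete, here is a minimal single-head cross-attention sketch in NumPy. It is an illustration under assumptions, not the paper's implementation: the real Q-Former uses multi-head attention inside BERT-style blocks, and the weight matrices `w_q`, `w_k`, `w_v` below are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, patches, w_q, w_k, w_v):
    """Single-head cross-attention: each query reads from all image patches."""
    q = queries @ w_q                          # (32, 768)
    k = patches @ w_k                          # (257, 768)
    v = patches @ w_v                          # (257, 768)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (32, 257)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_queries, d_q = 32, 768        # sizes quoted in the text
num_patches, d_vit = 257, 1024    # ViT-L/14 output for a 224x224 image

queries = rng.normal(scale=0.02, size=(num_queries, d_q))   # learnable in practice
patches = rng.normal(size=(num_patches, d_vit))             # frozen ViT features
w_q = rng.normal(scale=0.02, size=(d_q, d_q))       # hypothetical projection weights
w_k = rng.normal(scale=0.02, size=(d_vit, d_q))
w_v = rng.normal(scale=0.02, size=(d_vit, d_q))

out, attn = query_cross_attention(queries, patches, w_q, w_k, w_v)
print(out.shape)    # (32, 768)
print(attn.shape)   # (32, 257)
```

Whatever the image contains, the output is always 32 × 768: the number of queries, not the number of patches, fixes the size of what comes out.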

Stage 1 — representation learning: The Q-Former is connected to the frozen ViT and trained on 129M image-text pairs with three simultaneous objectives:

  • ITC (Image-Text Contrastive): align query outputs with text representations, pushing matched pairs close and mismatched pairs apart — just like CLIP, but now operating over the compressed 32-query representation instead of a global image embedding
  • ITG (Image-grounded Text Generation): given the image, generate the matching caption — forcing the queries to encode all information needed to produce text
  • ITM (Image-Text Matching): binary classification — does this image match this text? — training fine-grained alignment
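Of the three objectives, ITM is the simplest to sketch. The paper describes feeding each query output through a two-class linear classifier and averaging the logits across queries to get the match score. The NumPy sketch below follows that description; the classifier weights `w` and `b` are random stand-ins, and the helper names `itm_score` and `itm_loss` are made up for illustration.

```python
import numpy as np

def itm_score(query_out, w, b):
    """query_out: (32, D) query outputs for one image-text pair.
    w: (D, 2), b: (2,) two-class linear head. Returns P(match)."""
    per_query_logits = query_out @ w + b       # (32, 2)
    logits = per_query_logits.mean(axis=0)     # average over the 32 queries
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[1]                            # index 1 = "match" (a convention chosen here)

def itm_loss(p_match, label):
    """Binary cross-entropy for one pair; label is 1 (match) or 0 (no match)."""
    p = np.clip(p_match, 1e-12, 1 - 1e-12)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

rng = np.random.default_rng(0)
query_out = rng.normal(size=(32, 768))         # stand-in for Q-Former outputs
w = rng.normal(scale=0.02, size=(768, 2))      # hypothetical classifier weights
b = np.zeros(2)

p = itm_score(query_out, w, b)
print(p, itm_loss(p, 1))
```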

Stage 2 — generative learning: Connect the trained Q-Former to a frozen LLM via a single fully-connected projection layer. The 32 query outputs are projected into the LLM’s embedding dimension and prepended to the text input sequence as 32 “soft visual prompt tokens.” The LLM sees a sequence starting with 32 opaque vectors followed by normal text.

STAGE 1: Representation Learning

  Frozen ViT                  Q-Former (188M, TRAINABLE)
  ┌─────────────────┐         ┌──────────────────────────┐
  │ 224×224 image   │         │  32 learnable queries    │
  │      ↓          │         │  (32 × 768 floats)       │
  │  ViT-L/14       │         │                          │
  │  257 patches ───┼────────→│  cross-attn to patches   │
  │  (257 × 1024)   │         │  self-attn to each other │
  └─────────────────┘         │  self-attn to text       │
                              │          ↓               │
                              │   Z: 32 × 768 outputs    │
                              └──────────┬───────────────┘
                                         │
                         ┌───────────────┼───────────────┐
                         ↓               ↓               ↓
                       [ITC]           [ITG]           [ITM]
                    align with       generate        match/no
                    text embed       caption         match?

STAGE 2: Generative Learning

  Frozen ViT → Q-Former → FC Projection → Frozen LLM
                                  ↓
                32 × 768 → 32 × d_LLM (e.g. 2560 for OPT-2.7B)
                                  ↓
               ["<vis_0> <vis_1> ... <vis_31> Question: What breed is this? Answer:"]
                └──────────────────┘ └───────────────────────────────────────────────┘
                  soft visual prompts             normal text tokens
                                  ↓
                              LLM generates: "Golden Retriever"

The math, translated:

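Stage 1 ITC loss: for each query output $z_i$ (one of the 32 vectors in $Z$) and the text CLS embedding $t$, compute the cosine similarity $\mathrm{sim}(z_i, t)$. The model takes the maximum similarity across all 32 queries as the image-text similarity score, then applies InfoNCE over a batch of $B$ pairs with temperature $\tau$:

$$s(I, T) = \max_{i \in \{1, \dots, 32\}} \mathrm{sim}(z_i, t), \qquad \mathcal{L}_{\text{i2t}} = -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp\big(s(I_b, T_b)/\tau\big)}{\sum_{b'=1}^{B} \exp\big(s(I_b, T_{b'})/\tau\big)}$$

As in CLIP-style training, the text-to-image term is defined symmetrically (normalizing over images instead of texts) and the two terms are averaged.

The max over queries means the query most relevant to the text is responsible for the contrastive signal, which encourages specialization across the 32 slots.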
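Here is one way the max-over-queries similarity and the InfoNCE loss could be written in NumPy. Treat it as a sketch under assumptions: the symmetric image-to-text / text-to-image averaging follows the usual CLIP-style recipe, the function name `itc_loss` is hypothetical, and the inputs are synthetic (one query per image is deliberately planted close to its matching text so the loss comes out small).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_loss(query_out, text_cls, tau=0.07):
    """query_out: (B, 32, D) query outputs; text_cls: (B, D) text embeddings."""
    z = l2_normalize(query_out)
    t = l2_normalize(text_cls)
    # sim[i, j, k] = cosine similarity of query k of image i with text j
    sim = np.einsum("ikd,jd->ijk", z, t)
    s = sim.max(axis=-1)                       # (B, B): best query per image-text pair
    logits = s / tau
    idx = np.arange(len(s))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # normalize over texts
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # normalize over images
    return 0.5 * (loss_i2t + loss_t2i), s

rng = np.random.default_rng(0)
B, Q, D = 4, 32, 768
text_cls = rng.normal(size=(B, D))
query_out = rng.normal(size=(B, Q, D))
# Synthetic setup: make query 5 of each image nearly parallel to its own text.
query_out[:, 5, :] = text_cls + 0.1 * rng.normal(size=(B, D))

loss, s = itc_loss(query_out, text_cls)
print(np.round(s, 2))
print(loss)
```

Only one of the 32 queries needs to match the text for the pair to score highly, which is what the max buys you.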

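Stage 2 prepending: if the LLM uses embedding dimension $d_{\text{LLM}}$, the projection is $H = ZW + b$, where $W \in \mathbb{R}^{768 \times d_{\text{LLM}}}$ and $b \in \mathbb{R}^{d_{\text{LLM}}}$. The LLM receives the input $[H; E_{\text{text}}]$: the 32 projected query outputs concatenated with the normal text token embeddings.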
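The projection-and-prepend step is small enough to sketch directly. The NumPy snippet below uses the OPT-2.7B width (2560) from the diagram; the weight `w`, bias `b`, the prompt length of 9, and the random `text_embeds` are all placeholders standing in for trained parameters and the frozen LLM's real token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, d_qformer, d_llm = 32, 768, 2560   # 2560 = OPT-2.7B embedding width

z = rng.normal(size=(num_queries, d_qformer))         # Q-Former output Z
w = rng.normal(scale=0.02, size=(d_qformer, d_llm))   # trainable FC weight (random here)
b = np.zeros(d_llm)                                   # trainable FC bias

text_len = 9                                          # hypothetical prompt length
text_embeds = rng.normal(size=(text_len, d_llm))      # stand-in for LLM token embeddings

visual_prompts = z @ w + b                            # (32, 2560) soft visual prompts
llm_inputs = np.concatenate([visual_prompts, text_embeds], axis=0)

n_fc_params = w.size + b.size
print(visual_prompts.shape)   # (32, 2560)
print(llm_inputs.shape)       # (41, 2560)
print(n_fc_params)            # 1968640
```

The fully-connected layer is tiny next to the Q-Former: about 2M parameters for this LLM width.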

Walkthrough with actual numbers:

Input: a 224×224 image fed to ViT-L/14 (patch size 14).

Step 1: ViT patch extraction
  224 / 14 = 16 → 16 × 16 = 256 patches + 1 CLS token = 257 total
  Each patch embedding: 1024 floats
  Total visual features: 257 × 1024 = 263,168 floats

Step 2: Q-Former compression
  32 query vectors, each 768 floats
  Total after Q-Former: 32 × 768 = 24,576 floats
  Compression ratio: 263,168 / 24,576 ≈ 10.7×

Step 3: Stage 1 ITC similarity (batch of 4, τ=0.07)
  Query-text similarities (max across 32 queries):
  S = [[0.85, 0.21, 0.18, 0.23],   ← image 1 aligns with text 1
       [0.19, 0.82, 0.20, 0.15],   ← image 2 aligns with text 2
       [0.22, 0.17, 0.79, 0.24],   ← image 3 aligns with text 3
       [0.20, 0.23, 0.16, 0.81]]   ← image 4 aligns with text 4

  S / τ (τ=0.07):
  [[12.1, 3.0, 2.6, 3.3],
   [2.7, 11.7, 2.9, 2.1],   ← diagonal entries dominate after /τ
   [3.1, 2.4, 11.3, 3.4],
   [2.9, 3.3, 2.3, 11.6]]

  Softmax of row 1: [0.9997, 0.0001, 0.0001, 0.0001]  ← near-perfect alignment
  ITC loss (row 1) ≈ -log(0.9997) ≈ 0.0003  (low: good alignment)

Step 4: Stage 2 projection to OPT-2.7B (d=2560)
  32 × 768 → 32 × 2560 via FC layer
  Prepend as first 32 "tokens" before text
  Total input to LLM: 32 (visual) + len(text) tokens
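The arithmetic in the walkthrough is easy to re-derive. The short NumPy script below recomputes the patch count, the compression ratio, and the row-wise softmax from the similarity matrix `S` given above. It is a checking aid, nothing more.

```python
import numpy as np

# Steps 1-2: token counts and compression
image_size, patch_size, d_vit = 224, 14, 1024
num_queries, d_q = 32, 768

grid = image_size // patch_size          # patches per side
num_tokens = grid * grid + 1             # patches plus the CLS token
vit_floats = num_tokens * d_vit
qformer_floats = num_queries * d_q
compression = vit_floats / qformer_floats

# Step 3: InfoNCE on the example similarity matrix
tau = 0.07
S = np.array([[0.85, 0.21, 0.18, 0.23],
              [0.19, 0.82, 0.20, 0.15],
              [0.22, 0.17, 0.79, 0.24],
              [0.20, 0.23, 0.16, 0.81]])
logits = S / tau
shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
row_losses = -np.log(np.diag(probs))     # image-to-text loss per row

print(num_tokens, vit_floats, qformer_floats, round(compression, 1))
print(np.round(logits, 1))
print(np.round(probs[0], 4))
print(row_losses[0])
```

With τ = 0.07, a similarity gap of about 0.6 becomes roughly 9 logits, so the softmax is sharply peaked on the diagonal.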

What’s clever — find the instinct:

The three Stage 1 objectives are not arbitrary choices. Each one closes off a shortcut that the others leave open. A network that only does ITC can learn a global “vibe” of the image and ignore fine details. But ITG (generation) needs fine details to produce accurate captions, and ITM needs discriminative features to detect subtle mismatches. By optimizing all three simultaneously, the queries are forced to encode a rich, language-grounded visual representation: no shortcut survives all three objectives.

The even deeper insight is the bottleneck itself:

“The size of Z (32×768) is much smaller than the size of frozen image features (e.g. 257×1024 for ViT-L/14). This bottleneck architecture works together with our pre-training objectives into forcing the queries to extract visual information that is most relevant to the text.”

It’s compression as a feature, not a bug. By forcing 257 rich patch embeddings through just 32 query slots, the Q-Former must make hard choices about what to keep. What survives the compression is whatever correlates with language — because that’s all the objectives care about.

“it effectively functions as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information. This reduces the burden of the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem.”

And one more key observation from the ablation:

“Without representation learning, both types of LLMs give substantially lower performance on zero-shot VQA. In particular, OPT suffers from catastrophic forgetting where performance drastically degrades as training proceeds.”

Stage 1 isn’t warm-up. It’s load-bearing. Without it, the queries produce arbitrary visual representations that confuse the frozen LLM. It hasn’t seen anything like them before, so the gradients flowing back through it start pulling the Q-Former in directions that destroy coherent visual grounding. Stage 1 is the alignment pre-work that makes Stage 2 possible.

Does it work? What breaks?

Model                        Zero-shot VQAv2   Trainable Params   Total Params
Flamingo80B                  56.3%             10.2B              80B
BLIP-2 ViT-g OPT-6.7B        54.3%             108M               7.8B
BLIP-2 ViT-g FlanT5-XL       63.1%             107M               4.1B
BLIP-2 ViT-g FlanT5-XXL      65.2%             108M               12.1B

BLIP-2 with FlanT5-XXL beats Flamingo80B by roughly 9 percentage points on zero-shot VQAv2 (65.2% vs 56.3% in the table above; the paper’s headline figure is 8.7 points) while training 54× fewer parameters (188M vs 10.2B). On image captioning (NoCaps, zero-shot), BLIP-2 scores CIDEr 119.7 vs 113.2 for BLIP (which trained 583M params end-to-end).

The paper also confirms a satisfying scaling property: “a stronger image encoder or a stronger LLM both lead to better performance.” BLIP-2 is genuinely modular — you can slot in better components and get better results without redesigning the bridge.

What doesn’t work:

No in-context learning. Give BLIP-2 three examples of visual question answering in the prompt, and performance doesn’t improve — unlike Flamingo, which was specifically designed for few-shot learning. The root cause is the pre-training data: BLIP-2 trains on single image-text pairs, so the Q-Former never learns correlations across multiple examples in a single context. Getting this right requires interleaved image-text sequences, which Flamingo uses (on a proprietary dataset) but BLIP-2 doesn’t.

The 32-token interface is also a commitment. The LLM receives exactly 32 visual tokens regardless of image content. A complex scene with dozens of objects gets the same budget as a plain white background. For tasks requiring fine-grained spatial reasoning or counting, this fixed-size bottleneck can be the limiting factor.

So what?

If you’re building a multimodal system, BLIP-2’s architecture is the right starting point. The recipe: take a frozen CLIP-scale vision encoder, take a frozen instruction-tuned LLM, and train only the Q-Former bridge on image-text pairs (the paper uses 129M). You get a capable VLM at a fraction of the compute. When a better LLM ships, swap it in and re-train the bridge; you don’t throw away the vision encoder training or the LLM pre-training.

The first experiment worth running: ablate the three Stage 1 objectives one by one. The paper shows all three (ITC, ITG, ITM) contribute, but the ITG loss improves even image-text retrieval (a task that doesn’t use generation at all), suggesting the generation objective is doing more than generating text. It’s forcing the queries to maintain causal structure in their visual representation, which turns out to be broadly useful.

This connects directly to what clip-learning-transferable-visual-models established: contrastive alignment of image and text produces powerful cross-modal representations. BLIP-2 takes that insight a step further — a frozen CLIP encoder is already so well-aligned to language that you don’t need to re-train it at all. You just need a smarter interface.

The modular principle — freeze the experts, train only the connector — became the dominant paradigm for efficient VLMs. llava-visual-instruction-tuning simplified it further to a single projection layer plus instruction fine-tuning, trading some zero-shot depth for dramatic simplicity. The Q-Former’s richer two-stage training produces better zero-shot transfer; LLaVA’s simpler approach produces better fine-tuned performance. Pick based on what your task requires.

BLIP-2 shows that you can give a frozen LLM eyes with 188M parameters of bridge training — no need to retrain 80B.

Citation

arXiv:2301.12597

Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. https://arxiv.org/abs/2301.12597