Concepts: multimodal-embeddings | vision-language-models | cross-attention | in-context-learning | perceiver-resampler
Builds on: clip-learning-transferable-visual-models | an-image-is-worth-16x16-words | language-models-are-few-shot-learners
Leads to: blip-2-bootstrapping-language-image-pretraining | llava-visual-instruction-tuning
The problem
GPT-3 showed something remarkable: show a large language model three examples of a task in the prompt — no gradient updates, no fine-tuning — and it can do the task. That in-context few-shot learning works. But GPT-3 only sees text. The prompt is a string. What if the task requires seeing images? What if you want to show the model three (image, caption) pairs and have it caption a fourth image? Nobody had cracked that.
The frustrating thing is that the pieces already exist. CLIP-style encoders can “see” images with impressive fidelity. Large language models can reason and generate text. But these two systems speak entirely different languages (CLIP produces a dense embedding vector, LLMs consume token sequences), and most earlier attempts to bridge them relied on expensive retraining of the whole system, which risks destroying the knowledge in both.
The core idea
Think about how you learned to read a magazine as a child. Magazines are sequences of images and text jumbled together: a photo, a caption, an article, another photo, an ad with text, a recipe with step-by-step images. After reading enough magazines, you’d internalized a pattern: images and text belong together, and a picture near a sentence usually relates to it. Now, shown three captioned images and asked to caption a fourth, you do it effortlessly. Your “few-shot learning” capability for vision tasks came from pre-exposure to interleaved image-text sequences at scale.
That’s Flamingo’s core bet. Train on 43 million web pages of naturally interleaved images and text — the internet’s equivalent of magazines — and the model will acquire the same capability. The architectural challenge is getting a frozen LLM to “see” images at all, without destroying the language understanding it took billions of training tokens to build.
Here’s how it works, piece by piece.
The Perceiver Resampler — turning pixels into tokens. Images from the web come at all sizes and resolutions. An LLM expects a fixed-length token sequence. Something has to bridge that. Flamingo uses a Perceiver Resampler: 64 learnable latent query vectors that cross-attend to the image’s spatial feature grid (from a frozen NFNet-F6 encoder pretrained with contrastive learning). Whatever size the image, whatever number of spatial patches, the Perceiver Resampler outputs exactly 64 visual tokens.
“It takes as input a variable number of image or video features from the vision encoder and produces a fixed number of visual outputs (64), reducing the computational complexity of the vision-text cross-attention.”
Think of it as a fixed-size summary: 64 slots, each learning to ask “what’s important about this image that I should store?” The 64 latent queries are randomly initialized and trained to distill whatever is visually relevant.
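The fixed-size-summary idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's module: it uses a single attention head, no learned projections, no feed-forward layers, and a toy feature width (`DIM = 8`). The names `resample`, `random_features`, and `DIM` are hypothetical. What it shows is only the shape behaviour: however many feature vectors go in, the number of outputs equals the number of latent queries.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def resample(features, latents):
    """Cross-attend fixed latent queries to a variable number of features.

    features: list of N vectors (N varies per image), each of length d
    latents:  list of K query vectors, each of length d
    returns:  list of K vectors, each of length d
    """
    d = len(latents[0])
    outputs = []
    for query in latents:
        weights = softmax([dot(query, f) / math.sqrt(d) for f in features])
        # Each output is a weighted average of the input features.
        outputs.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
        )
    return outputs


random.seed(0)
DIM, NUM_LATENTS = 8, 64  # DIM is a toy value; 64 latents follows the text
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]


def random_features(n):
    """Stand-in for a vision encoder's output: n random feature vectors."""
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


small = resample(random_features(100), latents)  # e.g. a 10x10 feature grid
large = resample(random_features(400), latents)  # e.g. a 20x20 feature grid
print(len(small), len(large))  # both are 64
```

In a trained model the latents would be learned parameters rather than random draws; here they are random only so the snippet runs on its own.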
Gated cross-attention-dense layers — giving the LM a visual sense organ. The frozen Chinchilla LM is left completely untouched. Instead, Flamingo inserts new trainable layers between the frozen layers: gated cross-attention blocks (Figure 4 in the paper) where the keys and values come from the 64 visual tokens and the queries come from the language tokens. The LM can now “glance” at the current image at multiple depths in its computation.
The clever safety mechanism is the tanh gate:
$$\text{output} = x_{\text{LM}} + \tanh(\alpha) \cdot \text{CrossAttn}(x_{\text{LM}}, \text{visual tokens})$$
where $\alpha$ is a learnable scalar initialized to 0. Since $\tanh(0) = 0$, at initialization the cross-attention output is multiplied by zero: the visual pathway contributes nothing and the model behaves exactly like the original LM. Training then gradually opens the gates as $\alpha$ moves away from zero.
“Thus, at initialization, the model output matches that of the pretrained LM, improving training stability and final performance.”
Without this, early training is chaotic: the LM gets random garbage from the cross-attention layers and its gradients try to compensate, pulling the frozen-LM outputs in contradictory directions. Zero-initialization sidesteps this entirely.
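A small sketch makes the zero-initialization argument concrete. The function `gated_residual` and the example vectors are hypothetical stand-ins (a real layer operates on tensors and also gates a feed-forward path); the value 0.8 is only an illustrative post-training gate parameter.

```python
import math


def gated_residual(x, attn_out, alpha):
    """Return x + tanh(alpha) * attn_out, element-wise."""
    gate = math.tanh(alpha)
    return [xi + gate * ai for xi, ai in zip(x, attn_out)]


x_lm = [0.5, -1.0, 2.0]       # hypothetical LM hidden state
attn_out = [10.0, -7.0, 3.0]  # hypothetical untrained cross-attention output

at_init = gated_residual(x_lm, attn_out, alpha=0.0)
after_training = gated_residual(x_lm, attn_out, alpha=0.8)

print(at_init == x_lm)           # True: the visual path contributes nothing
print(round(math.tanh(0.8), 3))  # 0.664
```

However large or noisy `attn_out` is, the gate at `alpha = 0.0` removes it, so the frozen LM's behaviour is preserved at the start of training.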
Per-image attention masking — the secret to few-shot scaling. In a few-shot prompt with 4 image-text pairs, text token at position 500 shouldn’t cross-attend to image 1’s visual tokens — it should attend to the image immediately preceding it in the sequence. Flamingo enforces this with a masking scheme:
“At a given text token, the model attends to the visual tokens of the image that appeared just before it in the interleaved sequence, rather than to all previous images.”
This turns out to be crucial for generalization. During training, Flamingo only sees sequences with up to 5 images. But at inference, you can use 32 shots, because the cross-attention at each text token always looks at exactly one image’s 64 tokens (fixed cost), while the LM’s self-attention handles the accumulating context of previous image-caption pairs, whose text tokens already carry what they read from their own images. Visual cross-attention cost is constant per token; only self-attention scales with shots.
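The masking rule can be sketched as a function that maps every position in an interleaved sequence to the most recent image and then builds a boolean mask from that mapping. The `<image>` placeholder, the function names, and the toy token list are all assumptions for illustration. Letting the placeholder position attend to its own image is also a choice made here, not something the text specifies.

```python
IMAGE = "<image>"  # hypothetical placeholder marking where an image sits


def preceding_image_index(tokens):
    """For each position, return the index of the most recent image (or None)."""
    current = None
    seen = 0
    result = []
    for tok in tokens:
        if tok == IMAGE:
            current = seen
            seen += 1
        result.append(current)
    return result


def cross_attention_mask(tokens, tokens_per_image=64):
    """mask[t][v] is True when position t may attend to visual token v."""
    owner = preceding_image_index(tokens)
    num_images = sum(tok == IMAGE for tok in tokens)
    width = num_images * tokens_per_image
    mask = []
    for idx in owner:
        row = [False] * width
        if idx is not None:
            start = idx * tokens_per_image
            row[start:start + tokens_per_image] = [True] * tokens_per_image
        mask.append(row)
    return mask


tokens = [IMAGE, "Q:", "sky?", "A:", "blue.",
          IMAGE, "Q:", "animal?", "A:", "dog."]
mask = cross_attention_mask(tokens)

print(preceding_image_index(tokens))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print({sum(row) for row in mask})     # {64}: one image's tokens per position
```

Each row of the mask allows exactly 64 visual tokens no matter how many images the prompt contains, which is why the cross-attention cost per text token stays flat as shots are added.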
FEW-SHOT PROMPT STRUCTURE (4-shot VQA)
[<Image 1>] Q: What color is the sky? A: blue.
[<Image 2>] Q: What animal is shown? A: dog.
[<Image 3>] Q: How many people? A: three.
[<Image 4>] Q: What sport is being played? A: ___
CROSS-ATTENTION MASKING:
Text tokens after Image 1 ──cross-attn──▶ Image 1's 64 tokens
Text tokens after Image 2 ──cross-attn──▶ Image 2's 64 tokens
Text tokens after Image 3 ──cross-attn──▶ Image 3's 64 tokens
Text tokens after Image 4 ──cross-attn──▶ Image 4's 64 tokens
(Prior images accessed only via LM self-attention over accumulated text)
PERCEIVER RESAMPLER:
320×320 image ──NFNet-F6──▶ 10×10 spatial grid = 100 patch features
(each: 3072 floats)
100 × 3072 ──Perceiver Resampler──▶ 64 × 3072 visual tokens
Compression: 100 patches → 64 latent summaries (by learned cross-attention)
GATED XATTN-DENSE (at initialization):
α = 0
tanh(0) = 0.0
output = x_LM + 0.0 × CrossAttn(x_LM, visual_tokens)
= x_LM ← pure LM output, no visual influence
After training (α = 0.8):
tanh(0.8) = 0.664
output = x_LM + 0.664 × CrossAttn(x_LM, visual_tokens)
← visual and language information now blended
What’s clever: find the instinct. The instinct is this: GPT-3’s few-shot learning works because it was trained on sequences where the same format appears many times: question followed by answer, sentence followed by sentence, example followed by example. The model learned structure by seeing structure. Flamingo’s key insight is that the same logic applies to visual tasks, if and only if the training data contains interleaved image-text sequences with the same structural regularity. Every web page where a photo appears next to its caption is one training signal. Forty-three million web pages of these signals build the associative machinery needed for few-shot visual prompting.
The ablation makes this concrete:
“removing the interleaved image-text dataset M3W leads to a decrease of more than 17% in performance while removing the conventional paired image-text pairs also decreases performance (by 9.8%)”
Paired image-text (ALIGN, LTIP) teaches image-language alignment. But M3W — the messy interleaved web data — teaches in-context structure. Both are necessary. Neither alone is sufficient.
Training data mixture (three ingredients):
- M3W (MultiModal MassiveWeb): ~43M webpages, images inserted at their DOM positions, up to 5 images per 256-token window. Teaches interleaved structure.
- ALIGN + LTIP: 1.8B + 312M image-text pairs. Teaches image-language alignment.
- VTP: 27M short video-text pairs. Teaches temporal visual understanding.
Training objective: weighted negative log-likelihood of text, conditioned on visual inputs — summed across all three dataset mixtures simultaneously, with gradients accumulated (not round-robin, which the ablation shows hurts by 4-8 points).
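The objective described above can be written compactly. The symbol names below are a notational choice for this note: $\mathcal{D}_m$ is the $m$-th dataset, $\lambda_m$ its weight, $y_\ell$ the $\ell$-th text token, and $x_{\le \ell}$ the images or videos that precede that token in the interleaved sequence.

```latex
\mathcal{L} \;=\; \sum_{m=1}^{M} \lambda_m \,
  \mathbb{E}_{(x,\, y) \sim \mathcal{D}_m}
  \left[ \, -\sum_{\ell=1}^{L}
    \log p\!\left( y_\ell \mid y_{<\ell},\; x_{\le \ell} \right) \right]
```

Accumulating gradients means one parameter update uses the gradient of the whole weighted sum over $m$, rather than taking a separate step for each dataset in turn.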
Does it actually work? What breaks?
| Model | VQAv2 0-shot | VQAv2 32-shot | OKVQA 32-shot | Trainable params |
|---|---|---|---|---|
| Flamingo-3B | 49.2% | 57.1% | 45.9% | ~1.5B |
| Flamingo-9B | 51.8% | 60.4% | 51.0% | ~2.2B |
| Flamingo-80B | 56.3% | 67.6% | 57.8% | 10.2B |
| Fine-tuned SOTA | — | 80.2% (444K examples) | 54.4% (10K examples) | — |
Flamingo-80B with 32 shots beats the fine-tuned SOTA on OKVQA (57.8% vs 54.4%) using roughly 300× fewer labeled examples (32 versus 10K). On 6 of 16 benchmarks, 32-shot Flamingo outperforms models that trained on thousands of labeled examples. This was the first time a model demonstrated this pattern reliably across diverse vision-language tasks.
Performance scales cleanly with both model size and number of shots: larger models extract more value from each additional example, though the gain from each added shot shrinks as the count approaches 32 (for Flamingo-80B on OKVQA, 4 shots already reach 57.4% against 57.8% at 32).
What doesn’t work:
Zero-shot ImageNet classification. CLIP-style contrastive models are purpose-built for retrieval/classification — Flamingo’s generative approach lags there. The architectures optimize for different things: Flamingo generates the best text given an image; CLIP finds the best image-text match.
“in-context learning is known to be highly sensitive to various aspects of the demonstrations”
Prompt sensitivity is real. The order of few-shot examples, their diversity, even their phrasing affects results significantly. Flamingo offers no principled way to select or order demonstrations — unlike gradient-based fine-tuning, which is deterministic and data-efficient once you have enough examples.
Inference compute grows with the number of shots, because self-attention runs over the growing context, and this limits how many examples are practical. The approach doesn’t gracefully bridge to the hundreds-of-examples regime.
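A back-of-the-envelope count shows where the growth comes from. The sketch assumes every shot contributes one image and the same number of text tokens (20 here, a made-up figure), and it only counts the keys that the final text token attends to. The function name is hypothetical.

```python
def keys_attended_by_last_token(num_shots, text_tokens_per_shot,
                                visual_tokens_per_image=64):
    """Count the keys the final text token attends to, per attention type.

    Assumes one image and a fixed number of text tokens per shot.
    """
    self_attn_keys = num_shots * text_tokens_per_shot  # grows with the prompt
    cross_attn_keys = visual_tokens_per_image          # only the latest image
    return self_attn_keys, cross_attn_keys


for shots in (4, 8, 16, 32):
    print(shots, keys_attended_by_last_token(shots, text_tokens_per_shot=20))
```

The cross-attention count stays at 64 while the self-attention count grows in proportion to the number of shots, so the language model's context length is what ends up bounding the prompt.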
Hallucination: Flamingo inherits LM hallucination directly. The frozen LM’s tendency to generate plausible-but-wrong text doesn’t disappear; it just gets conditioned on visual context.
So what?
If you’re building a vision-language system today, you’re almost certainly not using Flamingo directly — you’re using one of its successors (blip-2-bootstrapping-language-image-pretraining, llava-visual-instruction-tuning) that learned from its architecture and training recipe but made it far more compute-efficient. BLIP-2’s Q-Former compresses visual features more aggressively (32 tokens vs 64) and separates vision-language training into two stages. LLaVA simplifies the bridge to a single projection layer and adds instruction tuning. Both trade some zero-shot depth for dramatic parameter efficiency.
The experiment worth running first: if you have a small labeled dataset for a visual task (50-200 examples), compare 32-shot Flamingo-style prompting against fine-tuning a smaller vision model. Flamingo’s results suggest that with 32 or fewer examples, in-context learning is often competitive and requires zero gradient updates, making it the right choice when you need fast prototyping or when your labeled set is tiny and precious.
Flamingo also establishes something important for the field: the in-context learning paradigm that language-models-are-few-shot-learners demonstrated for text generalizes to vision, but only if the training data contains interleaved multi-modal sequences — not just paired images and captions. The structure of the training data is as important as the architecture.
Flamingo installs eyes in a frozen language model by routing images through a Perceiver Resampler to 64 tokens and piping them into gated cross-attention layers — trained on 43M web pages of interleaved images and text to acquire few-shot visual learning from in-context examples alone.
Connections
- multimodal-embeddings — Perceiver Resampler produces 64 visual token embeddings that live in a space the LM cross-attention can query
- vision-language-models — Flamingo establishes the freeze-then-bridge pattern and the interleaved training data paradigm for VLMs
- cross-attention — gated xattn-dense layers are the mechanism by which frozen LM layers access visual tokens
- in-context-learning — Flamingo extends GPT-3-style few-shot prompting to multimodal (image + text) prompts
- contrastive-learning — NFNet-F6 vision encoder pretrained with CLIP-style contrastive objective
- perceiver-resampler — the module that compresses variable-size visual feature maps to a fixed 64 tokens
- clip-learning-transferable-visual-models — contrastively pretrained vision encoder is the perceptual backbone Flamingo builds on
- language-models-are-few-shot-learners — the GPT-3 paper establishing few-shot learning; Flamingo directly extends this paradigm to vision
- blip-2-bootstrapping-language-image-pretraining — subsequent work that replaces the Perceiver Resampler with a Q-Former trained in two explicit stages, beating Flamingo-80B with 54× fewer trainable parameters
- llava-visual-instruction-tuning — further simplification: single projection layer + instruction fine-tuning, optimizing for fine-tuned rather than zero-shot performance
Citation
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., & Simonyan, K. (2022). Flamingo: A Visual Language Model for Few-Shot Learning. NeurIPS 2022. https://arxiv.org/abs/2204.14198