What It Is
A module that compresses a variable-length visual feature map into a fixed number of output tokens using learned latent queries. Regardless of input image size or resolution, the Perceiver Resampler outputs exactly N tokens (64 in Flamingo) — giving downstream language models a consistent, bounded visual representation to attend to.
Why It Matters
Language models work within a bounded context window, and attention cost grows with sequence length. Images come at wildly different resolutions: a thumbnail might be 32×32 pixels, a high-res photo 3000×2000. If every spatial patch were fed directly to the LM, the number of visual tokens would balloon and vary unpredictably from one image to the next. The Perceiver Resampler solves this: fixed output size, any input size. It also acts as a learned compression step, so the LM sees a compact summary of the most useful visual features rather than every patch feature.
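To make the scaling problem concrete, here is a rough back-of-the-envelope sketch. It assumes a ViT-style encoder that cuts the image into non-overlapping 16×16 patches at native resolution; the patch size and the helper name `num_patch_tokens` are illustrative assumptions, not details from the Flamingo paper.

```python
def num_patch_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Count non-overlapping patches (assumed ViT-style patching, partial edge patches dropped)."""
    return (height // patch_size) * (width // patch_size)


NUM_LATENTS = 64  # resampler output size used in Flamingo

for h, w in [(32, 32), (224, 224), (2000, 3000)]:
    raw = num_patch_tokens(h, w)
    print(f"{h}x{w}: {raw} patch tokens without resampling, {NUM_LATENTS} with")
```

Under these assumptions the raw token count ranges from 4 to over 23,000, while the resampled count stays at 64.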
How It Works
N learnable latent query vectors (randomly initialized, trained end-to-end) cross-attend to the spatial feature grid produced by the vision encoder. The queries ask: “given this image, what should I store in each of my N slots?” The key-value pairs come from the image features; the queries are learned parameters shared across all images.
Input image (any resolution)
↓
Vision encoder (NFNet-F6 in Flamingo; a ViT in other setups)
↓
Spatial feature grid: H×W patches, each D-dimensional
↓
Perceiver Resampler
  queries:      64 learnable latents (shared across all images)
  keys/values:  H×W visual features
  64 queries ─────cross-attn──→ H×W visual features
↓
64 × D visual token outputs ← fixed size, always
↓
Fed as keys/values to gated cross-attention layers in the LM
The mechanism is borrowed from the Perceiver architecture (Jaegle et al. 2021) and DETR’s object query idea — learned latent vectors that aggregate information from a larger feature set via cross-attention.
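The core cross-attention step can be sketched in a few lines of NumPy. This is a minimal single-head, single-layer illustration, not Flamingo's implementation: the class name `ResamplerSketch`, the feature width of 32, and the random (untrained) weights are all assumptions made for the example. It leaves out multi-head attention, layer normalization, the feed-forward blocks, the stacking of several layers, and the detail that Flamingo also concatenates the latents onto the keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ResamplerSketch:
    """Single-head cross-attention from N latent queries to a feature grid (illustrative only)."""

    def __init__(self, num_latents=64, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.dim = dim
        # Latent queries: parameters shared across all images (random here, learned in practice).
        self.latents = rng.normal(size=(num_latents, dim))
        self.w_q = rng.normal(size=(dim, dim)) * scale
        self.w_k = rng.normal(size=(dim, dim)) * scale
        self.w_v = rng.normal(size=(dim, dim)) * scale

    def __call__(self, features):
        # features: (H*W, dim) flattened grid from the vision encoder.
        q = self.latents @ self.w_q                       # (N, dim)
        k = features @ self.w_k                           # (H*W, dim)
        v = features @ self.w_v                           # (H*W, dim)
        attn = softmax(q @ k.T / np.sqrt(self.dim))       # (N, H*W), each row sums to 1
        return self.latents + attn @ v                    # residual update, (N, dim)


resampler = ResamplerSketch()
small = np.random.default_rng(1).normal(size=(4 * 4, 32))      # 16 patch features
large = np.random.default_rng(2).normal(size=(40 * 60, 32))    # 2400 patch features
print(resampler(small).shape, resampler(large).shape)           # (64, 32) (64, 32)
```

The output shape depends only on the number of latents, because the input length appears solely in the attention matrix's second axis, which is summed out when the weights are applied to the values.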
Key Sources
- flamingo-visual-language-model-few-shot-learning — introduced the Perceiver Resampler as the compression module between NFNet vision encoder and frozen Chinchilla LM; produces 64 tokens; ablations show it outperforms plain MLP and Transformer alternatives
Related Concepts
- cross-attention — the mechanism the Perceiver Resampler uses internally (queries attend to visual features)
- multimodal-embeddings — output tokens are the visual embeddings the LM cross-attends to
- vision-language-models — Perceiver Resampler is a key component in Flamingo-style VLMs
- patch-embeddings — the visual features the Perceiver Resampler takes as input