What It Is

A module that compresses a variable-length visual feature map into a fixed number of output tokens using learned latent queries. Regardless of input image size or resolution, the Perceiver Resampler outputs exactly N tokens (64 in Flamingo) — giving downstream language models a consistent, bounded visual representation to attend to.

Why It Matters

Language models have a finite context window, and attention cost grows with sequence length. Images come at wildly different resolutions: a thumbnail may be 32×32 pixels, a high-res photo 3000×2000. If every spatial patch were fed directly to the LM, the number of visual tokens would vary unpredictably and could consume most of the context. The Perceiver Resampler solves this by producing a fixed-size output for any input size. It also acts as a learned compression step: the queries are trained to extract the features most useful to the LM rather than passing every patch through.
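To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch. It assumes a ViT-style encoder that tiles the image into non-overlapping 16×16 patches; the patch size and the helper name num_patches are illustrative choices, not values from Flamingo (whose NFNet encoder is convolutional and has its own output stride).

```python
import math

def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Count the patch x patch tiles needed to cover an image (assumed tiling)."""
    return math.ceil(height / patch) * math.ceil(width / patch)

# Thumbnail, a common training resolution, and a high-res photo.
for h, w in [(32, 32), (224, 224), (2000, 3000)]:
    print(f"{w}x{h}: {num_patches(h, w)} patches -> 64 resampled tokens")
```

Under these assumptions the patch count ranges from 4 to 23,500, while the resampler output stays at 64 tokens in every case.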

How It Works

N learnable latent query vectors (randomly initialized, trained end-to-end) cross-attend to the spatial feature grid produced by the vision encoder. The queries ask: “given this image, what should I store in each of my N slots?” The key-value pairs come from the image features; the queries are learned parameters shared across all images.

Input image (any resolution)
        ↓
Vision encoder (NFNet-F6 in Flamingo; a ViT in other models)
        ↓
Spatial feature grid: H×W patches, each D-dimensional
        ↓
Perceiver Resampler
  64 learnable queries ─────cross-attn──→ H×W visual features
  (shared across all images)              (as keys, values)
        ↓
64 × D visual token outputs  ← fixed size, always
        ↓
Fed as keys/values to gated cross-attention layers in the LM

The mechanism is borrowed from the Perceiver architecture (Jaegle et al. 2021) and DETR’s object query idea — learned latent vectors that aggregate information from a larger feature set via cross-attention.
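The data flow described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not Flamingo's implementation: it uses a single attention head and a single layer, random placeholders stand in for trained weights, and the class name ResamplerSketch and the feature dimension of 32 are made up for the example. A full implementation would typically add multiple heads, feed-forward blocks, normalization, and several stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ResamplerSketch:
    """Single-head, single-layer cross-attention from N latents to image features."""

    def __init__(self, num_latents=64, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        # These would be learned parameters in a real model; random placeholders here.
        self.latents = rng.normal(size=(num_latents, dim))  # shared across all images
        self.w_q = rng.normal(size=(dim, dim)) * scale
        self.w_k = rng.normal(size=(dim, dim)) * scale
        self.w_v = rng.normal(size=(dim, dim)) * scale
        self.dim = dim

    def __call__(self, features):
        # features: (H*W, dim) flattened spatial grid; H*W may differ per image.
        q = self.latents @ self.w_q                   # (N, dim) queries from latents
        k = features @ self.w_k                       # (H*W, dim) keys from the image
        v = features @ self.w_v                       # (H*W, dim) values from the image
        attn = softmax(q @ k.T / np.sqrt(self.dim))   # (N, H*W), each row sums to 1
        return self.latents + attn @ v                # (N, dim) residual update

# Two feature grids of very different sizes yield the same output shape.
resampler = ResamplerSketch()
rng = np.random.default_rng(1)
small = resampler(rng.normal(size=(14 * 14, 32)))
large = resampler(rng.normal(size=(60 * 90, 32)))
print(small.shape, large.shape)  # (64, 32) (64, 32)
```

The output shape depends only on the number of latents and the feature dimension, because the image features appear only as keys and values; the softmax is taken over the H×W axis, which is then summed away.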

Key Sources

  • flamingo-visual-language-model-few-shot-learning — introduced the Perceiver Resampler as the compression module between NFNet vision encoder and frozen Chinchilla LM; produces 64 tokens; ablations show it outperforms plain MLP and Transformer alternatives
  • cross-attention — the mechanism the Perceiver Resampler uses internally (queries attend to visual features)
  • multimodal-embeddings — output tokens are the visual embeddings the LM cross-attends to
  • vision-language-models — Perceiver Resampler is a key component in Flamingo-style VLMs
  • patch-embeddings — the visual features the Perceiver Resampler takes as input