Perceiver Resampler

What It Is

A module that compresses a variable-length visual feature map into a fixed number of output tokens using learned latent queries. Regardless of input image size or resolution, the Perceiver Resampler outputs exactly N tokens (64 in Flamingo) — giving downstream language models a consistent, bounded visual representation to attend to.

Why It Matters

Language models have a bounded context window, and attention cost grows with sequence length. Images come at wildly different resolutions: a thumbnail might be 32×32 pixels, a high-res photo 3000×2000. If every spatial patch were fed directly to the LM, the number of visual tokens would grow with resolution and vary unpredictably from image to image. The Perceiver Resampler solves this: fixed output size, any input size. It also acts as a learned compression, trained to keep the features the LM needs rather than passing every patch through.
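To make the scaling problem concrete, here is a small arithmetic sketch. It assumes a ViT-style encoder with non-overlapping 14×14 patches; the patch size, the example resolutions, and the helper name `num_patches` are illustrative assumptions, not values from Flamingo (whose NFNet encoder produces a feature grid rather than patches). Only the 64-token output size comes from the text above.

```python
def num_patches(height, width, patch=14):
    """Count non-overlapping patch tokens for an image.

    Hypothetical ViT-style patching; the patch size is an assumption.
    """
    return (height // patch) * (width // patch)


NUM_LATENTS = 64  # fixed output size used in Flamingo

# Token count before resampling grows with resolution; after resampling it does not.
for h, w in [(224, 224), (448, 448), (896, 1344)]:
    print(f"{h}x{w}: {num_patches(h, w)} patch tokens -> {NUM_LATENTS} resampled tokens")
```

Under these assumptions, doubling each side of the image quadruples the number of patch tokens, while the resampled length stays at 64.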

How It Works

N learnable latent query vectors (randomly initialized, trained end-to-end) cross-attend to the spatial feature grid produced by the vision encoder. The queries ask: “given this image, what should I store in each of my N slots?” The keys and values are computed from the image features (in Flamingo, from the image features concatenated with the latents themselves); the queries are learned parameters shared across all images.

Input image (any resolution)
        ↓
Vision encoder (NFNet-F6 / ViT)
        ↓
Spatial feature grid: H×W patches, each D-dimensional
        ↓
Perceiver Resampler
  64 learnable queries ─────cross-attn──→ H×W visual features
  (shared across all images)              (as keys, values)
        ↓
64 × D visual token outputs  ← fixed size, always
        ↓
Fed as keys/values to gated cross-attention layers in the LM
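The core of the diagram above can be sketched in a few lines of NumPy. This is a minimal illustration, not Flamingo's implementation: it uses a single attention head and a single layer, random weights in place of trained ones, and it omits the feed-forward blocks, residual connections, and the latent concatenation into keys and values. The class name `TinyResampler` and the feature width of 32 are hypothetical choices for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TinyResampler:
    """Single-head, single-layer cross-attention with latent queries (illustrative only)."""

    def __init__(self, num_latents=64, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        # Latent queries: learned in a real model, random here. Shared across all images.
        self.latents = rng.normal(size=(num_latents, dim))
        self.w_q = rng.normal(size=(dim, dim)) * scale
        self.w_k = rng.normal(size=(dim, dim)) * scale
        self.w_v = rng.normal(size=(dim, dim)) * scale

    def __call__(self, features):
        # features: (H*W, dim), the flattened spatial grid from the vision encoder
        q = self.latents @ self.w_q                      # (num_latents, dim)
        k = features @ self.w_k                          # (H*W, dim)
        v = features @ self.w_v                          # (H*W, dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (num_latents, H*W)
        return attn @ v                                  # (num_latents, dim)


resampler = TinyResampler()
rng = np.random.default_rng(1)
small = rng.normal(size=(7 * 7, 32))     # a 7x7 feature grid
large = rng.normal(size=(24 * 32, 32))   # a 24x32 feature grid
out_small = resampler(small)
out_large = resampler(large)
print(out_small.shape, out_large.shape)
```

Both calls return an array of shape (64, 32). The output length is set by the number of latent queries, so the size of the input grid only changes the width of the attention matrix, not the number of output tokens.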

The mechanism is borrowed from the Perceiver architecture (Jaegle et al. 2021) and DETR’s object query idea — learned latent vectors that aggregate information from a larger feature set via cross-attention.

Key Sources

flamingo-visual-language-model-few-shot-learning — introduced the Perceiver Resampler as the compression module between NFNet vision encoder and frozen Chinchilla LM; produces 64 tokens; ablations show it outperforms plain MLP and Transformer alternatives

Related Concepts

cross-attention — the mechanism the Perceiver Resampler uses internally (queries attend to visual features)
multimodal-embeddings — output tokens are the visual embeddings the LM cross-attends to
vision-language-models — Perceiver Resampler is a key component in Flamingo-style VLMs
patch-embeddings — the visual features the Perceiver Resampler takes as input
