What It Is

Visual grounding is the task of localizing one or more regions in an image given a natural language expression (e.g., “the red car on the left”). It includes referring expression comprehension (REC), open-vocabulary detection, and open-vocabulary segmentation, and is a core capability for vision-language models operating in the physical world.

Why It Matters

Visual grounding connects language semantics to spatial perception. It is the bridge between what a model “understands” linguistically and what it can locate and act on in an image. Compositional grounding — handling attributes, spatial constraints, relational context, and OCR cues — is far harder than simple object class lookup and distinguishes genuinely capable VLMs from pattern-matching systems.

How It Works

Grounding approaches vary by architecture:

  • Modular pipelines: a frozen vision backbone extracts features; a separate language-guided decoder produces bounding boxes or masks via cross-attention and Hungarian matching
  • Early-fusion Transformers: image and text tokens share a single backbone; task tokens predict geometry and mask embeddings sequentially (e.g., Chain-of-Perception: <coord> → <size> → <seg>)
  • Query-based detectors (e.g., DETR variants): a fixed set of learned object queries cross-attend to image features; Hungarian matching assigns predictions to ground truth
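The matching step shared by the modular and query-based approaches above can be sketched with a toy implementation. A minimal sketch, with illustrative assumptions: boxes are `(x1, y1, x2, y2)` tuples, the cost is `1 - IoU`, and a brute-force search over assignments stands in for the Hungarian algorithm (real systems use an O(n³) solver such as `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match(preds, gts):
    """One-to-one assignment of predictions to ground-truth boxes that
    minimizes total (1 - IoU) cost. Brute force over permutations is
    only viable for small N; it is a stand-in for Hungarian matching."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(1.0 - iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)  # best[g] = index of the prediction matched to gt g

preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
gts = [(21, 19, 31, 29), (1, 0, 11, 10)]
print(match(preds, gts))  # → [1, 0]: each gt pairs with its overlapping pred
```

Unmatched queries (when there are more predictions than ground-truth boxes) are typically supervised toward a "no object" class, which is what lets a fixed set of queries handle a variable number of targets.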

Open-vocabulary grounding generalizes beyond fixed class lists to arbitrary natural language prompts. This requires aligning visual features with text embeddings in a shared space, typically learned via contrastive image-text pretraining, so that a region can be scored against any prompt at inference time.
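The scoring side of that alignment can be sketched as follows. A minimal sketch under stated assumptions: the vectors stand in for region features and text embeddings already projected into a shared space by a trained model, and the function names (`cosine`, `ground`) are illustrative, not any library's API:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ground(region_feats, prompt_embs, prompts):
    """Score each region against every text prompt and keep the best.
    Assumes region features and prompt embeddings live in the same
    (hypothetical) shared embedding space."""
    out = []
    for feat in region_feats:
        scores = [cosine(feat, t) for t in prompt_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        out.append((prompts[best], scores[best]))
    return out

# Toy vectors: region 0 aligns with "red car", region 1 with "dog".
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1]]
print(ground(regions, embs, ["red car", "dog"]))
```

Because the prompt list is supplied at inference time rather than baked into a classifier head, the same scoring loop handles categories the model never saw as explicit training labels.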

Key Sources

Open Questions

  • How do grounding models scale with dataset size versus model size?
  • Can grounding be made robust to ambiguous or underspecified prompts?
  • What is the right evaluation protocol for compositional grounding beyond RefCOCO?