What It Is

Open-vocabulary segmentation is the task of producing pixel-level masks for objects specified by arbitrary natural language prompts, without restriction to a fixed set of predefined classes. It combines the generalization of open-vocabulary detection with the spatial precision of instance segmentation.
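Concretely, the task can be viewed as a function from an image plus free-form text prompts to per-prompt binary masks (or "absent"). The stub below is an illustrative contract only, not any specific library's API:

```python
import numpy as np

def open_vocab_segment(image: np.ndarray, prompts: list[str]) -> dict:
    """Illustrative interface: map each free-form prompt to a binary
    mask over the image, or None when the object is absent.
    A real model would run image/text encoders here; this stub only
    fixes the input/output contract."""
    h, w, _ = image.shape
    return {p: None for p in prompts}  # placeholder "absent" predictions

preds = open_vocab_segment(np.zeros((4, 4, 3)),
                           ["blue car", "person holding umbrella"])
```

The key point of the contract is that the prompt set is unbounded: any string is a valid query, unlike a closed-vocabulary model's fixed class list.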

Why It Matters

Closed-vocabulary segmentation models (e.g., Mask R-CNN, Panoptic FPN) are limited to the classes seen during training. Open-vocabulary segmentation enables grounding to any concept expressible in language — including attributes (“blue car”), relations (“person holding umbrella”), and OCR-guided cues (“Diet Coke bottle”). This is essential for robotics, document understanding, and interactive vision systems.

How It Works

Open-vocabulary segmentation models typically:

  1. Encode the image into feature maps (via CNN, ViT, or early-fusion Transformer)
  2. Encode the text prompt and compute similarity or cross-attention between text and image features
  3. Produce instance masks, either via:
    • Dot-product decoding: the dot product between a learned query embedding and upsampled per-pixel image features yields mask logits, which are thresholded into binary masks
    • Fixed-query decoders (e.g., Mask2Former): a fixed set of object queries cross-attends to image features; Hungarian matching assigns queries to ground-truth instances during training
    • Autoregressive interfaces (e.g., Chain-of-Perception): the model predicts <coord> → <size> → <seg> tokens sequentially, then decodes masks from the seg-token embeddings

A key challenge is presence calibration: the model must reliably output “absent” for objects not in the scene, not just produce masks whenever prompted.
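One simple way to realize presence calibration is to gate each mask on a scalar presence score. The sketch below uses the max per-pixel probability as that score; this is an assumed heuristic for illustration (real systems often train a dedicated presence head):

```python
import numpy as np

def masks_with_presence(mask_probs, mask_tau=0.5, presence_tau=0.6):
    """For each query, emit a mask only when a presence score clears a
    threshold; otherwise report the object as absent.
    Presence score = max per-pixel probability (assumed heuristic)."""
    out = []
    for probs in mask_probs:                 # probs: (H, W) soft mask
        if float(probs.max()) < presence_tau:
            out.append(None)                 # calibrated "absent"
        else:
            out.append(probs > mask_tau)     # binary mask
    return out

soft = np.stack([
    np.full((4, 4), 0.1),                    # prompt 1: nothing in scene
    np.where(np.eye(4) > 0, 0.9, 0.1),       # prompt 2: diagonal object
])
preds = masks_with_presence(soft)            # [None, 4x4 bool mask]
```

The trade-off mentioned below is visible here: raising presence_tau suppresses spurious masks but also rejects low-confidence true positives, which is why calibration and recall must be balanced.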

Key Sources

Open Questions

  • How to improve presence calibration (MCC) without sacrificing recall?
  • What is the right benchmark design for open-vocabulary segmentation? (RefCOCO is saturated; PBench is one alternative)
  • Can open-vocabulary segmentation scale to video and 3D point clouds with the same architectures?