What It Is
Open-vocabulary segmentation is the task of producing pixel-level masks for objects specified by arbitrary natural language prompts, without restriction to a fixed set of predefined classes. It combines the generalization of open-vocabulary detection with the spatial precision of instance segmentation.
Why It Matters
Closed-vocabulary segmentation models (e.g., Mask R-CNN, Panoptic FPN) are limited to the classes seen during training. Open-vocabulary segmentation enables grounding to any concept expressible in language — including attributes (“blue car”), relations (“person holding umbrella”), and OCR-guided cues (“Diet Coke bottle”). This is essential for robotics, document understanding, and interactive vision systems.
How It Works
Open-vocabulary segmentation models typically:
- Encode the image into feature maps (via CNN, ViT, or early-fusion Transformer)
- Encode the text prompt and compute similarity or cross-attention between text and image features
- Produce instance masks, via one of:
  - Dot-product decoding: a learned query embedding is dot-producted with upsampled image features to produce binary masks
  - Fixed-query decoders (e.g., Mask2Former): a set of object queries cross-attend to image features; Hungarian matching assigns them to ground truth
  - Autoregressive interfaces (e.g., Chain-of-Perception): the model predicts <coord> → <size> → <seg> tokens sequentially, then decodes masks from the seg-token embeddings
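The dot-product decoding step above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes and an assumed sigmoid threshold, not any specific model's decoder:

```python
import numpy as np

def dot_product_decode(query_emb, feature_map, threshold=0.5):
    """Decode a binary mask by dot-producting a query embedding
    against per-pixel image features (illustrative sketch).

    query_emb:   (D,)      learned query embedding for one instance
    feature_map: (H, W, D) upsampled per-pixel image features
    """
    logits = feature_map @ query_emb        # (H, W) per-pixel similarity logits
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> per-pixel probability
    return probs > threshold                # binary instance mask

# Toy example: a 4x4 feature map with 8-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 4, 8))
query = rng.standard_normal(8)
mask = dot_product_decode(query, feats)
print(mask.shape)  # (4, 4)
```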
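Hungarian matching in fixed-query decoders assigns each ground-truth instance to the predicted query with which it forms the lowest total cost. For illustration, the sketch below brute-forces the optimal assignment over permutations under a toy (1 − IoU) cost; this agrees with the Hungarian algorithm on small inputs, and real training loops instead use an efficient solver (e.g., `scipy.optimize.linear_sum_assignment`) with a combined classification-and-mask cost:

```python
import itertools
import numpy as np

def hungarian_match(pred_masks, gt_masks):
    """Optimal one-to-one assignment of predicted masks to ground-truth
    masks under a (1 - IoU) cost. Brute force over permutations for
    clarity; equivalent to Hungarian matching on small inputs."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    n_gt = len(gt_masks)
    best_cost, best_assign = float("inf"), None
    # Try every way of assigning a distinct prediction to each ground truth.
    for perm in itertools.permutations(range(len(pred_masks)), n_gt):
        cost = sum(1.0 - iou(pred_masks[p], gt_masks[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    # best_assign[g] is the prediction index matched to ground truth g.
    return best_assign, best_cost

# Toy example: two predictions that exactly copy the two ground-truth
# masks, listed in swapped order.
gt = [np.array([[1, 1], [0, 0]], bool), np.array([[0, 0], [1, 1]], bool)]
preds = [gt[1], gt[0]]
assign, cost = hungarian_match(preds, gt)
print(assign, cost)  # (1, 0) 0.0
```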
A key challenge is presence calibration: the model must reliably report “absent” when the prompted object is not in the scene, rather than producing a mask for every prompt.
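One common way to frame presence calibration is to score each prompt by its maximum text-image similarity over pixels and report “absent” below a calibrated threshold. The sketch below is an assumption-laden illustration, not a specific model's recipe: the cosine-similarity scoring and the threshold value `tau` are both placeholders (in practice the threshold would be calibrated on held-out data):

```python
import numpy as np

def presence_score(text_emb, feature_map):
    """Max cosine similarity between a text embedding and any pixel
    feature: a simple presence score (illustrative sketch)."""
    feats = feature_map / np.linalg.norm(feature_map, axis=-1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    return float((feats @ text).max())

def predict_presence(text_emb, feature_map, tau=0.3):
    # tau is an assumed threshold, calibrated on held-out data in practice.
    return presence_score(text_emb, feature_map) >= tau

# Toy example: plant a prompt embedding that matches one pixel exactly,
# so its presence score is (numerically) 1.0.
rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 4, 8))
present = feats[2, 3].copy()
print(round(presence_score(present, feats), 3))  # 1.0
print(predict_presence(present, feats))          # True
```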
Key Sources
Related Concepts
Open Questions
- How to improve presence calibration (measured, e.g., by MCC, the Matthews correlation coefficient) without sacrificing recall?
- What is the right benchmark design for open-vocabulary segmentation? (RefCOCO is saturated; PBench is one alternative)
- Can open-vocabulary segmentation scale to video and 3D point clouds with the same architectures?