What It Is

Open-vocabulary segmentation is the task of producing pixel-level masks for objects specified by arbitrary natural language prompts, without restriction to a fixed set of predefined classes. It combines the generalization of open-vocabulary detection with the spatial precision of instance segmentation.
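Concretely, the task can be viewed as a function from an image plus free-form text prompts to per-prompt binary masks (or "absent"). The stub below is an illustrative contract only, not any specific library's API:

```python
import numpy as np

def open_vocab_segment(image: np.ndarray, prompts: list[str]) -> dict:
    """Illustrative interface: map each free-form prompt to a binary
    mask over the image, or None when the object is absent.
    A real model would run image/text encoders here; this stub only
    fixes the input/output contract."""
    h, w, _ = image.shape
    return {p: None for p in prompts}  # placeholder "absent" predictions

preds = open_vocab_segment(np.zeros((4, 4, 3)),
                           ["blue car", "person holding umbrella"])
```

The key point of the contract is that the prompt set is unbounded: any string is a valid query, unlike a closed-vocabulary model's fixed class list.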

Why It Matters

Closed-vocabulary segmentation models (e.g., Mask R-CNN, Panoptic FPN) are limited to the classes seen during training. Open-vocabulary segmentation enables grounding to any concept expressible in language — including attributes (“blue car”), relations (“person holding umbrella”), and OCR-guided cues (“Diet Coke bottle”). This is essential for robotics, document understanding, and interactive vision systems.

How It Works

Open-vocabulary segmentation models typically:

  1. Encode the image into feature maps (via CNN, ViT, or early-fusion Transformer)
  2. Encode the text prompt and compute similarity or cross-attention between text and image features
  3. Produce instance masks, either via:
    • Dot-product decoding: the dot product between a learned query embedding and upsampled per-pixel image features yields mask logits, which are thresholded into binary masks
    • Fixed-query decoders (e.g., Mask2Former): a fixed set of object queries cross-attends to image features; Hungarian matching assigns queries to ground-truth instances during training
    • Autoregressive interfaces (e.g., Chain-of-Perception): the model predicts <coord> → <size> → <seg> tokens sequentially, then decodes masks from the seg-token embeddings

A key challenge is presence calibration: the model must reliably output “absent” for objects not in the scene, not just produce masks whenever prompted.
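One simple way to realize presence calibration is to gate each mask on a scalar presence score. The sketch below uses the max per-pixel probability as that score; this is an assumed heuristic for illustration (real systems often train a dedicated presence head):

```python
import numpy as np

def masks_with_presence(mask_probs, mask_tau=0.5, presence_tau=0.6):
    """For each query, emit a mask only when a presence score clears a
    threshold; otherwise report the object as absent.
    Presence score = max per-pixel probability (assumed heuristic)."""
    out = []
    for probs in mask_probs:                 # probs: (H, W) soft mask
        if float(probs.max()) < presence_tau:
            out.append(None)                 # calibrated "absent"
        else:
            out.append(probs > mask_tau)     # binary mask
    return out

soft = np.stack([
    np.full((4, 4), 0.1),                    # prompt 1: nothing in scene
    np.where(np.eye(4) > 0, 0.9, 0.1),       # prompt 2: diagonal object
])
preds = masks_with_presence(soft)            # [None, 4x4 bool mask]
```

The trade-off mentioned below is visible here: raising presence_tau suppresses spurious masks but also rejects low-confidence true positives, which is why calibration and recall must be balanced.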

Key Sources

Open Questions

  • How to improve presence calibration (MCC) without sacrificing recall?
  • What is the right benchmark design for open-vocabulary segmentation? (RefCOCO is saturated; PBench is one alternative)
  • Can open-vocabulary segmentation scale to video and 3D point clouds with the same architectures?