What It Is

Promptable segmentation is the task of producing a pixel-level mask for the object implied by a user-provided cue (a point, bounding box, rough mask sketch, or text description), without requiring the model to be retrained for that specific object category.

Why It Matters

Classical segmentation models are closed-vocabulary: train Mask R-CNN on COCO and it can only segment the 80 COCO classes. Promptable segmentation breaks this constraint — the “what to segment” is specified at inference time by the prompt, not baked into the model at training time. This enables interactive annotation, domain transfer, and general-purpose vision pipelines.

How It Works

The key design is separating the image representation from the prompt interpretation:

  1. Image encoder (heavy, runs once): produces dense feature vectors for every image region
  2. Prompt encoder (lightweight): turns the user cue into embeddings the decoder can use
  3. Mask decoder (lightweight, fast): cross-attends the prompt embeddings to the image features and upsamples the result to a binary mask

Because the heavy image encoding can be precomputed once per image, each new prompt can be answered at interactive speeds (on the order of 50 ms). To handle ambiguity — a click on a shirt could mean the shirt or the whole person — the model may output multiple candidate masks at different granularity levels.
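The decoupled encode-once, prompt-many-times flow can be sketched as follows. This is a toy illustration under stated assumptions, not SAM's actual architecture: random projections stand in for learned weights, and the "decoder" is a single dot-product attention step rather than a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64        # embedding dimension (illustrative choice)
H = W = 16    # spatial size of the feature map

def image_encoder(image):
    """Heavy backbone (stand-in): dense D-dim features per region.
    Real systems use a large ViT here; a random projection is a placeholder."""
    proj = rng.standard_normal((image.shape[-1], D))
    return image @ proj                              # (H, W, D)

def prompt_encoder(point, size=(H, W)):
    """Lightweight: turn an (x, y) click into a D-dim embedding
    via a sinusoidal positional encoding (random frequencies)."""
    y, x = point
    pos = np.array([y / size[0], x / size[1]])
    freqs = rng.standard_normal((2, D))
    return np.sin(pos @ freqs)                       # (D,)

def mask_decoder(features, prompt_emb):
    """Lightweight: score every image location against the prompt
    (one cross-attention step), then threshold to a binary mask."""
    flat = features.reshape(-1, D)                   # (H*W, D)
    scores = flat @ prompt_emb / np.sqrt(D)          # attention logits
    return scores.reshape(H, W) > 0                  # (H, W) bool mask

# The heavy encoding runs ONCE per image...
image = rng.standard_normal((H, W, 3))
feats = image_encoder(image)

# ...then each new prompt reuses it, so only the cheap decoder runs per click.
for click in [(2, 3), (10, 12)]:
    mask = mask_decoder(feats, prompt_encoder(click))
    print(mask.shape, mask.sum())
```

The design point is the asymmetry: all per-prompt work happens in `prompt_encoder` and `mask_decoder`, which touch only small tensors, so interactive latency is independent of the backbone's cost.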

Key Sources