Concepts: foundation models | promptable segmentation | zero-shot transfer | open-vocabulary segmentation Builds on: ViT | CLIP Leads to: (SAM 2 — video, 2024) | (Grounded-SAM, SAM + DINO)
Part 1: The problem
Every segmentation model before SAM was a specialist. Want to detect tumors in CT scans? Train a new model. Want to identify pothole edges in drone footage? Train another one. These models had different codebases, different label formats, different input resolutions, and none of them transferred. The same image — say, a photograph of a street — required entirely different models depending on whether you wanted “all cars,” “the car in front,” or “any object the user clicks on.” Computer vision had GPT-2-era NLP energy: every task its own silo, no shared foundation.
The question SAM asks is: what would a foundation model for vision segmentation even look like? NLP got one by training on text completion. What’s the vision analogue?
Part 2: How SAM works
The rubber stamp analogy
Think of a rubber stamp kit. You have a blank piece of paper (the image), a set of stamps (the prompts — dots, rectangles, freehand outlines, text descriptions), and an ink pad (the model). You press a stamp onto the paper at a specific location, and the model fills in the precise boundary of whatever object lives at that spot. Press a dot on the dog’s nose: you get the dog mask. Press a box around the car: you get the car mask. Press multiple dots: you get all the objects those dots point to.
The genius is separating the “where to look” (your prompt) from the “how to segment” (the model). The model never needs to know ahead of time what class of objects exists in your domain. It just needs to know: here’s a cue, find the boundary.
This is what the paper calls a “promptable segmentation task”: given an image and any spatial or semantic cue, return a valid segmentation mask for the object implied by that cue.
The mechanism
Three components working in sequence:
Image encoder. A heavyweight ViT-H — the biggest standard Vision Transformer — pretrained with MAE (masked autoencoder). It processes the raw image once and produces a dense embedding: a grid of feature vectors, one per patch, capturing what’s in every part of the image. This is the slow step. But it runs only once per image, regardless of how many prompts you throw at it afterward.
Prompt encoder. Lightweight. Takes the prompt (a point, a bounding box, a rough mask, or text) and encodes it into a small set of embeddings. Points and boxes become positional encodings summed with learned embeddings for “foreground point” vs. “background point”; text prompts are passed through CLIP’s text encoder.
Mask decoder. Very fast. Runs in ~50ms on CPU. Takes the image embedding + prompt embeddings and outputs up to three candidate masks (for ambiguous prompts — if you click on a table, do you want the table surface, or the whole table plus its legs, or the entire dining set?) plus a confidence score for each. The decoder uses cross-attention: the prompt attends to the image features to find the relevant region, then a lightweight upsampling head produces the final pixel mask.
SAM Architecture — one image embedding, fast prompt → mask inference:
─────────────────────────────────────────────────────────────────────────
Image (1024×1024)
│
┌──────────▼──────────┐
│ ViT-H Image Encoder │ ← runs once, ~10s on CPU
│ (MAE pretrained) │
└──────────┬──────────┘
│ image embedding (64×64×256)
▼
┌──────────────────────────────┐
Prompt ──► │ Mask Decoder │
(point / │ prompt embeds ↔ image │
box / │ cross-attention + upsample │ ← runs in ~50ms
mask / │ │
text) └──────────┬───────────────────┘
│
┌──────────▼────────┐
│ 3 candidate masks │ (+ IoU scores)
└───────────────────┘
─────────────────────────────────────────────────────────────────────────
The heavy lifting happens once. Every subsequent prompt is nearly free.
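This amortization is easy to see in code. Below is a toy sketch of the encode-once / prompt-many split — the class, the fake “embedding,” and the fake mask rule are all illustrative, not the real SAM implementation:

```python
import numpy as np

class PromptableSegmenter:
    """Toy sketch of SAM's encode-once / prompt-many split (not the real API)."""

    def __init__(self):
        self.encoder_calls = 0
        self.embedding = None

    def set_image(self, image):
        # stand-in for the ViT-H forward pass: the only expensive step
        self.encoder_calls += 1
        self.embedding = image.mean(axis=-1)  # fake (64, 64) "embedding"

    def predict(self, point):
        # stand-in for the ~50ms decoder: reuses the cached embedding
        y, x = point
        return self.embedding > self.embedding[y, x] * 0.5  # fake mask

seg = PromptableSegmenter()
seg.set_image(np.random.default_rng(0).random((64, 64, 3)))
for p in [(10, 10), (32, 32), (50, 5)]:
    mask = seg.predict(p)  # each prompt is nearly free
```

The actual `segment_anything` package exposes the same split: `SamPredictor.set_image(...)` runs the image encoder once, and each subsequent `predictor.predict(...)` call reuses the cached embedding.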
The math (only what matters)
The mask decoder uses two-way cross-attention. Let $T \in \mathbb{R}^{N_t \times d}$ be the prompt token embeddings and $I \in \mathbb{R}^{N_i \times d}$ be the image embedding (flattened to a sequence of patch vectors). The core operation is:

$$T \leftarrow T + \operatorname{softmax}\!\left(\frac{(T W_Q)(I W_K)^{\top}}{\sqrt{d}}\right) I W_V$$
Translation: the prompt tokens ask questions of the image. Each prompt token asks “which image patches look relevant to me?” — and assembles a weighted mixture of image features as its answer.
Then the reverse direction:

$$I \leftarrow I + \operatorname{softmax}\!\left(\frac{(I W_Q')(T W_K')^{\top}}{\sqrt{d}}\right) T W_V'$$
Translation: the image patches, having heard the prompt, now update their own representations based on what the prompt is looking for.
This bidirectional exchange happens twice in the lightweight decoder (2 layers). Then a final upsampling head takes the updated image features and produces a 256×256 mask, which is bilinearly upsampled to the original resolution.
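A minimal numpy sketch of one such bidirectional exchange — single-head attention with the learned projection matrices, LayerNorms, and MLP blocks of the real decoder omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q, kv, d=256):
    # single-head attention without learned projections -- illustration only
    scores = q @ kv.T / np.sqrt(d)  # (Nq, Nk) similarity of each query to each key
    return softmax(scores) @ kv     # weighted mixture of kv rows per query

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1, 256))    # one prompt token
image = rng.standard_normal((4096, 256))  # 64x64 patch embeddings, flattened

# direction 1: prompt attends to image ("which patches look relevant to me?")
tokens = tokens + cross_attn(tokens, image)
# direction 2: image attends to prompt ("what is the prompt looking for?")
image = image + cross_attn(image, tokens)
```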
Why output three masks? Because segmentation is ambiguous. If you click on a wheel, is the intended object: this wheel, this car, or everything on the road? The paper calls this “mask ambiguity” and handles it by predicting one mask at each of three granularity levels (subpart, part, whole). The predicted IoU score tells you which one the model is most confident about.
Numeric walkthrough
Let’s trace a simple case: one foreground point at pixel location (512, 512) in a 1024×1024 image.
Step 1 — Image encoding. The 1024×1024 image is divided into 16×16-pixel patches (a 64×64 grid, 4096 patches total), each projected to a 256-dim embedding by the ViT-H encoder. The encoder uses window attention (local windows, not global) for most layers, with 4 global attention layers. Output: a 64×64×256 feature map.
Step 2 — Prompt encoding. The point (512, 512) maps to patch position (32, 32) in the 64×64 grid. It gets encoded as: a sinusoidal positional embedding of dimension 256, plus a learned “foreground point” embedding added elementwise. Result: one 256-dim token $t$.
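A sketch of that encoding step, following the description above. (SAM’s actual implementation uses a learned random-frequency positional encoding; the classic sinusoidal variant and the 0.02 initialization scale here are illustrative assumptions.)

```python
import numpy as np

def sinusoidal_pe(x, y, dim=256):
    # standard sine/cosine encoding over the two grid coordinates (toy version;
    # SAM itself uses random Fourier features -- this just illustrates the shape)
    half = dim // 4
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs),
                           np.sin(y * freqs), np.cos(y * freqs)])  # (dim,)

# stand-in for the learned "foreground point" embedding
foreground_embed = np.random.default_rng(0).standard_normal(256) * 0.02

# point (512, 512) in a 1024x1024 image -> position (32, 32) on the 64x64 grid
token = sinusoidal_pe(32.0, 32.0) + foreground_embed
```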
Step 3 — Decoder cross-attention. The token $t$ (1×256) attends to the image embedding $I$ (4096×256), computing an attention score against each of the 4096 image patches. The patches near position (32, 32) and their nearby context get high scores. The decoder fuses this into an updated representation.
Step 4 — Upsampling. The decoder outputs a 256×256 map of mask logits. This gets bilinearly upsampled to 1024×1024. Pixels with logit above 0.0 are classified as foreground.
Step 5 — Three masks. The decoder produces three masks at different scales: e.g., “just the nose,” “the whole face,” “the person.” The predicted IoU scores might be [0.91, 0.87, 0.73]. The highest-confidence mask (the nose) gets returned as the default.
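Steps 4–5 reduce to an argmax over the predicted IoU scores plus a threshold at logit 0. A toy version, with 4×4 logit maps standing in for the real 256×256 ones:

```python
import numpy as np

# three candidate mask logit maps (toy 4x4 instead of 256x256)
rng = np.random.default_rng(0)
mask_logits = rng.standard_normal((3, 4, 4))
iou_scores = np.array([0.91, 0.87, 0.73])  # decoder's confidence per mask

best = int(np.argmax(iou_scores))  # pick the most confident candidate
mask = mask_logits[best] > 0.0     # threshold logits at 0 -> binary mask
```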
Total time after image encoding is done: ~50ms on CPU, ~5ms on GPU.
The data engine — the real clever part
The model is only half the story. The other half is how they built 1 billion training masks without paying 1 billion humans.
The paper describes a three-stage data engine:
“In the first stage, SAM was trained using publicly available segmentation datasets and used interactively by annotators to annotate images. In the second stage, SAM was used in a semi-automatic mode to generate candidate masks… In the third stage, SAM was used in a fully automatic mode.”
Translation: the model bootstrapped its own training data. Round 1 — humans annotated masks using SAM as a tool (SAM suggested, humans corrected). Round 2 — SAM auto-generated high-confidence masks, humans annotated what was missed. Round 3 — SAM ran fully automatically on 11M images, generating ~100 masks per image.
Each round made the model better, which made the annotations faster, which created more data. A flywheel.
The final dataset, SA-1B, has:
- 11M images (licensed, privacy-respecting)
- 1.1 billion masks (~400× more than Open Images, the previous largest)
- ~100 masks per image on average
- 99.1% of masks auto-generated in Stage 3
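The numbers above are internally consistent, as a quick back-of-envelope check shows:

```python
images = 11_000_000
masks = 1_100_000_000
assert round(masks / images) == 100  # ~100 masks per image, as quoted
auto_masks = 0.991 * masks           # Stage 3's fully automatic share
assert auto_masks > 1_000_000_000    # over a billion masks never drawn by a human
```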
This is the equivalent of CLIP’s WIT dataset for segmentation — a scale that nobody else could match because the data collection itself required the model they were trying to train.
“We find that SAM trained with our data engine performs competitively with or superior to prior work even with simpler architectures.”
Translation: the data mattered as much as the architecture.
What’s clever
The insight is architectural modularity as a design principle. Previous foundation model attempts for vision (ViT pretrained on ImageNet, CLIP trained contrastively) produced image features, but not masks. SAM adds the mask decoder as a thin, task-specific head — one that is fast enough to run interactively (under 50ms) but powerful enough to handle ambiguity.
The key trick is the inversion of the prompt–image relationship: instead of prompting the image to output text (like CLIP), SAM prompts the image to output spatial structure. And by keeping the decoder lightweight, the expensive image encoding becomes amortized over all the prompts an interactive user might throw at it in a session.
“By pre-computing the image embedding and then prompting the model with different inputs in real-time, the user experience is responsive.”
Translation: encode once, interact forever. That’s what makes SAM feel like a magic wand rather than a batch segmentation pipeline.
Part 3: Does it actually work?
Zero-shot results
| Task | SAM (zero-shot) | Prior SOTA (supervised) | Notes |
|---|---|---|---|
| COCO instance segmentation (AP) | 46.5 | 51.0 (ViTDet-H) | Gap is real; SAM not optimized for this |
| Edge detection (ODS on BSDS500) | 76.8 | 84.0 (EDTER, supervised) | Competitive with supervised methods at zero-shot |
| Point-prompted segmentation (mIoU) | 78.9 | 75.9 (strongest supervised) | Beats supervised on 16 of 23 datasets |
The headline number is point-prompted segmentation: SAM zero-shot at 78.9 mIoU vs. 75.9 for the supervised best across 23 holdout datasets. That’s the headline claim the paper leans on — and it holds.
The COCO gap (46.5 vs. 51.0) is expected. COCO is a fixed-class closed-vocabulary benchmark with class labels; SAM was never trained to do closed-vocabulary recognition. It’s like measuring a polyglot translator on a Spanish-only vocabulary test.
What doesn’t work
SAM is not a recognition model. It cannot tell you what an object is — only where it is. If you want labels (“this is a car”), you need to compose SAM with something like CLIP or a classifier.
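Composition is straightforward in principle: SAM proposes masks, and a separate classifier names them. A toy sketch, where `classify` is a stand-in for a CLIP-style model and the zero-out-outside-the-mask crop is a simplification:

```python
import numpy as np

def label_masks(image, masks, classify):
    """Toy composition: SAM gives masks, a separate classifier names them.

    `classify` is a stand-in for CLIP-style classification -- SAM itself
    never produces class labels.
    """
    labels = []
    for mask in masks:
        crop = image * mask[..., None]  # zero out everything outside the mask
        labels.append(classify(crop))
    return labels

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
masks = rng.random((2, 8, 8)) > 0.5  # two fake binary masks
labels = label_masks(image, masks,
                     classify=lambda crop: "car" if crop.mean() > 0.1 else "unknown")
```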
Fine-grained structure segmentation is weak. SAM was trained primarily on salient objects. Thin structures (hair, wires, fence mesh), transparent objects (glass, water), and camouflaged objects are all identified in the paper as failure modes.
“SAM struggles with segmenting fine structures such as thin fur and the perforated structure of objects, and with low-contrast regions in images.”
Text prompts are limited. The text encoder is plugged in (CLIP’s) but performance via text is noticeably weaker than via clicks or boxes. Text-prompted SAM is not the same as language-grounded segmentation models like GLIP or Grounded-SAM.
Speed vs. quality tradeoff. ViT-H is the only encoder the paper thoroughly evaluates. Smaller encoders (ViT-B, ViT-L) exist but the zero-shot transfer quality drops — the paper acknowledges that “a smaller model may be desirable in some applications” but doesn’t fully characterize the tradeoff.
Part 4: So what?
If you’re building vision systems, SAM changes what “starting point” means. The old workflow was: collect labeled segmentation data for your domain, train a custom model, hope it generalizes. The new workflow is: prompt SAM with a few clicks, get masks, use those masks to bootstrap a domain-specific fine-tune if needed. For most interactive labeling tools, SAM is now the default under the hood — it’s what Roboflow, Label Studio, and similar annotation platforms integrated within weeks of the release.
When to use it: interactive annotation, open-ended object detection, image editing (background removal, inpainting), any task where you have spatial cues but no labels. When not to use it: when you need object class labels (SAM gives masks, not categories), when your objects are thin, transparent, or camouflaged, or when you need real-time video segmentation (SAM 2 addresses video; SAM 1 is image-only).
SAM is to segmentation what CLIP was to image-text matching: a pretrained backbone so strong that fine-tuning it with a tiny amount of domain data beats training from scratch with a lot. And like CLIP, it follows the same pattern as ViT: take a transformer architecture, pretrain it at unprecedented scale, and watch zero-shot results rival supervised baselines. The architectural innovation is modest. The data and scale are the moat.
SAM turns segmentation from a closed-vocabulary per-class problem into a general promptable interface — encode once, segment anything.
Connections
- ViT — SAM’s image encoder is a ViT-H pretrained with MAE
- CLIP — SAM’s text prompt encoder reuses CLIP; same paradigm of zero-shot transfer
- foundation models — SAM is the first vision foundation model for segmentation
- promptable segmentation — the core task SAM defines
- open-vocabulary segmentation — SAM enables this without class supervision
Citation
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. arXiv preprint arXiv:2304.02643.