Concepts: promptable-segmentation | foundation-models | vision-transformer | zero-shot-transfer
Builds on: segment-anything | an-image-is-worth-16x16-words
SAM solved promptable segmentation in still images. SAM 2 extends it to video, where the unit of work is no longer “given a click, segment this object” but “given a click on frame 17, segment this same object across all 1,800 frames, even when it disappears behind a tree at frame 432 and comes back at 580.” The trick is a streaming memory bank: each new frame attends not just to its own pixels but to a small set of “remembered” past frames whose embeddings get cached.
The core idea
The analogy: SAM is a sketch artist who draws a single portrait when shown a face. SAM 2 is the same artist asked to draw the same person across an entire photo album, where between shots the subject changes clothes, ducks behind furniture, and reappears at a different angle. To stay consistent, the artist keeps a small reference folder of previous drawings; for each new photo, they glance at the folder before drawing.
That folder is the memory bank. Each new frame:
- Goes through an image encoder that produces patch embeddings (same as SAM, but using Hiera, a hierarchical ViT).
- Cross-attends those embeddings against the memory bank, which stores spatial features and predicted masks from a small handful of recent and “important” frames.
- The mask decoder uses the memory-conditioned features to predict the mask for the current frame.
- The result and its features get encoded and pushed into the memory bank for use by future frames.
“Our model is a simple transformer architecture with streaming memory for real-time video processing.”
The whole pipeline is causal in time: frame N never sees frame N+1. This is what makes it a streaming model rather than a batch video model. You can keep pushing frames in indefinitely; cost per frame stays roughly constant.
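The per-frame loop above can be sketched as a toy in NumPy. This is a minimal illustration under stated assumptions, not the SAM 2 implementation: `encode_frame`, `decode_mask`, the token count, the feature width, and the way the mask is fused into the stored features are all hypothetical stand-ins, and a single-head softmax attention replaces the model's memory-attention layers.

```python
from collections import deque

import numpy as np

D = 16          # feature width (assumed for the toy)
TOKENS = 8      # tokens per frame (assumed; real feature maps are far larger)
MAX_RECENT = 6  # recent frames kept, matching the "~6 recent frames" in this note


def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the image encoder: returns (TOKENS, D) features."""
    return frame.reshape(TOKENS, D)


def memory_attention(queries: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Single-head cross-attention of current-frame tokens over memory tokens."""
    scores = queries @ memory.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return queries + weights @ memory  # residual connection


def decode_mask(features: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the mask decoder: one boolean per token."""
    return features.mean(axis=-1) > 0.0


def stream(frames):
    """Process frames causally; yield (mask, number of memory tokens attended)."""
    bank = deque(maxlen=MAX_RECENT)  # oldest entries fall out automatically
    for frame in frames:
        feats = encode_frame(frame)
        if bank:
            memory = np.concatenate(list(bank), axis=0)
            conditioned = memory_attention(feats, memory)
            attended = memory.shape[0]
        else:
            conditioned, attended = feats, 0  # first frame: nothing to attend to
        mask = decode_mask(conditioned)
        # Toy "memory encoder": store features zeroed outside the predicted mask.
        bank.append(feats * mask[:, None])
        yield mask, attended


rng = np.random.default_rng(0)
frames = rng.standard_normal((20, TOKENS * D))
attended = [n for _, n in stream(frames)]
```

In this toy, `attended` climbs by `TOKENS` per frame until the bank is full and then stays flat, which is the "cost per frame stays roughly constant" property. Frame N never touches frame N+1 because the loop only reads what is already in `bank`.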
What’s clever — find the instinct
The non-obvious move is treating segmentation as a sequence model, where the “tokens” are not patches but whole frame embeddings. SAM 1 handled one image at a time. The first instinct for video would be to stack frames as a 3D volume and run attention across all of them; with T frames, that cost grows as O(T^2). SAM 2 instead uses a fixed-size memory bank (typically the most recent 6 frames plus a handful of “prompted” frames where the user clicked). Each frame attends to a bounded number of stored frames, so per-frame cost is constant and total cost is O(T).
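To make the scaling argument concrete, here is a small counting sketch. It counts frame-to-frame interactions only and ignores per-token cost; the bank size of 7 (6 recent frames plus one prompted frame) is an assumption chosen to match the numbers in this note, and both function names are hypothetical.

```python
def full_attention_pairs(num_frames: int) -> int:
    """Frame-to-frame interactions if every frame attends to every frame."""
    return num_frames * num_frames


def streaming_attention_pairs(num_frames: int, bank_size: int) -> int:
    """Interactions when frame t attends to at most `bank_size` earlier frames."""
    return sum(min(t, bank_size) for t in range(num_frames))


T = 1800      # frame count from the example at the top of this note
BANK = 6 + 1  # 6 recent frames plus one prompted frame (assumed bank size)

full = full_attention_pairs(T)
streamed = streaming_attention_pairs(T, BANK)
```

For these values, `full` is 3,240,000 and `streamed` is 12,572. Doubling `T` roughly quadruples the first number and roughly doubles the second.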
The second clever move is the data engine. SAM’s 1B-mask dataset came out of an iterative human-in-the-loop process. SAM 2 reuses the trick at video scale: an early model annotates videos automatically, humans correct mistakes, the corrected data trains a stronger model, repeat. The result is the SA-V dataset, which the paper describes as the largest video segmentation dataset to date (35.5M masks across 50.9K videos).
“We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.”
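The annotate, correct, retrain cycle described above has a simple control-flow shape. The sketch below shows only that shape. It is not the paper's pipeline, and `annotate`, `correct`, and `retrain` are hypothetical callables the caller would supply. The demo stubs at the bottom are placeholders, not real models or real data.

```python
from typing import Any, Callable, List, Tuple


def data_engine(
    videos: List[str],
    model: Any,
    annotate: Callable[[Any, str], dict],
    correct: Callable[[dict], dict],
    retrain: Callable[[Any, List[dict]], Any],
    rounds: int,
) -> Tuple[Any, List[dict]]:
    """Run `rounds` passes of model-proposes, human-corrects, model-retrains."""
    dataset: List[dict] = []
    for _ in range(rounds):
        # 1. The current model proposes annotations for every video.
        proposals = [annotate(model, video) for video in videos]
        # 2. Humans (here: a callable) fix the proposals.
        dataset.extend(correct(p) for p in proposals)
        # 3. A stronger model is trained on everything collected so far.
        model = retrain(model, dataset)
    return model, dataset


# Placeholder stubs: the "model" is just a version counter.
final_model, collected = data_engine(
    videos=["clip_a", "clip_b"],
    model=0,
    annotate=lambda m, v: {"video": v, "model_version": m},
    correct=lambda p: {**p, "human_checked": True},
    retrain=lambda m, data: m + 1,
    rounds=3,
)
```

The point of the loop is that each round's proposals come from a better model than the last, so the human correction step gets cheaper over time.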
The third clever move: when no prompt has been given for the current frame (the typical case in tracking), the model still emits a mask, conditioned only on the memory bank. The same architecture handles “first-frame click then track”, “interactive correction at frame 47”, and “fresh prompt mid-video” with a single mechanism.
Walkthrough: tracking a cat through a 30-second video
Setup: 30-second video at 24 FPS = 720 frames.
User clicks on the cat in frame 0.
Frame 0 (prompt frame):
- Image encoder produces feature map (Hiera output)
- Prompt encoder embeds the click
- Memory bank is empty
- Mask decoder uses image features + click only
- Output: mask M_0
- Push (features_0, M_0, click_0) into memory bank
Frame 1 (no prompt):
- Image encoder produces features
- Memory attention: cross-attend new features against memory bank (just frame 0 so far)
- Mask decoder produces M_1 from memory-conditioned features
- Push (features_1, M_1) into memory bank
Frames 2..N:
- Same as frame 1
- Memory bank grows but is FIFO-clipped to ~6 recent frames
- The "prompted" frame 0 stays pinned (anchor frame)
Frame 432 (cat goes behind tree):
- Memory-attention provides "what cat looks like" from anchor + recent
- Mask decoder may emit empty mask (cat occluded)
- Empty mask still gets pushed; signals "cat not visible"
Frame 580 (cat reappears):
- Memory-attention recognizes cat features from frame 0 anchor
- Mask reappears, identity preserved
- Tracking resumes
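The bookkeeping in this walkthrough can be replayed in plain Python. This is a sketch of the memory bank's contents only. The masks are scripted from the scenario rather than predicted, and the pinned-list-plus-FIFO layout is an assumed reading of the description above, not the library's data structure.

```python
from collections import deque

NUM_FRAMES = 30 * 24        # 30 seconds at 24 FPS
MAX_RECENT = 6              # FIFO size for unprompted frames
PROMPTED = {0}              # frames where the user clicked
OCCLUDED = range(432, 580)  # cat hidden from frame 432, back at frame 580


def simulate(num_frames, prompted, occluded, max_recent=MAX_RECENT):
    """Return which frame indices are in memory when each frame is processed."""
    pinned = []                        # prompted frames are never evicted
    recent = deque(maxlen=max_recent)  # unprompted frames, oldest dropped first
    seen, masks = {}, {}
    for t in range(num_frames):
        seen[t] = list(pinned) + list(recent)  # memory visible to frame t
        masks[t] = "empty" if t in occluded else "cat"  # scripted, not predicted
        if t in prompted:
            pinned.append(t)
        else:
            recent.append(t)
    return seen, masks


seen, masks = simulate(NUM_FRAMES, PROMPTED, OCCLUDED)
```

Here `seen[580]` is `[0, 574, 575, 576, 577, 578, 579]`. All six recent entries are empty-mask frames from the occlusion, so the pinned frame 0 is the only memory that still shows the cat, which is why the anchor matters for re-identification.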
Compare to running SAM 1 frame by frame: each frame would be a fresh prompt, no identity propagation, the user would have to re-click after every occlusion. SAM 2’s memory bank is what eliminates the re-click problem.
Does it work? What breaks?
Headline numbers from the paper:
| Task | SAM 2 | SAM 1 (per-frame) | Prior video SOTA |
|---|---|---|---|
| Image seg (23 datasets, mIoU) | better | baseline | — |
| Image seg, speed | 6x faster | 1x | — |
| Video seg (J&F, 17 zero-shot datasets) | better | — | 3x more user clicks needed |
| Real-time inference (A100) | ~44 FPS | — | — |
The 6x image speedup over SAM 1 comes mainly from the Hiera encoder being more efficient than the original ViT-H. The 3x reduction in user interactions needed to match prior video segmentation accuracy is the headline product claim: the same annotation budget should produce roughly 3x more labeled video.
“In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM).”
What breaks:
- Long-term identity (over hundreds of frames) drifts when objects change appearance dramatically (a person putting on a hat, a car turning so its color shifts in shadow). The fixed-size memory bank can’t hold every appearance variation.
- Multiple instances of the same class (two identical-looking cats) are handled by separate masklet predictions, but identity swaps can occur when they cross.
- Real-time at 44 FPS assumes frame size around 1024x1024; native 4K video runs slower or requires downsampling.
- The streaming-only design means no use of future frames. Offline applications that could afford bidirectional context cannot benefit; this is a hard architectural choice.
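A quick budget check for the throughput caveat above, using only the figures in this note (about 44 FPS at roughly 1024x1024 input). The helper names are hypothetical, and the aspect-preserving resize is one common way to downsample, assumed here for illustration; the model's own preprocessing may differ.

```python
def processing_seconds(num_frames: int, model_fps: float) -> float:
    """Wall-clock time to process a clip at a given model throughput."""
    return num_frames / model_fps


def keeps_up(video_fps: float, model_fps: float) -> bool:
    """True if the model processes frames at least as fast as they arrive."""
    return model_fps >= video_fps


def downsample_size(width: int, height: int, long_side: int = 1024):
    """Shrink so the longer side equals `long_side`, preserving aspect ratio."""
    scale = long_side / max(width, height)
    if scale >= 1.0:
        return width, height  # already small enough; do not upsample
    return round(width * scale), round(height * scale)


clip_seconds = processing_seconds(720, 44.0)  # the 30-second, 24 FPS walkthrough clip
realtime = keeps_up(24.0, 44.0)
uhd_resized = downsample_size(3840, 2160)     # 4K UHD frame
```

Under these assumptions the 720-frame clip takes about 16.4 seconds, so a 24 FPS stream is handled with headroom, and a 3840x2160 frame shrinks to 1024x576 before it reaches the model.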
So what?
For a practitioner working with segmentation pipelines:
- Stop running SAM frame by frame. If you have video, SAM 2 with its memory bank is the better default; identity propagation comes built in.
- Treat segmentation as zero-shot infrastructure. Like CLIP for retrieval or Whisper for transcription, SAM 2 is a “default” you can drop into any pixel-level vision task without training a custom segmentor.
- For active learning, use the data engine recipe. Start with a weak model annotating, have humans correct, retrain. The 35.5M-mask SA-V dataset shows this scales further than people thought possible.
- The streaming memory pattern is a generalizable design. It is essentially a small KV cache for spatial features. Other “process video as it arrives” systems (gesture recognition, surveillance, robotics) can borrow the same shape.
For Saikat’s segmentation stack specifically: any production application that currently runs SAM 1 per frame should migrate; the speedup alone (6x) is a free win, and video tasks gain identity tracking that previously required a separate tracker. The promptable interface stays unchanged so existing tooling carries over.
Connections
- promptable-segmentation — extends SAM’s promptable interface from image to video
- foundation-models — second-generation segmentation foundation model
- vision-transformer — uses Hiera, a hierarchical ViT
- zero-shot-transfer — generalizes to unseen objects across image and video
- kv-cache — memory bank is essentially a small KV cache for spatial features
- long-context — streaming memory enables long-video processing
- segment-anything — direct predecessor; SAM 2 inherits the promptable-seg framing
- an-image-is-worth-16x16-words — patch-based vision transformers
- meta-ai-fair — author lab
Citation
Ravi, N., Gabeur, V., Hu, Y. T., et al. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint. https://arxiv.org/abs/2408.00714