
Segment Anything


Overview

Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmentation) and the largest segmentation dataset ever assembled (SA-1B, containing over 1 billion masks on 11 million images). The paper's core argument is that segmentation needs its own "GPT-3 moment": a single model trained on broad data that generalizes to new tasks and domains without retraining.

The key insight is that by defining segmentation as a prompt-driven task -- where a user provides points, bounding boxes, masks, or text, and the model returns valid segmentation masks -- the authors create an interface flexible enough to power a wide range of downstream applications (interactive editing, object proposals, instance segmentation, etc.) while being trainable with a simple per-prompt loss. This is directly analogous to how GPT-3 used prompting to unify diverse NLP tasks.
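
A minimal sketch of this promptable interface, assuming the released segment-anything Python package; the checkpoint and image paths below are placeholders:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H variant of SAM (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once, when the image is set.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Any prompt type returns a valid mask; here, a box prompt (x1, y1, x2, y2).
box = np.array([100, 150, 400, 500])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask plus a confidence score
```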

SAM achieved remarkable zero-shot transfer, performing competitively with -- and in some settings exceeding -- prior fully supervised, task-specific models on edge detection, object proposal generation, instance segmentation, and text-to-mask prediction, all without task-specific fine-tuning. The model processes prompts in real time (~50ms per mask) thanks to its lightweight decoder, while the heavy image encoder runs only once per image. With ~19,700 citations by early 2026, SAM became one of the most impactful computer vision papers of the decade and spawned a large ecosystem of derivative work across medical imaging, remote sensing, video segmentation, and 3D scene understanding.

Key Contributions

  • Promptable segmentation task: Defined a new task where any segmentation prompt (point, box, mask, or text) produces a valid segmentation mask, enabling a single pretrained model to serve as a component in diverse downstream systems
  • Segment Anything Model (SAM): A three-component architecture -- a ViT-based image encoder (MAE-pretrained), a flexible prompt encoder, and a lightweight transformer mask decoder -- that produces high-quality masks in real time
  • Data engine for SA-1B: A three-stage annotation pipeline (assisted-manual, semi-automatic, fully automatic) that used SAM-in-the-loop to progressively scale annotation from human-assisted to fully automatic, producing 1.1 billion masks on 11 million images
  • SA-1B dataset: The largest segmentation dataset by an order of magnitude, with 400x more masks than any prior dataset, high geographic and subject diversity, and responsible data practices (face blurring, license compliance)
  • Zero-shot transfer: Demonstrated strong generalization across 23 diverse segmentation datasets without any fine-tuning, establishing that foundation model scaling works for dense prediction tasks

Architecture / Method

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Segment Anything Model (SAM)               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Input Image   β”‚                β”‚ Prompts           β”‚   β”‚
β”‚  β”‚ (1024x1024)   β”‚                β”‚ β”Œβ”€β”€β”€β” β”Œβ”€β”€β”€β” β”Œβ”€β”€β”β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚ β”‚Ptsβ”‚ β”‚Boxβ”‚ β”‚Mkβ”‚β”‚   β”‚
β”‚         β”‚                        β”‚ β””β”€β”¬β”€β”˜ β””β”€β”¬β”€β”˜ β””β”¬β”€β”˜β”‚   β”‚
β”‚         β”‚  (runs once            β”‚   β””β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”˜  β”‚   β”‚
β”‚         β”‚   per image)           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚         β–Ό                                  β”‚            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Image Encoder   β”‚          β”‚  Prompt Encoder      β”‚  β”‚
β”‚  β”‚  ViT-H (632M)    β”‚          β”‚  Sparse: embeddings  β”‚  β”‚
β”‚  β”‚  MAE-pretrained   β”‚          β”‚  +positional enc     β”‚  β”‚
β”‚  β”‚       β”‚           β”‚          β”‚  Dense: convolutions  β”‚  β”‚
β”‚  β”‚       β–Ό           β”‚          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚  β”‚  64x64 feature    β”‚                      β”‚            β”‚
β”‚  β”‚  map (256-dim)    β”‚                      β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚            β”‚
β”‚         β”‚                                   β”‚            β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                        β–Ό                                 β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚         β”‚   Lightweight Mask Decoder   β”‚                 β”‚
β”‚         β”‚   (2-layer transformer)      β”‚                 β”‚
β”‚         β”‚   prompt-to-image cross-attn β”‚                 β”‚
β”‚         β”‚   image-to-prompt cross-attn β”‚                 β”‚
β”‚         β”‚          β”‚                   β”‚                 β”‚
β”‚         β”‚          β–Ό                   β”‚                 β”‚
β”‚         β”‚   4x upsample + dynamic MLP β”‚                 β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                    β”‚  (~50ms on CPU)                     β”‚
β”‚                    β–Ό                                     β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚         β”‚  3 Masks + IoU Scoresβ”‚                         β”‚
β”‚         β”‚  (multi-granularity) β”‚                         β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

SAM architecture overview showing image encoder, prompt encoder, and mask decoder

SAM consists of three components designed to separate the expensive image processing from the interactive prompting:

Image Encoder: A Vision Transformer (ViT-H by default: 632M parameters, 16x16 patches) pretrained with Masked Autoencoders (MAE). The encoder processes a 1024x1024 input image and produces a 64x64 feature map with 256-dimensional embeddings. Crucially, this runs only once per image, enabling real-time interactive use.
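
As a quick shape check on the numbers above (a back-of-the-envelope sketch, not SAM code):

```python
# 1024x1024 input with 16x16 patches -> 64x64 token grid; the encoder's
# output neck projects each token to 256 channels.
image_size, patch_size, out_channels = 1024, 16, 256
grid = image_size // patch_size
print(f"image embedding: {out_channels} x {grid} x {grid}")  # 256 x 64 x 64
```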

Prompt Encoder: Handles two types of prompts. Sparse prompts (points, boxes, text) are encoded as learned embeddings plus positional encodings. Points use a learned embedding per foreground/background label plus a positional encoding of the point location. Boxes are represented as two points (top-left, bottom-right) with learned embeddings. Text prompts use CLIP's text encoder. Dense prompts (masks) are encoded via convolutions (two 2x2 stride-2 convolutions plus a 1x1 conv) and summed element-wise with the image embedding.
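
A simplified sketch of the sparse point encoding described above (a random-Fourier positional encoding of the location plus a learned foreground/background embedding); the projection matrix, normalization, and dimensions here are illustrative, not the released implementation:

```python
import torch

embed_dim = 256
fourier_matrix = torch.randn(2, embed_dim // 2)   # fixed random projection of (x, y)
label_embed = torch.nn.Embedding(2, embed_dim)    # 0 = background, 1 = foreground

def encode_point(xy, label, image_size=1024):
    coords = torch.tensor(xy, dtype=torch.float32) / image_size   # normalize to [0, 1]
    proj = 2 * torch.pi * coords @ fourier_matrix
    pos_enc = torch.cat([proj.sin(), proj.cos()], dim=-1)         # (embed_dim,)
    return pos_enc + label_embed(torch.tensor(label))             # one sparse prompt token

token = encode_point((512.0, 300.0), label=1)
print(token.shape)  # torch.Size([256])
```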

Mask Decoder: A lightweight transformer decoder (two layers) that uses prompt-to-image and image-to-prompt cross-attention, followed by a dynamic mask prediction head. The decoder takes as input the image embedding, prompt embeddings, and a set of output tokens (one per predicted mask plus an IoU prediction token). After the transformer layers, the image embedding is upsampled 4x via transposed convolutions, and a dynamic linear classifier (MLP-generated per-mask weights) produces the final masks at 256x256 resolution. The entire decoder runs in ~50ms on a CPU.
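
A minimal sketch of the dynamic mask head, assuming the 256-dim embeddings and 4x-upsampled 256x256 grid described above; the hypernetwork MLP and channel count are illustrative stand-ins rather than the released code:

```python
import torch

embed_dim = 256
# Upsampled image embedding after the two transposed convolutions.
upsampled = torch.randn(1, embed_dim // 8, 256, 256)   # (B, 32, 256, 256)
mask_token = torch.randn(1, embed_dim)                 # one output token per predicted mask

# The output token is mapped to per-mask channel weights...
hypernet = torch.nn.Sequential(
    torch.nn.Linear(embed_dim, embed_dim), torch.nn.ReLU(),
    torch.nn.Linear(embed_dim, embed_dim // 8),
)
weights = hypernet(mask_token)                         # (B, 32)

# ...which are dot-producted with the embedding at every spatial location.
mask_logits = torch.einsum("bc,bchw->bhw", weights, upsampled)
print(mask_logits.shape)  # torch.Size([1, 256, 256])
```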

Ambiguity-aware output: When a prompt is ambiguous (e.g., a single point could refer to a part, subpart, or whole object), SAM predicts 3 masks simultaneously at different granularity levels, each with a predicted IoU score. The highest-IoU mask is selected by default, but the multi-mask output enables downstream systems to choose the appropriate granularity.
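
Using the predictor from the earlier sketch, the ambiguity-aware output looks like this (the click coordinates are placeholders):

```python
import numpy as np

# A single foreground click is ambiguous, so request all three granularities
# and keep the mask with the highest predicted IoU.
point = np.array([[320, 240]])
label = np.array([1])   # 1 = foreground click
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
best = masks[int(np.argmax(scores))]   # part / sub-part / whole-object candidates
```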

Training: SAM is trained with a combination of focal loss and dice loss on per-mask predictions. The model is trained to predict a valid mask for any prompt, where "valid" means it corresponds to at least one reasonable object or region. Training uses the SA-1B data with a mixture of point, box, and mask prompts simulated from ground truth.
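
A hedged sketch of that per-mask loss -- focal plus dice on the predicted mask logits (the paper reports a 20:1 focal-to-dice weighting); this is an illustration, not the training code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # Focal loss down-weights easy pixels via the (1 - p_t)^gamma factor.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    # Dice loss measures mask overlap, complementing the pixel-wise focal term.
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

logits = torch.randn(4, 256, 256)                     # predicted mask logits
target = torch.randint(0, 2, (4, 256, 256)).float()   # ground-truth masks
loss = 20.0 * focal_loss(logits, target) + dice_loss(logits, target)
```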

Data engine stages and SA-1B dataset statistics

Data engine: The annotation pipeline that created SA-1B operated in three stages:

  1. Assisted-manual (Stage 1): Professional annotators labeled masks using SAM with a browser-based tool. SAM provided automatic suggestions that annotators corrected. This produced 4.3M masks on 120K images.
  2. Semi-automatic (Stage 2): SAM automatically detected confident masks, and annotators labeled additional unannotated objects, increasing mask diversity. This produced 5.9M additional masks on 180K images.
  3. Fully automatic (Stage 3): SAM generated masks automatically using a 32x32 grid of point prompts per image, with NMS and filtering for quality (see the sketch below). This produced the bulk of the dataset: ~1.1B masks on 11M images.
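
Stage 3-style automatic mask generation is exposed by the released SamAutomaticMaskGenerator; a sketch reusing the sam model and image from the earlier example (the thresholds shown are illustrative):

```python
from segment_anything import SamAutomaticMaskGenerator

generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # the 32x32 grid of point prompts
    pred_iou_thresh=0.88,          # keep only confidently predicted masks
    stability_score_thresh=0.95,   # keep only masks stable under thresholding
)
masks = generator.generate(image)  # list of dicts: 'segmentation', 'area', 'predicted_iou', ...
print(len(masks), "masks generated")
```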

Results

Zero-shot transfer results on diverse segmentation benchmarks

Zero-shot single-point segmentation: Human evaluators rated SAM's masks as higher quality than those from the strongest baseline (RITM) in 7 out of 9 experiments when using a single foreground point prompt. SAM's masks improved further with additional prompts.

| Task | Metric | SAM (zero-shot) | Best prior (task-specific) |
|------|--------|-----------------|----------------------------|
| Edge detection (BSDS500) | ODS | 76.8 | 78.8 (HED, trained) |
| Object proposals (LVIS) | AR@1000 | 59.3 | 63.0 (ViTDet-H) |
| Instance segmentation (LVIS) | AP | comparable | ViTDet (trained on LVIS) |
| Single-point valid mask | Human preference | 7/9 wins | RITM |

Key findings:

  β€’ SAM produces highly coherent object boundaries, even for objects not represented in its training data
  β€’ Performance scales with encoder size: ViT-H > ViT-L > ViT-B across all evaluations
  β€’ Multi-point prompting significantly improves mask quality (as expected), with diminishing returns after ~5-9 points
  β€’ Box prompts generally outperform single-point prompts and approach oracle-prompted performance
  β€’ On LVIS (1203 categories), SAM's zero-shot AR@1000 of 59.3 does not exceed ViTDet-H's 63.0 overall, though SAM outperforms on medium/large objects and on rare/common categories; ViTDet-H's advantage on small and frequent objects reflects LVIS-specific training biases
  β€’ Automatic mask generation at scale (the 32x32 grid strategy) produces high-quality masks suitable for downstream model training

Limitations & Open Questions

  • No semantic understanding: SAM segments objects but does not classify them -- it produces masks without labels. Combining SAM with classification models (as in Grounded-SAM) is an active area
  • Fine structures: Performance degrades on thin structures (bicycle spokes, fence wires) and highly occluded objects
  • Real-time full pipeline: While the decoder is fast, the ViT-H image encoder takes ~0.15s per image on an A100, making truly real-time video segmentation challenging without model distillation (addressed by SAM 2 and EfficientSAM)
  • Text prompts: The text-to-mask capability was trained but not extensively evaluated; it was noted as a proof of concept rather than a mature feature
  • 3D and video: SAM operates on single images; extending to temporally consistent video segmentation and 3D scene segmentation are natural next steps (addressed by SAM 2 in 2024)
  • Domain gaps: While zero-shot transfer is strong, performance on specialized domains (medical imaging, satellite imagery) benefits from fine-tuning, raising questions about the limits of foundation model generalization

Connections

Related papers in the wiki:

  β€’ An Image Is Worth 16X16 Words Transformers For Image Recognition At Scale -- ViT, the backbone architecture for SAM's image encoder
  β€’ Learning Transferable Visual Models From Natural Language Supervision -- CLIP, whose text encoder SAM uses for text prompts; both are vision foundation models trained on massive data
  β€’ Deep Residual Learning For Image Recognition -- ResNet, the prior dominant paradigm for vision backbones that ViT/SAM superseded
  β€’ Imagenet Classification With Deep Convolutional Neural Networks -- AlexNet, the beginning of the deep learning vision era that SAM's foundation model approach extends
  β€’ Language Models Are Few Shot Learners -- GPT-3, the NLP foundation model whose prompting paradigm directly inspired SAM's promptable segmentation task
  β€’ Attention Is All You Need -- Transformer architecture underlying both SAM's encoder and decoder
  β€’ Foundation Models -- SAM as a seminal vision foundation model demonstrating that the foundation model paradigm extends to dense prediction
  β€’ Perception -- SAM's segmentation capabilities are relevant to perception pipelines, particularly for generating training data and interactive annotation