Segment Anything
Read on arXiv
Overview
Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmentation) and the largest segmentation dataset ever assembled (SA-1B, containing over 1 billion masks on 11 million images). The paper's core argument is that segmentation needs its own "GPT-3 moment": a single model trained on broad data that generalizes to new tasks and domains without retraining.
The key insight is that by defining segmentation as a prompt-driven task -- where a user provides points, bounding boxes, masks, or text, and the model returns valid segmentation masks -- the authors create an interface flexible enough to power a wide range of downstream applications (interactive editing, object proposals, instance segmentation, etc.) while being trainable with a simple per-prompt loss. This is directly analogous to how GPT-3 used prompting to unify diverse NLP tasks.
SAM achieved remarkable zero-shot transfer, matching or exceeding prior task-specific models on edge detection, object proposal generation, instance segmentation, and text-to-mask prediction without any task-specific fine-tuning. The model processes prompts in real time (~50ms per mask) thanks to its lightweight decoder, while the heavy image encoder runs once per image. With ~19,700 citations by early 2026, SAM became one of the most impactful computer vision papers of the decade and spawned a large ecosystem of derivative work across medical imaging, remote sensing, video segmentation, and 3D scene understanding.
Key Contributions
- Promptable segmentation task: Defined a new task where any segmentation prompt (point, box, mask, or text) produces a valid segmentation mask, enabling a single pretrained model to serve as a component in diverse downstream systems
- Segment Anything Model (SAM): A three-component architecture -- a ViT-based image encoder (MAE-pretrained), a flexible prompt encoder, and a lightweight transformer mask decoder -- that produces high-quality masks in real time
- Data engine for SA-1B: A three-stage annotation pipeline (assisted-manual, semi-automatic, fully automatic) that used SAM-in-the-loop to progressively scale annotation from human-assisted to fully automatic, producing 1.1 billion masks on 11 million images
- SA-1B dataset: The largest segmentation dataset by an order of magnitude, with 400x more masks than any prior dataset, high geographic and subject diversity, and responsible data practices (face blurring, license compliance)
- Zero-shot transfer: Demonstrated strong generalization across 23 diverse segmentation datasets without any fine-tuning, establishing that foundation model scaling works for dense prediction tasks
Architecture / Method
                Segment Anything Model (SAM)

 ┌──────────────┐         ┌─────────────────────┐
 │  Input Image │         │       Prompts       │
 │ (1024x1024)  │         │ ┌───┐ ┌───┐ ┌───┐   │
 └──────┬───────┘         │ │Pts│ │Box│ │Msk│   │
        │                 │ └─┬─┘ └─┬─┘ └─┬─┘   │
        │ (runs once      └───┼─────┼─────┼─────┘
        │  per image)         └─────┼─────┘
        ▼                           │
 ┌──────────────────┐    ┌──────────┴──────────┐
 │  Image Encoder   │    │   Prompt Encoder    │
 │  ViT-H (632M)    │    │ Sparse: embeddings  │
 │  MAE-pretrained  │    │  + positional enc   │
 │        │         │    │ Dense: convolutions │
 │        ▼         │    └──────────┬──────────┘
 │  64x64 feature   │               │
 │  map (256-dim)   │               │
 └────────┬─────────┘               │
          └────────────┬────────────┘
                       ▼
        ┌──────────────────────────────┐
        │   Lightweight Mask Decoder   │
        │    (2-layer transformer)     │
        │  prompt-to-image cross-attn  │
        │  image-to-prompt cross-attn  │
        │              │               │
        │              ▼               │
        │  4x upsample + dynamic MLP   │
        └──────────────┬───────────────┘
                       │  (~50ms on CPU)
                       ▼
           ┌────────────────────────┐
           │  3 Masks + IoU Scores  │
           │  (multi-granularity)   │
           └────────────────────────┘

SAM consists of three components designed to separate the expensive image processing from the interactive prompting:
Image Encoder: A Vision Transformer (ViT-H by default: 632M parameters, 16x16 patches) pretrained with Masked Autoencoders (MAE). The encoder processes a 1024x1024 input image and produces a 64x64 feature map with 256-dimensional embeddings. Crucially, this runs only once per image, enabling real-time interactive use.
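This separation is visible in the usage pattern of Meta's released segment_anything package: the encoder runs once in set_image, and each prompt afterward only invokes the decoder. A minimal sketch, where the checkpoint filename is the one distributed for ViT-H and the zero image is a stand-in for a real photo:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder image; in practice, load an RGB image
# (e.g., cv2.imread(...) followed by BGR -> RGB conversion).
image = np.zeros((768, 1024, 3), dtype=np.uint8)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The expensive ViT-H encoder runs exactly once, here:
predictor.set_image(image)

# Each subsequent prompt reuses the cached 64x64 embedding, so an
# interactive click only pays for the lightweight decoder.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) in pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
)
```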
Prompt Encoder: Handles two types of prompts. Sparse prompts (points, boxes, text) are encoded as learned embeddings plus positional encodings. Points use a learned embedding per foreground/background label plus a positional encoding of the point location. Boxes are represented as two points (top-left, bottom-right) with learned embeddings. Text prompts use CLIP's text encoder. Dense prompts (masks) are encoded via convolutions (two 2x2 stride-2 convolutions plus a 1x1 conv) and summed element-wise with the image embedding.
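Continuing the sketch above, the same predict call accepts a box prompt (XYXY corners, internally embedded as two corner points) and can take a previous low-resolution prediction back as a dense mask prompt; the coordinate values here are illustrative:

```python
# Box prompt: x0, y0, x1, y1 in pixel coordinates.
box = np.array([425, 600, 700, 875])
masks, scores, logits = predictor.predict(box=box, multimask_output=False)

# Dense prompt: feed the 1x256x256 low-res mask logits from the previous
# call back in alongside the same box for iterative refinement.
masks, scores, logits = predictor.predict(
    box=box,
    mask_input=logits,   # encoded by convs, summed with the image embedding
    multimask_output=False,
)
```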
Mask Decoder: A lightweight transformer decoder (two layers) that uses prompt-to-image and image-to-prompt cross-attention, followed by a dynamic mask prediction head. The decoder takes as input the image embedding, prompt embeddings, and a set of output tokens (one per predicted mask plus an IoU prediction token). After the transformer layers, the image embedding is upsampled 4x via transposed convolutions, and a dynamic linear classifier (MLP-generated per-mask weights) produces the final masks at 256x256 resolution. The entire decoder runs in ~50ms on a CPU.
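A heavily simplified PyTorch sketch of the two-way attention pattern; the class name is illustrative, and the real decoder additionally uses layer norms around every sub-block, re-injected positional encodings, and downsampled attention:

```python
import torch
import torch.nn as nn

class TwoWayAttentionSketch(nn.Module):
    """One decoder layer's attention pattern, simplified:
    token self-attention, then tokens attend to image features,
    then image features attend back to the tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor, image: torch.Tensor):
        # tokens: (B, T, 256) = output tokens + IoU token + prompt embeddings
        # image:  (B, 4096, 256) = flattened 64x64 feature map
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        tokens = tokens + self.token_to_image(tokens, image, image)[0]
        tokens = tokens + self.mlp(tokens)
        image = image + self.image_to_token(image, tokens, tokens)[0]
        return tokens, image
```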
Ambiguity-aware output: When a prompt is ambiguous (e.g., a single point could refer to a part, subpart, or whole object), SAM predicts 3 masks simultaneously at different granularity levels, each with a predicted IoU score. The highest-IoU mask is selected by default, but the multi-mask output enables downstream systems to choose the appropriate granularity.
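In the released API this corresponds to multimask_output=True; a brief sketch of the default selection rule, continuing from the snippet above:

```python
# Ask for all three granularity levels and keep the highest-scoring
# mask, which is SAM's default disambiguation rule.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,   # 3 masks, roughly whole / part / subpart
)
best_mask = masks[int(scores.argmax())]
```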
Training: SAM is trained with a combination of focal loss and dice loss on per-mask predictions. The model is trained to predict a valid mask for any prompt, where "valid" means it corresponds to at least one reasonable object or region. Training uses the SA-1B data with a mixture of point, box, and mask prompts simulated from ground truth.
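A sketch of this objective in PyTorch, assuming per-pixel mask logits and binary ground-truth targets of the same shape; the function names are illustrative, and the 20:1 focal-to-dice weighting is the ratio reported in the paper:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over per-pixel mask logits."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss; masks are flattened to (B, H*W)."""
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def mask_loss(logits, targets):
    # The paper weights focal and dice losses 20:1; during training, the
    # loss is backpropagated only through the lowest-loss of the 3 masks.
    return 20.0 * focal_loss(logits, targets) + dice_loss(logits, targets)
```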

Data engine: The annotation pipeline that created SA-1B operated in three stages (a sketch of stage 3 follows the list):
1. Assisted-manual: Professional annotators labeled masks in a browser-based tool powered by SAM, correcting the model's automatic suggestions. This produced 4.3M masks on 120K images.
2. Semi-automatic: SAM automatically detected confident masks, and annotators labeled the remaining unannotated objects, increasing mask diversity. This produced 5.9M additional masks on 180K images.
3. Fully automatic: SAM generated masks automatically from a 32x32 grid of point prompts per image, with NMS and quality filtering. This produced the bulk of the dataset: ~1.1B masks on 11M images.
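Stage 3 corresponds to the released SamAutomaticMaskGenerator. A minimal sketch, continuing with the sam and image objects from the earlier snippets; the threshold values shown are the library's defaults:

```python
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # the 32x32 point grid per image
    pred_iou_thresh=0.88,         # keep masks the IoU head is confident in
    stability_score_thresh=0.95,  # drop masks unstable to threshold shifts
)
records = mask_generator.generate(image)
# Each record is a dict with keys such as "segmentation",
# "predicted_iou", and "stability_score".
```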
Results

Zero-shot single-point segmentation: Human evaluators rated SAM's masks as higher quality than those from the strongest baseline (RITM) in 7 out of 9 experiments when using a single foreground point prompt. SAM's masks improved further with additional prompts.
| Task | Metric | SAM (zero-shot) | Best prior (task-specific) |
|---|---|---|---|
| Edge detection (BSDS500) | ODS | 76.8 | 78.8 (HED, trained) |
| Object proposals (LVIS) | AR@1000 | 59.3 | 63.0 (ViTDet-H) |
| Instance segmentation (LVIS) | AP | comparable | ViTDet (trained on LVIS) |
| Single-point valid mask | Human preference | 7/9 wins | RITM |
Key findings:
- SAM produces highly coherent object boundaries, even for objects not represented in its training data
- Performance scales with encoder size: ViT-H > ViT-L > ViT-B across all evaluations
- Multi-point prompting significantly improves mask quality (as expected), with diminishing returns after ~5-9 points
- Box prompts generally outperform single-point prompts and approach oracle-prompted performance
- On LVIS (1,203 categories), SAM's zero-shot AR@1000 of 59.3 does not exceed ViTDet-H's 63.0 overall, though SAM outperforms on medium/large objects and on rare/common categories; ViTDet-H's advantage on small and frequent objects reflects LVIS-specific training biases
- Automatic mask generation at scale (the 32x32 grid strategy) produces masks of high enough quality to train downstream models
Limitations & Open Questions
- No semantic understanding: SAM segments objects but does not classify them -- it produces masks without labels. Combining SAM with classification models (as in Grounded-SAM) is an active area
- Fine structures: Performance degrades on thin structures (bicycle spokes, fence wires) and highly occluded objects
- Real-time full pipeline: While the decoder is fast, the ViT-H image encoder takes ~0.15s per image on an A100, making truly real-time video segmentation challenging without model distillation (addressed by SAM 2 and EfficientSAM)
- Text prompts: The text-to-mask capability was trained but not extensively evaluated; it was noted as a proof of concept rather than a mature feature
- 3D and video: SAM operates on single images; extensions to temporally consistent video segmentation and to 3D scene segmentation are natural next steps (the former addressed by SAM 2 in 2024)
- Domain gaps: While zero-shot transfer is strong, performance on specialized domains (medical imaging, satellite imagery) benefits from fine-tuning, raising questions about the limits of foundation model generalization
Connections
Related papers in the wiki:
- An Image Is Worth 16X16 Words Transformers For Image Recognition At Scale -- ViT, the backbone architecture for SAM's image encoder
- Learning Transferable Visual Models From Natural Language Supervision -- CLIP, whose text encoder SAM uses for text prompts; both are vision foundation models trained on massive data
- Deep Residual Learning For Image Recognition -- ResNet, the prior dominant paradigm for vision backbones that ViT/SAM superseded
- Imagenet Classification With Deep Convolutional Neural Networks -- AlexNet, the beginning of the deep learning vision era that SAM's foundation model approach extends
- Language Models Are Few Shot Learners -- GPT-3, the NLP foundation model whose prompting paradigm directly inspired SAM's promptable segmentation task
- Attention Is All You Need -- the Transformer architecture underlying both SAM's encoder and decoder
- Foundation Models -- SAM as a seminal vision foundation model demonstrating that the foundation model paradigm extends to dense prediction
- Perception -- SAM's segmentation capabilities are relevant to perception pipelines, particularly for generating training data and interactive annotation