
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

📄 Read on arXiv

Overview

Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- including the vast majority of voxels that are empty. SparseOcc rethinks this by operating exclusively on non-empty (occupied) voxels in a sparse latent space, dramatically reducing computational cost while also improving prediction quality. The core observation is that real-world driving scenes are inherently sparse: typically only ~20% of voxels in the perception volume are occupied after LSS lifting, leaving roughly 80% empty, yet dense methods expend equal computation on empty and occupied regions.

The method introduces three novel components that work together: (1) a Sparse Latent Diffuser that propagates features among sparse voxels using decomposed orthogonal sparse convolutions along the X, Y, and Z axes, avoiding full 3D convolutions; (2) a Sparse Feature Pyramid that constructs multi-scale representations from the sparse voxels, combining coarse global context with fine-grained local detail; and (3) a Sparse Transformer Head that performs contextual reasoning on sparse data through self-attention among non-empty voxels. The sparse voxels are stored in COO (coordinate) format throughout, enabling memory-efficient processing.
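
As a concrete picture of that storage choice, here is a minimal COO sketch (illustrative sizes and shapes, not the paper's code): occupied voxels live in two aligned tensors, so memory scales with their count rather than with the full volume.

```python
import torch

# COO ("coordinate") format: one (x, y, z) index row and one feature vector
# per occupied voxel. Memory scales with N (occupied), not X*Y*Z (volume).
N, C = 20_000, 128                       # illustrative sizes
coords = torch.randint(0, 200, (N, 3))   # [N, 3] integer voxel coordinates
feats = torch.randn(N, C)                # [N, C] per-voxel feature vectors
```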

On the nuScenes-Occupancy benchmark, SparseOcc achieves a 74.9% reduction in FLOPs (from 1810G to 455G) and a 31.6–40.9% reduction in memory (13G vs. 19–21G) compared to dense baselines, while improving mIoU from 12.8% to 14.1% (with a geometric IoU of 21.8%); the method is additionally evaluated on SemanticKITTI. This is a notable result: the sparse method is not just more efficient but also more accurate, likely because removing empty-space processing frees the network to focus representational capacity on the occupied regions that actually matter for downstream planning. The approach demonstrates that sparse processing is not merely a computational shortcut but a fundamentally better inductive bias for occupancy prediction.

Key Contributions

  • Sparse latent representation for occupancy: First method to process only non-empty voxels throughout the entire occupancy prediction pipeline, using COO-format sparse tensors for memory efficiency
  • Sparse Latent Diffuser: A two-part module pairing a Sparse Completion Block, which propagates features to adjacent empty regions via sequential orthogonal convolutions (k×k×1, k×1×k, 1×k×k), with a Contextual Aggregation Block, whose decomposed sparse convolutions reduce the parameter count from O(k³) to O(3k²) or O(4k²) (see the weight-count sketch after this list)
  • Multi-scale Sparse Feature Pyramid: Constructs coarse-to-fine sparse feature hierarchies, combining global scene context (from coarser levels) with local geometric detail (from finer levels)
  • Sparse Transformer Head: Self-attention restricted to non-empty voxels for contextual reasoning, avoiding the quadratic cost of attending over the full volume
  • Simultaneous efficiency and accuracy gains: 74.9% FLOP reduction with improved mIoU (12.8% to 14.1%), showing that sparsity is a better inductive bias, not just a compression trick
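
The O(k³) → O(3k²) claim in the Diffuser bullet is easy to check with a quick weight count. The sketch below is my arithmetic, not the paper's code; it compares one full 3D kernel against the three orthogonal kernels.

```python
def full_conv3d_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights in one full k x k x k 3D convolution."""
    return c_in * c_out * k ** 3

def decomposed_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights in three orthogonal convs: (k,k,1), (k,1,k), (1,k,k)."""
    return 3 * c_in * c_out * k ** 2

# k = 5, 128 channels: 2,048,000 vs 1,228,800 weights (a 40% cut). The gap
# widens with k, since k^3 grows faster than 3k^2; at k = 3 the raw counts
# happen to tie (27 vs 27 per channel pair), but the sequential factorization
# still enlarges the effective receptive field.
print(full_conv3d_weights(5, 128, 128), decomposed_weights(5, 128, 128))
```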

Architecture / Method

SparseOcc architecture overview

  Multi-Camera Images
           │
           ▼
  ┌──────────────────┐
  │ 2D Backbone      │  (ResNet-50)
  │ + View Transform │
  └────────┬─────────┘
           │ initial 3D volume
           ▼
  ┌──────────────────────────────────────────────┐
  │ Mask Predictor: identify occupied voxels     │
  │ Retain ~20% ──► COO sparse tensor            │
  └────────┬─────────────────────────────────────┘
           │ sparse voxels (coords, features)
           ▼
  ┌──────────────────────────────────────────────┐
  │ Sparse Latent Diffuser                       │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
  │  │ X-axis   │  │ Y-axis   │  │ Z-axis   │    │
  │  │ Sparse   │  │ Sparse   │  │ Sparse   │    │
  │  │ Conv     │  │ Conv     │  │ Conv     │    │
  │  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
  │       └────────────┼──────────────┘          │
  │                    │ combine                 │
  └────────────────────┼─────────────────────────┘
                       │
                       ▼
  ┌──────────────────────────────────────────────┐
  │ Sparse Feature Pyramid                       │
  │ Coarse (global context) ◄──► Fine (detail)   │
  │ Top-down + lateral connections (all sparse)  │
  └────────────────────┬─────────────────────────┘
                       │
                       ▼
  ┌──────────────────────────────────────────────┐
  │ Sparse Transformer Head                      │
  │ Self-attention among non-empty voxels only   │
  │ ──► Semantic occupancy predictions           │
  └──────────────────────────────────────────────┘

The SparseOcc pipeline processes multi-view camera images through four stages:

1. Initial Sparse Representation: Multi-view images are processed through a standard 2D backbone (e.g., ResNet-50) and view transformer to produce an initial 3D feature volume. A mask prediction module identifies which voxels are likely occupied, and only these voxels are retained in COO format -- a sparse tensor storing (coordinates, features) pairs for non-empty locations only. This initial sparsification step is critical: it discards the ~80% of voxels that are empty before any expensive 3D processing begins.
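
A minimal sketch of this sparsification step, assuming a top-k selection at a fixed keep ratio (the paper's exact masking rule may differ):

```python
import torch

def sparsify(volume: torch.Tensor, occ_logits: torch.Tensor, keep: float = 0.2):
    """Keep only the voxels the mask predictor deems likely occupied, in COO form.

    volume:     [C, X, Y, Z] dense features from the view transformer
    occ_logits: [X, Y, Z] per-voxel occupancy scores from the mask predictor
    """
    C, X, Y, Z = volume.shape
    flat = occ_logits.reshape(-1)
    k = int(keep * flat.numel())
    idx = flat.topk(k).indices                 # flat indices of kept voxels
    coords = torch.stack((idx // (Y * Z), (idx // Z) % Y, idx % Z), dim=-1)
    feats = volume.reshape(C, -1)[:, idx].t()  # [k, C] features, COO-aligned
    return coords, feats                       # [k, 3], [k, C]
```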

2. Sparse Latent Diffuser: The sparse voxel features are refined through two sub-blocks. The Sparse Completion Block propagates features into adjacent empty voxels via sequential orthogonal convolutions (k×k×1, then k×1×k, then 1×k×k), extending the sparse representation to neighboring space. The Contextual Aggregation Block then applies decomposed sparse convolutions across all non-empty voxels, reducing the parameter count from O(k³) to O(3k²) or O(4k²) compared to full 3D convolutions. Together, these blocks enable efficient feature propagation while avoiding the cost of densifying the representation.
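
To make the decomposition concrete, here is a sketch of the sequential orthogonal convolutions. The real model runs these as sparse convolutions over COO voxels; plain nn.Conv3d on a dense tensor is used here purely to illustrate the kernel factorization.

```python
import torch.nn as nn

class OrthogonalConv3d(nn.Module):
    """Three axis-aligned kernels in sequence instead of one full k^3 kernel."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        p = k // 2  # padding that preserves the spatial shape
        self.xy = nn.Conv3d(channels, channels, (k, k, 1), padding=(p, p, 0))
        self.xz = nn.Conv3d(channels, channels, (k, 1, k), padding=(p, 0, p))
        self.yz = nn.Conv3d(channels, channels, (1, k, k), padding=(0, p, p))

    def forward(self, x):                       # x: [B, C, X, Y, Z]
        return self.yz(self.xz(self.xy(x)))     # sequential, as in the Completion Block
```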

3. Sparse Feature Pyramid: To capture multi-scale context, the sparse voxels are organized into a feature pyramid with multiple resolution levels. At coarser levels, sparse voxels are grouped and pooled to provide global scene understanding (e.g., road layout, building outlines). At finer levels, the original sparse voxels retain detailed local geometry. Features from different scales are fused through top-down and lateral connections, all operating in sparse format.
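
A sketch of one level transition in such a pyramid, assuming a 2x coordinate stride and mean pooling of voxels that collide in the coarser grid (both are my assumptions for illustration):

```python
import torch

def downsample_coo(coords: torch.Tensor, feats: torch.Tensor):
    """One pyramid level coarser: halve coordinates, mean-pool colliding voxels.

    coords: [N, 3] integer voxel indices; feats: [N, C] voxel features.
    """
    coarse = coords // 2
    uniq, inv = torch.unique(coarse, dim=0, return_inverse=True)  # [M, 3], [N]
    pooled = torch.zeros(uniq.size(0), feats.size(1), dtype=feats.dtype)
    pooled.index_add_(0, inv, feats)            # sum features per coarse cell
    counts = torch.bincount(inv, minlength=uniq.size(0)).clamp(min=1)
    return uniq, pooled / counts.unsqueeze(1)   # mean pooling
```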

4. Sparse Transformer Head: The final prediction head applies self-attention among the sparse voxels. Because attention is restricted to the ~20% of voxels that are non-empty, the quadratic cost of self-attention is dramatically reduced compared to attending over the full dense volume. This enables rich contextual reasoning -- e.g., recognizing that a set of sparse voxels forms a vehicle rather than isolated noise -- while remaining computationally feasible.
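
A minimal sketch of attention restricted to occupied voxels, using stock PyTorch attention; the head count, channel width, and positional-encoding treatment are assumptions.

```python
import torch
import torch.nn as nn

# Attention over occupied voxels only: cost is O(N^2) in the number of
# non-empty voxels N, not in the full volume V. With N ~ 0.2 * V, that is
# (0.2V)^2 = 0.04 V^2, i.e. ~25x fewer query-key pairs than dense attention.
C = 128                                       # illustrative channel width
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

# In practice the voxel coordinates would also be embedded as positional
# encodings and added to the features (an assumption about the exact design).
feats = torch.randn(1, 2048, C)               # [1, N, C] occupied-voxel features
out, _ = attn(feats, feats, feats)            # contextual reasoning among voxels
```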

Qualitative comparison with dense methods

The sparse processing has two important qualitative effects visible in the results: (1) SparseOcc produces cleaner predictions for continuous structures like roads and buildings because it avoids the hallucination artifacts that dense methods produce in empty regions; and (2) the method scales more gracefully to larger perception ranges because computation grows with the number of occupied voxels rather than the volume of the perception region.

Results

Quantitative results

Key results on nuScenes-Occupancy (SparseOcc is also evaluated on SemanticKITTI in the paper):

Method          IoU     mIoU     FLOPs (relative)          Notes
SparseOcc       21.8%   14.1%    25.1% (74.9% reduction)   Sparse processing throughout; 13G memory
Dense baseline  --      12.8%    100%                      Full 3D volume processing; 19-21G memory
OccFormer       --      12.32%   ~100%                     Dual-path transformer baseline
FlashOcc        --      ~12-13%  ~25-33%                   2D-only with C2H reshape

The most striking result is that SparseOcc improves accuracy while reducing compute -- a rare combination. Ablation studies validate each component:

  • Sparse Completion Block (Diffuser): Removing the Sparse Completion Block reduces mIoU by ~1.1%, with the Contextual Aggregation Block also outperforming standard 3D convolutions
  • Sparse Feature Pyramid: Single-scale sparse processing (no pyramid) drops mIoU by ~0.8%, confirming the importance of multi-scale context even in sparse settings
  • Sparse Transformer Head: Replacing self-attention with CNN-based decoding on sparse voxels reduces mIoU by ~0.7%, demonstrating the value of inter-voxel contextual reasoning

The qualitative results show that SparseOcc produces notably fewer hallucinations in empty regions and better preserves continuous surface geometry compared to dense baselines.

Limitations & Open Questions

  • Initial mask quality: The approach depends on an accurate initial sparsification step; voxels the mask predictor misses cannot be recovered downstream, making mask quality a hard bottleneck.
  • Irregular sparsity patterns: COO-format sparse tensors are less hardware-friendly than dense tensors on standard GPUs. While FLOPs are reduced, wall-clock speedup may be less dramatic due to irregular memory access patterns and lack of optimized sparse CUDA kernels for all operations.
  • Temporal integration: The paper primarily addresses single-frame occupancy. Extending sparse representations to temporal fusion (aligning and merging sparse voxels across frames) is non-trivial because the set of occupied locations changes over time.
  • Scalability to larger ranges: While sparse processing scales better than dense in principle, the initial view transformer still operates in dense mode. A fully sparse pipeline from images to output remains an open challenge.
  • Benchmark limitations: The nuScenes-Occupancy benchmark has known limitations in ground truth quality, particularly for distant and thin structures. Performance differences in the 12-15% mIoU range should be interpreted cautiously.

Connections

Related papers in the wiki: