OccMamba: Semantic Occupancy Prediction with State Space Models

Overview

OccMamba is the first Mamba-based network for semantic occupancy prediction, replacing transformer architectures' quadratic complexity with Mamba's linear complexity while maintaining global context modeling. The core challenge is that Mamba operates on 1D sequences, but occupancy prediction requires 3D spatial reasoning.
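
The scale of the problem makes the complexity argument concrete. A quick back-of-the-envelope sketch (grid resolution here is an illustrative assumption, not a figure from the paper):

```python
# Rough cost comparison for a flattened occupancy grid. The grid
# resolution below is a hypothetical example, not the paper's setting.
X, Y, Z = 256, 256, 32           # assumed voxel grid resolution
N = X * Y * Z                    # sequence length after 3D -> 1D flattening

attention_ops = N ** 2           # self-attention scales quadratically in N
mamba_ops = N                    # a state space scan scales linearly in N

print(f"tokens: {N:,}")
print(f"quadratic / linear ratio: {attention_ops // mamba_ops:,}")
```

At roughly two million tokens, the quadratic term is what rules out vanilla attention over the full voxel grid, which is the gap Mamba's linear scan is meant to close.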

The paper introduces a height-prioritized 2D Hilbert expansion -- a novel 3D-to-1D reordering scheme that maximally preserves spatial locality when flattening 3D voxels into a sequence for Mamba processing. This is combined with a hierarchical Mamba module for multi-scale global context and a Local Context Processor for fine-grained details. OccMamba achieves +5.1% IoU and +4.3% mIoU over the previous state-of-the-art (Co-Occ) on OpenOccupancy, while reducing training memory by 38% and inference time by 65%.

Key Contributions

  • First Mamba-based occupancy network: Demonstrates that state space models can replace transformers for 3D semantic occupancy prediction with linear instead of quadratic complexity
  • Height-prioritized 2D Hilbert expansion: Novel 3D-to-1D serialization preserving spatial structure by first collapsing the height dimension then applying a 2D Hilbert space-filling curve, achieving 25.2% mIoU vs 24.5% for naive XYZ ordering
  • Hierarchical Mamba module: Multi-resolution encoder-decoder with downsampling/upsampling and skip connections for multi-scale global context aggregation
  • Local Context Processor: Overlapping patch division with local Mamba processing and 3D convolution reassembly for fine-grained semantic details
  • Significant efficiency gains: 38% training memory reduction, 65% inference time reduction over transformer-based alternatives

Architecture / Method

┌──────────────┐   ┌──────────────────┐
│  LiDAR       │   │  Multi-Camera     │
│  Voxelize +  │   │  ResNet + FPN +   │
│  Sparse Conv │   │  View Transform   │
└──────┬───────┘   └────────┬─────────┘
       │                    │
       └────────┬───────────┘
                │ concat (3D features)
                ▼
┌───────────────────────────────────────────────┐
│    Height-Prioritized 2D Hilbert Expansion     │
│  3D voxels ──► collapse Z ──► 2D Hilbert      │
│  curve ──► 1D sequence (preserves locality)    │
└───────────────────┬───────────────────────────┘
                    │
       ┌────────────▼────────────┐
       │  Hierarchical Mamba     │
       │  ┌────────────────────┐ │
       │  │ Mamba block (fine) │ │
       │  │      ↓ downsample  │ │
       │  │ Mamba block (mid)  │ │
       │  │      ↓ downsample  │ │
       │  │ Mamba block(coarse)│ │
       │  │      ↓ upsample    │ │
       │  │ + skip connections │ │
       │  └────────────────────┘ │
       └────────────┬────────────┘
                    │
       ┌────────────▼────────────┐
       │  Local Context Processor │
       │  Overlapping patches +   │
       │  local Mamba + 3D conv   │
       └────────────┬────────────┘
                    │
                    ▼
       ┌─────────────────────────┐
       │  Semantic Occupancy      │
       │  Predictions             │
       └─────────────────────────┘

OccMamba architecture overview

(Figure: Height-prioritized 2D Hilbert expansion)

(Figure: Hierarchical Mamba module details)

Multi-Modal Encoders:

  • LiDAR branch: voxelization + sparse convolution encoder
  • Image branch: ResNet with FPN + view transformation to 3D features
  • Feature fusion via channel concatenation

Height-Prioritized 2D Hilbert Expansion: The key innovation for adapting Mamba to 3D: first collapse the height dimension (least spatially varying in driving scenes), then apply a 2D Hilbert space-filling curve to the resulting BEV-like representation. This preserves more 3D spatial locality than direct 3D Hilbert curves or axis-aligned flattening.
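
The ordering can be sketched with the standard 2D Hilbert distance-to-coordinate mapping: walk the curve over the BEV plane and, at each (x, y) cell, emit every height level consecutively, so vertically stacked voxels stay adjacent in the 1D sequence. This is an illustrative reimplementation, not the paper's code:

```python
def hilbert_d2xy(order, d):
    """Map a distance d along a 2D Hilbert curve of side 2**order to (x, y).
    Standard iterative construction (see the well-known d2xy algorithm)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def height_prioritized_order(order, z_size):
    """Sketch of the height-prioritized expansion: traverse the 2D Hilbert
    curve over the BEV plane and emit all heights of each cell in a row."""
    side = 1 << order
    seq = []
    for d in range(side * side):
        x, y = hilbert_d2xy(order, d)
        for z in range(z_size):
            seq.append((x, y, z))
    return seq

# A 4x4 BEV grid with 2 height levels -> a 32-voxel sequence in which
# consecutive BEV cells are always spatially adjacent.
seq = height_prioritized_order(order=2, z_size=2)
print(seq[:4])
```

The locality guarantee of the Hilbert curve means every step between BEV cells moves to a neighboring cell, which is exactly the property that axis-aligned XYZ flattening gives up at each row boundary.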

Hierarchical Mamba Module: Encoder-decoder with multiple resolution levels, each containing Mamba blocks. Downsampling reduces spatial resolution; upsampling with skip connections recovers detail. Global context is captured at coarse levels with linear complexity.
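
The encoder-decoder shape can be sketched in a few lines. The mixer below is a stand-in for a real Mamba block (a causal cumulative mean, which is also a linear-time sequential operator); the resolution schedule and skip connections follow the description above, not the paper's exact implementation:

```python
import numpy as np

def mixer(x):
    """Stand-in for a Mamba block: causal cumulative mean over the sequence."""
    return np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]

def downsample(x):
    """Halve the sequence length by averaging adjacent pairs."""
    return x.reshape(len(x) // 2, 2, x.shape[1]).mean(axis=1)

def upsample(x):
    """Double the sequence length by nearest-neighbor repetition."""
    return np.repeat(x, 2, axis=0)

def hierarchical(x):
    fine = mixer(x)                  # fine level
    mid = mixer(downsample(fine))    # mid level
    coarse = mixer(downsample(mid))  # coarse level: cheap global context
    mid = mid + upsample(coarse)     # decoder path with skip connections
    fine = fine + upsample(mid)
    return fine

x = np.random.default_rng(0).normal(size=(8, 4))  # (sequence, channels)
y = hierarchical(x)
print(y.shape)  # same resolution as the input, now globally mixed
```

The point of the pyramid is that the coarse level sees the whole scene over a much shorter sequence, while the skips carry fine detail back to full resolution.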

Local Context Processor: Overlapping patch division processes local regions independently with Mamba, then reassembles via 3D convolution. This captures fine-grained semantic details that global processing may miss.
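
The overlap-and-reassemble pattern can be illustrated on a 1D feature sequence. The local operation and the overlap-averaging fold below are simple stand-ins (the paper uses local Mamba blocks and a 3D convolution for reassembly):

```python
import numpy as np

def local_context(x, patch=4, stride=2):
    """Split x into overlapping windows, process each locally, and fold
    the results back with overlap averaging (stand-in for 3D conv reassembly)."""
    out = np.zeros_like(x)
    count = np.zeros((len(x), 1))
    for start in range(0, len(x) - patch + 1, stride):
        window = x[start:start + patch]
        processed = window - window.mean(axis=0)  # stand-in local mixer
        out[start:start + patch] += processed
        count[start:start + patch] += 1
    return out / np.maximum(count, 1)

x = np.arange(16, dtype=float).reshape(8, 2)  # (sequence, channels)
y = local_context(x)
print(y.shape)
```

Because stride is smaller than the patch size, each position is covered by more than one window, which smooths seams between patches when they are folded back together.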

Training loss: Cross-entropy + Lovasz-Softmax + geometric affinity + semantic affinity + depth supervision (5 components).
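
Structurally, the objective is a weighted sum of the five terms. A minimal sketch with placeholder scalar losses and a hypothetical equal weighting (the paper's actual weights are not reproduced here):

```python
def total_loss(components, weights):
    """Weighted sum of the individual supervision terms."""
    return sum(w * components[name] for name, w in weights.items())

components = {          # placeholder scalar loss values, for illustration only
    "cross_entropy": 1.2,
    "lovasz": 0.8,
    "geo_affinity": 0.3,
    "sem_affinity": 0.4,
    "depth": 0.5,
}
weights = {name: 1.0 for name in components}  # hypothetical equal weighting
print(total_loss(components, weights))
```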

Results

Benchmark             Previous SOTA   Improvement
OpenOccupancy IoU     Co-Occ          +5.1% (absolute)
OpenOccupancy mIoU    Co-Occ          +4.3% (absolute)
SemanticKITTI         Co-Occ          Outperforms (new SOTA)
SemanticPOSS          -               Highest among LiDAR-only methods

Efficiency (OccMamba-128 vs. transformer baseline):

  • Training memory: -38%
  • Inference memory: -44%
  • Inference time: -65%

Ablation on reordering schemes:

  • Height-prioritized 2D Hilbert: 25.2% mIoU (best)
  • 3D Hilbert curve: 24.8%
  • ZXY sequence: 24.7%
  • XYZ sequence: 24.5%

Limitations

  • Uses both LiDAR and camera inputs; camera-only performance is not evaluated, limiting applicability to camera-only systems (Tesla-style)
  • Hilbert curve ordering is fixed and handcrafted; a learned serialization might better adapt to scene statistics
  • Mamba's sequential nature limits parallelism during inference compared to transformers, potentially affecting latency on modern GPUs
  • Evaluated on static occupancy; temporal extension (streaming/forecasting) is not explored

Connections