ESC

📄 Read on arXiv

Overview

GaussianFormer introduces a fundamentally different scene representation for 3D semantic occupancy prediction: instead of dense voxel grids, scenes are modeled as sparse sets of 3D semantic Gaussians. Each Gaussian encodes position (mean), shape (covariance), and semantic features, enabling adaptive modeling where large Gaussians cover empty regions and small Gaussians capture fine object details. A transformer-based decoder iteratively refines Gaussian properties through self-encoding (local 3D sparse convolutions) and cross-attention to multi-view image features. The sparse Gaussians are then splatted into dense voxel occupancy predictions via efficient Gaussian-to-voxel aggregation. The key result: GaussianFormer uses only 17.8% - 24.8% of the memory required by existing methods, trading modest accuracy (~2% mIoU) for dramatic efficiency gains.

Key Contributions

  • Sparse 3D semantic Gaussian representation: Replaces dense voxel grids with adaptive Gaussians that allocate representational capacity according to scene complexity
  • Transformer-based Gaussian refinement: Iterative decoder with 3D sparse convolution self-encoding and deformable cross-attention to multi-view image features across 6 refinement blocks
  • Gaussian-to-voxel splatting: Efficient conversion from sparse Gaussians to dense occupancy by aggregating weighted semantic contributions from nearby Gaussians
  • Memory efficiency: 5-6x memory reduction vs. dense voxel methods with competitive accuracy

Architecture / Method

Overview

Representation Comparison

Architecture

┌──────────────────────────────────────────────────────────────────┐
                    GaussianFormer Pipeline                        
                                                                   
  ┌──────────┐    ┌────────────────┐                               
   Surround  │───►│ ResNet-50 +    │──► Multi-scale 2D features   
   Cameras        FPN                                           
  └──────────┘    └───────┬────────┘                               
                                                                  
  ┌────────────────────┐                                          
   Learnable Gaussian                                            
   Queries (20K)                                                 
   (mean, cov, sem)                                              
  └─────────┬──────────┘                                          
                                                                  
                                                                  
  ┌─────────────────────────────────────────┐                      
    Iterative Decoder (x6 blocks)                                 
    ┌────────────────────────────────────┐                        
     1. Self-encoding: 3D sparse conv                           
        (local Gaussian interactions)                            
    ├────────────────────────────────────┤                        
     2. Image cross-attention:                                  
        3D ref pts ──► deformable attn                          
        on multi-view feature maps                              
    ├────────────────────────────────────┤                        
     3. MLP refinement:                                         
        Update mean, cov, semantics                             
    └────────────────────────────────────┘                        
  └─────────────────────┬───────────────────┘                      
                                                                  
              Refined 3D Gaussians                                 
                                                                  
  ┌─────────────────────▼───────────────────┐                      
    Gaussian-to-Voxel Splatting                                   
    Each voxel aggregates weighted semantic                       
    contributions from nearby Gaussians                           
    (weight = Gaussian PDF at voxel center)                       
  └─────────────────────┬───────────────────┘                      
                                                                  
              Dense Occupancy Grid                                 
└──────────────────────────────────────────────────────────────────┘

The pipeline:

  1. Image Encoder: ResNet-50 with FPN extracts multi-scale features from surround-view cameras
  2. Gaussian Initialization: Learnable 3D Gaussian parameters (mean, covariance, semantic logits) initialized in the scene volume
  3. Iterative Refinement (6 decoder blocks): - Self-encoding: 3D sparse convolutions capture local Gaussian-to-Gaussian interactions - Image cross-attention: 3D reference points projected onto multi-view feature maps via deformable attention - Refinement: MLP updates mean, covariance, and semantic logits
  4. Gaussian-to-Voxel Splatting: Each voxel aggregates weighted semantic contributions from Gaussians whose influence reaches it, with weights determined by Gaussian probability density at the voxel center

Splatting Process

Results

Method mIoU Memory (vs. GaussianFormer)
SurroundOcc 20.30% not in paper efficiency table
OccFormer 19.03% not in paper efficiency table
GaussianFormer 19.10% 1.00x (baseline)

Qualitative Results

GaussianFormer also achieves 18.1% mAP on 3D detection as a byproduct. The Gaussian representation naturally provides object-level grouping without explicit detection heads.

Limitations

  • ~2% mIoU gap vs. dense methods (SurroundOcc, OccFormer) -- the efficiency comes at an accuracy cost
  • Gaussian initialization is learned but static; adaptive initialization based on scene complexity could improve results
  • Evaluated primarily on nuScenes and KITTI-360; generalization to other domains not demonstrated
  • The splatting operation assumes Gaussians have limited spatial extent; very large objects may require many Gaussians

Connections