GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Overview

GaussTR is a Gaussian-based Transformer framework that achieves zero-shot semantic occupancy prediction without any 3D annotations. The key idea is to combine sparse 3D Gaussian representations with 2D Vision Foundation Model (VFM) alignment through differentiable splatting: predict a set of 3D Gaussians in a feed-forward pass, splat them back to 2D views, and train by aligning the rendered features with pre-trained foundation model outputs (CLIP, DINO, Metric3D V2).

This self-supervised approach eliminates the expensive 3D annotation requirement that limits supervised occupancy methods. At inference, semantic labels are assigned via similarity between Gaussian features and text prototype embeddings from the foundation model, enabling open-vocabulary classification of novel categories. GaussTR achieves 12.27 mIoU zero-shot on Occ3D-nuScenes (23% improvement over prior self-supervised methods) while using only 3% of the scene representation parameters compared to voxel-based methods and reducing training time by 40%.
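
The splatting step is standard differentiable Gaussian rasterization, except that feature vectors are rendered instead of RGB. A minimal sketch of the per-pixel compositing rule (assumed to follow standard splatting; projection and footprint evaluation are omitted):

```python
import torch

def composite_features(feats: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of per-Gaussian features at one pixel.

    feats:  (K, C) features of the K Gaussians covering the pixel, depth-sorted
    alphas: (K,)   their opacities after evaluating the 2D Gaussian footprint
    """
    # Transmittance T_k = prod_{j<k} (1 - a_j); blend weight w_k = a_k * T_k
    trans = torch.cumprod(torch.cat([alphas.new_ones(1), 1 - alphas[:-1]]), dim=0)
    weights = alphas * trans
    return (weights.unsqueeze(-1) * feats).sum(dim=0)  # rendered (C,) feature
```

Because this blend is differentiable, gradients from the 2D feature-alignment loss flow back into every Gaussian parameter.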

Key Contributions

  • Self-supervised 3D learning via foundation model alignment: Bridges 2D VFM priors to 3D through differentiable Gaussian splatting, eliminating the need for 3D annotations
  • Open-vocabulary occupancy prediction: Zero-shot semantic classification via vision-language alignment -- can recognize categories never seen during training
  • Extreme parameter efficiency: Uses only 3% of the scene representation parameters of dense voxel-based methods
  • Training efficiency: 40% reduction in training time (~12 hours on 8 A800 GPUs for 1000 scenes)
  • Sparse Gaussian representation: Replaces dense voxel grids with learnable 3D Gaussians, each parameterized by center, scaling, rotation, density, and feature vector (see the sketch after this list)
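
A concrete sketch of that per-Gaussian parameterization; the dimensions are assumptions based on standard 3D Gaussian splatting, not the paper's exact configuration:

```python
from dataclasses import dataclass
import torch

@dataclass
class Gaussians:
    """A set of N sparse scene Gaussians (assumed layout)."""
    centers:   torch.Tensor  # (N, 3) 3D positions
    scales:    torch.Tensor  # (N, 3) per-axis extents
    rotations: torch.Tensor  # (N, 4) unit quaternions
    densities: torch.Tensor  # (N, 1) opacities in [0, 1]
    features:  torch.Tensor  # (N, C) VFM-aligned semantic features
```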

Architecture / Method

[Figure: GaussTR framework overview]

[Figure: Performance vs. efficiency comparison]

                       GaussTR Pipeline

┌───────────┐     ┌───────────────────────────┐
│ Multi-cam │────►│ Pre-trained VFMs          │
│ Images    │     │ (CLIP, DINO, Metric3D V2) │
└───────────┘     └─────────────┬─────────────┘
                                │
              Multi-scale 2D features + depth maps
                                │
┌──────────────────┐            │
│ Learnable        │            │
│ Gaussian Queries │            │
└────────┬─────────┘            │
         │                      │
         ▼                      ▼
┌───────────────────────────────────────┐
│ Transformer Decoder                   │
│ ┌───────────────────────────────────┐ │
│ │ Deformable Cross-Attention        │ │
│ │ (aggregate 2D features)           │ │
│ ├───────────────────────────────────┤ │
│ │ Global Self-Attention             │ │
│ │ (3D positional encodings)         │ │
│ ├───────────────────────────────────┤ │
│ │ Gaussian Head (MLP):              │ │
│ │ center, scale, rot, density, feat │ │
│ └───────────────────────────────────┘ │
└──────────────────┬────────────────────┘
                   │
          Predicted 3D Gaussians
                   │
     ┌─────────────┴─────────────────────┐
  (Training)                        (Inference)
┌──────────────────────────┐  ┌────────────────────────────┐
│ Diff. Splatting ──►      │  │ Text prototype embeddings  │
│ 2D rendered features     │  │ (from VFM)                 │
│           │              │  │             │              │
│           ▼              │  │             ▼              │
│ PCA + Cosine sim loss    │  │ Cosine sim ──► semantic    │
│ vs VFM features          │  │ logits (zero-shot)         │
│ + Depth loss (SILog)     │  │             │              │
└──────────────────────────┘  │             ▼              │
                              │ Voxelize ──► Occ Grid      │
                              └────────────────────────────┘

Input Processing:
  • Multi-view images processed through pre-trained VFMs (CLIP, DINO variants, Metric3D V2)
  • Extraction of multi-scale 2D features and dense depth maps
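
A shape-level sketch of this stage; the sizes are illustrative assumptions, and the random tensors stand in for frozen VFM outputs:

```python
import torch

N_cam, H, W, C = 6, 224, 400, 768      # cameras, image size, feature dim (assumed)
images = torch.rand(N_cam, 3, H, W)    # surround-view RGB input

# Stand-ins for frozen VFM outputs (CLIP/DINO features, Metric3D V2 depth):
feats_2d = torch.rand(N_cam, C, H // 14, W // 14)   # patch-level feature maps
depth    = torch.rand(N_cam, 1, H, W) * 50.0        # dense metric depth in meters
```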

Gaussian Prediction (Transformer Decoder):
  • Learnable Gaussian queries decoded via deformable cross-attention for 2D feature aggregation
  • Global self-attention with 3D positional encodings for inter-Gaussian reasoning
  • MLP-based Gaussian head predicts: 3D center, scaling, rotation, density, feature vector
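
A minimal, self-contained sketch of this decoding loop. Vanilla multi-head attention stands in for deformable attention, the multi-view features are flattened, and all sizes are assumptions, so this illustrates the data flow rather than the paper's exact modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianDecoderSketch(nn.Module):
    def __init__(self, num_queries=300, dim=256, feat_dim=768, out_feat=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.feat_proj = nn.Linear(feat_dim, dim)       # project 2D features
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.pos_enc = nn.Linear(3, dim)                # 3D positional encoding
        # Head output: center(3) + scale(3) + rotation(4) + density(1) + feature
        self.head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 3 + 3 + 4 + 1 + out_feat),
        )

    def forward(self, feats_2d, centers_prev):
        """feats_2d: (B, L, feat_dim) flattened multi-view features;
        centers_prev: (B, Q, 3) current 3D center estimates."""
        B = feats_2d.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        kv = self.feat_proj(feats_2d)
        q, _ = self.cross_attn(q, kv, kv)               # aggregate 2D evidence
        q = q + self.pos_enc(centers_prev)              # inject 3D positions
        q, _ = self.self_attn(q, q, q)                  # inter-Gaussian reasoning
        out = self.head(q)
        center, scale, rot, density, feat = torch.split(
            out, [3, 3, 4, 1, out.shape[-1] - 11], dim=-1)
        # Constrain each parameter to its valid range
        return (center, scale.exp(), F.normalize(rot, dim=-1),
                density.sigmoid(), feat)
```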

Self-Supervised Training:
  1. Differentiable splatting renders predicted Gaussians back to 2D views
  2. PCA reduces feature dimensionality for tractable comparison
  3. Three loss components: feature alignment (cosine similarity with VFM features), depth supervision (SILog + L1 with Metric3D), optional segmentation loss
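
A sketch of the first two loss terms under those definitions: cosine alignment on (already PCA-reduced) rendered features, plus the standard scale-invariant log depth error with an L1 term. The weights `lam` and `w_depth` are assumptions:

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(rendered, target):
    """Cosine-similarity alignment between rendered and VFM features, shape (..., C)."""
    return (1 - F.cosine_similarity(rendered, target, dim=-1)).mean()

def silog_loss(pred_depth, gt_depth, lam=0.85, eps=1e-6):
    """Scale-invariant log depth error (Eigen et al.); lam is an assumed weight."""
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2 + eps)

def total_loss(rendered_feats, vfm_feats, pred_depth, vfm_depth, w_depth=1.0):
    # w_depth is an assumed balancing weight; the optional segmentation loss is omitted
    return (feature_alignment_loss(rendered_feats, vfm_feats)
            + w_depth * (silog_loss(pred_depth, vfm_depth)
                         + F.l1_loss(pred_depth, vfm_depth)))
```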

Open-Vocabulary Inference:
  • Generate text prototype embeddings from the foundation model for target categories
  • Compute semantic logits via cosine similarity between Gaussian features and text prototypes
  • Voxelize Gaussians for the final occupancy grid output
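
A sketch of these inference steps. The text prototypes are assumed to come from the foundation model's text encoder; the grid extents match the coverage quoted in the Results section; and the voxelization is simplified to nearest-voxel rasterization of Gaussian centers (the real procedure would account for each Gaussian's scale and density):

```python
import torch
import torch.nn.functional as F

def zero_shot_labels(gauss_feats, text_protos):
    """gauss_feats: (N, C) Gaussian features; text_protos: (K, C) text prototypes."""
    logits = F.normalize(gauss_feats, dim=-1) @ F.normalize(text_protos, dim=-1).T
    return logits.argmax(dim=-1)                       # (N,) per-Gaussian class ids

def voxelize(centers, labels, extent=(80.0, 80.0, 6.4), res=0.4, empty_id=-1):
    """Scatter labeled Gaussian centers into an ego-centered occupancy grid."""
    dims = tuple(int(e / res) for e in extent)         # (200, 200, 16)
    grid = torch.full(dims, empty_id, dtype=torch.long)
    idx = ((centers + centers.new_tensor(extent) / 2) / res).floor().long()
    ok = ((idx >= 0) & (idx < idx.new_tensor(dims))).all(dim=-1)  # in-bounds mask
    i = idx[ok]
    grid[i[:, 0], i[:, 1], i[:, 2]] = labels[ok]
    return grid
```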

Results

Metric                                   Value
Zero-shot mIoU (Occ3D-nuScenes)          12.27
Improvement over prior self-supervised   +2.33 mIoU (23%)
Training time reduction                  40%
Parameter efficiency vs voxel methods    3%
Training duration                        ~12 hours (8x A800 GPUs)
Spatial coverage                         80 m x 80 m x 6.4 m at 0.4 m resolution
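
As a quick check, the quoted coverage implies the standard Occ3D-nuScenes grid shape (assuming an ego-centered range):

```python
extent = (80.0, 80.0, 6.4)                  # meters (x, y, z)
res = 0.4                                   # meters per voxel
print(tuple(int(e / res) for e in extent))  # (200, 200, 16) -> 640,000 voxels
```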

Foundation model ablation:
  • Talk2DINO: 12.27 mIoU (best)
  • FeatUp: 11.70 mIoU

Category strengths: Vehicles, trucks, and man-made structures perform well; small objects and flat surfaces (e.g., sidewalks) are challenging due to occlusion and limited geometric signal.

Limitations

  • Zero-shot mIoU of 12.27 remains far below supervised methods (GaussianFormer-2 achieves 20.82); the self-supervised advantage is annotation-free training, not raw performance
  • Dependent on quality of 2D foundation models; performance varies significantly with VFM choice (Talk2DINO vs FeatUp)
  • Small objects and flat surfaces remain challenging -- the Gaussian representation struggles with thin/flat geometry
  • No temporal modeling; single-frame prediction without history aggregation
  • Open-vocabulary capability limited by the foundation model's vocabulary and feature quality

Connections