GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Overview

GaussTR is a Gaussian-based Transformer framework that achieves zero-shot semantic occupancy prediction without any 3D annotations. The key idea is to combine sparse 3D Gaussian representations with 2D Vision Foundation Model (VFM) alignment through differentiable splatting: predict a set of 3D Gaussians in a feed-forward pass, splat them back to 2D views, and train by aligning the rendered features with pre-trained foundation model outputs (CLIP, DINO, Metric3D V2).

This self-supervised approach eliminates the expensive 3D annotation requirement that limits supervised occupancy methods. At inference, semantic labels are assigned via similarity between Gaussian features and text prototype embeddings from the foundation model, enabling open-vocabulary classification of novel categories. GaussTR achieves 12.27 mIoU zero-shot on Occ3D-nuScenes (23% improvement over prior self-supervised methods) while using only 3% of the scene representation parameters compared to voxel-based methods and reducing training time by 40%.
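
The splatting step is standard differentiable Gaussian rasterization, except that feature vectors are rendered instead of RGB. A minimal sketch of the per-pixel compositing rule (assumed to follow standard splatting; projection and footprint evaluation are omitted):

```python
import torch

def composite_features(feats: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of per-Gaussian features at one pixel.

    feats:  (K, C) features of the K Gaussians covering the pixel, depth-sorted
    alphas: (K,)   their opacities after evaluating the 2D Gaussian footprint
    """
    # Transmittance T_k = prod_{j<k} (1 - a_j); blend weight w_k = a_k * T_k
    trans = torch.cumprod(torch.cat([alphas.new_ones(1), 1 - alphas[:-1]]), dim=0)
    weights = alphas * trans
    return (weights.unsqueeze(-1) * feats).sum(dim=0)  # rendered (C,) feature
```

Because this blend is differentiable, gradients from the 2D feature-alignment loss flow back into every Gaussian parameter.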

Key Contributions

  • Self-supervised 3D learning via foundation model alignment: Bridges 2D VFM priors to 3D through differentiable Gaussian splatting, eliminating the need for 3D annotations
  • Open-vocabulary occupancy prediction: Zero-shot semantic classification via vision-language alignment -- can recognize categories never seen during training
  • Extreme parameter efficiency: Uses only 3% of the scene representation parameters of dense voxel-based methods
  • Training efficiency: 40% reduction in training time (~12 hours on 8 A800 GPUs for 1000 scenes)
  • Sparse Gaussian representation: Replaces dense voxel grids with learnable 3D Gaussians, each parameterized by center, scaling, rotation, density, and feature vector (see the sketch after this list)
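
A concrete sketch of that per-Gaussian parameterization; the dimensions are assumptions based on standard 3D Gaussian splatting, not the paper's exact configuration:

```python
from dataclasses import dataclass
import torch

@dataclass
class Gaussians:
    """A set of N sparse scene Gaussians (assumed layout)."""
    centers:   torch.Tensor  # (N, 3) 3D positions
    scales:    torch.Tensor  # (N, 3) per-axis extents
    rotations: torch.Tensor  # (N, 4) unit quaternions
    densities: torch.Tensor  # (N, 1) opacities in [0, 1]
    features:  torch.Tensor  # (N, C) VFM-aligned semantic features
```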

Architecture / Method

[Figure: GaussTR framework overview]

[Figure: Performance vs. efficiency comparison]

                       GaussTR Pipeline

┌───────────┐     ┌───────────────────────────┐
│ Multi-cam │────►│ Pre-trained VFMs          │
│ Images    │     │ (CLIP, DINO, Metric3D V2) │
└───────────┘     └─────────────┬─────────────┘
                                │
              Multi-scale 2D features + depth maps
                                │
┌──────────────────┐            │
│ Learnable        │            │
│ Gaussian Queries │            │
└────────┬─────────┘            │
         │                      │
         ▼                      ▼
┌───────────────────────────────────────┐
│ Transformer Decoder                   │
│ ┌───────────────────────────────────┐ │
│ │ Deformable Cross-Attention        │ │
│ │ (aggregate 2D features)           │ │
│ ├───────────────────────────────────┤ │
│ │ Global Self-Attention             │ │
│ │ (3D positional encodings)         │ │
│ ├───────────────────────────────────┤ │
│ │ Gaussian Head (MLP):              │ │
│ │ center, scale, rot, density, feat │ │
│ └───────────────────────────────────┘ │
└──────────────────┬────────────────────┘
                   │
          Predicted 3D Gaussians
                   │
     ┌─────────────┴─────────────────────┐
  (Training)                        (Inference)
┌──────────────────────────┐  ┌────────────────────────────┐
│ Diff. Splatting ──►      │  │ Text prototype embeddings  │
│ 2D rendered features     │  │ (from VFM)                 │
│           │              │  │             │              │
│           ▼              │  │             ▼              │
│ PCA + Cosine sim loss    │  │ Cosine sim ──► semantic    │
│ vs VFM features          │  │ logits (zero-shot)         │
│ + Depth loss (SILog)     │  │             │              │
└──────────────────────────┘  │             ▼              │
                              │ Voxelize ──► Occ Grid      │
                              └────────────────────────────┘

Input Processing:
  • Multi-view images processed through pre-trained VFMs (CLIP, DINO variants, Metric3D V2)
  • Extraction of multi-scale 2D features and dense depth maps
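
A shape-level sketch of this stage; the sizes are illustrative assumptions, and the random tensors stand in for frozen VFM outputs:

```python
import torch

N_cam, H, W, C = 6, 224, 400, 768      # cameras, image size, feature dim (assumed)
images = torch.rand(N_cam, 3, H, W)    # surround-view RGB input

# Stand-ins for frozen VFM outputs (CLIP/DINO features, Metric3D V2 depth):
feats_2d = torch.rand(N_cam, C, H // 14, W // 14)   # patch-level feature maps
depth    = torch.rand(N_cam, 1, H, W) * 50.0        # dense metric depth in meters
```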

Gaussian Prediction (Transformer Decoder):
  • Learnable Gaussian queries decoded via deformable cross-attention for 2D feature aggregation
  • Global self-attention with 3D positional encodings for inter-Gaussian reasoning
  • MLP-based Gaussian head predicts: 3D center, scaling, rotation, density, feature vector
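
A minimal, self-contained sketch of this decoding loop. Vanilla multi-head attention stands in for deformable attention, the multi-view features are flattened, and all sizes are assumptions, so this illustrates the data flow rather than the paper's exact modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianDecoderSketch(nn.Module):
    def __init__(self, num_queries=300, dim=256, feat_dim=768, out_feat=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.feat_proj = nn.Linear(feat_dim, dim)       # project 2D features
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.pos_enc = nn.Linear(3, dim)                # 3D positional encoding
        # Head output: center(3) + scale(3) + rotation(4) + density(1) + feature
        self.head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 3 + 3 + 4 + 1 + out_feat),
        )

    def forward(self, feats_2d, centers_prev):
        """feats_2d: (B, L, feat_dim) flattened multi-view features;
        centers_prev: (B, Q, 3) current 3D center estimates."""
        B = feats_2d.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        kv = self.feat_proj(feats_2d)
        q, _ = self.cross_attn(q, kv, kv)               # aggregate 2D evidence
        q = q + self.pos_enc(centers_prev)              # inject 3D positions
        q, _ = self.self_attn(q, q, q)                  # inter-Gaussian reasoning
        out = self.head(q)
        center, scale, rot, density, feat = torch.split(
            out, [3, 3, 4, 1, out.shape[-1] - 11], dim=-1)
        # Constrain each parameter to its valid range
        return (center, scale.exp(), F.normalize(rot, dim=-1),
                density.sigmoid(), feat)
```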

Self-Supervised Training:
  1. Differentiable splatting renders predicted Gaussians back to 2D views
  2. PCA reduces feature dimensionality for tractable comparison
  3. Three loss components: feature alignment (cosine similarity with VFM features), depth supervision (SILog + L1 with Metric3D), optional segmentation loss
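
A sketch of the first two loss terms under those definitions: cosine alignment on (already PCA-reduced) rendered features, plus the standard scale-invariant log depth error with an L1 term. The weights `lam` and `w_depth` are assumptions:

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(rendered, target):
    """Cosine-similarity alignment between rendered and VFM features, shape (..., C)."""
    return (1 - F.cosine_similarity(rendered, target, dim=-1)).mean()

def silog_loss(pred_depth, gt_depth, lam=0.85, eps=1e-6):
    """Scale-invariant log depth error (Eigen et al.); lam is an assumed weight."""
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2 + eps)

def total_loss(rendered_feats, vfm_feats, pred_depth, vfm_depth, w_depth=1.0):
    # w_depth is an assumed balancing weight; the optional segmentation loss is omitted
    return (feature_alignment_loss(rendered_feats, vfm_feats)
            + w_depth * (silog_loss(pred_depth, vfm_depth)
                         + F.l1_loss(pred_depth, vfm_depth)))
```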

Open-Vocabulary Inference:
  • Generate text prototype embeddings from the foundation model for target categories
  • Compute semantic logits via cosine similarity between Gaussian features and text prototypes
  • Voxelize Gaussians for the final occupancy grid output
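
A sketch of these inference steps. The text prototypes are assumed to come from the foundation model's text encoder; the grid extents match the coverage quoted in the Results section; and the voxelization is simplified to nearest-voxel rasterization of Gaussian centers (the real procedure would account for each Gaussian's scale and density):

```python
import torch
import torch.nn.functional as F

def zero_shot_labels(gauss_feats, text_protos):
    """gauss_feats: (N, C) Gaussian features; text_protos: (K, C) text prototypes."""
    logits = F.normalize(gauss_feats, dim=-1) @ F.normalize(text_protos, dim=-1).T
    return logits.argmax(dim=-1)                       # (N,) per-Gaussian class ids

def voxelize(centers, labels, extent=(80.0, 80.0, 6.4), res=0.4, empty_id=-1):
    """Scatter labeled Gaussian centers into an ego-centered occupancy grid."""
    dims = tuple(int(e / res) for e in extent)         # (200, 200, 16)
    grid = torch.full(dims, empty_id, dtype=torch.long)
    idx = ((centers + centers.new_tensor(extent) / 2) / res).floor().long()
    ok = ((idx >= 0) & (idx < idx.new_tensor(dims))).all(dim=-1)  # in-bounds mask
    i = idx[ok]
    grid[i[:, 0], i[:, 1], i[:, 2]] = labels[ok]
    return grid
```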

Results

Metric                                   Value
Zero-shot mIoU (Occ3D-nuScenes)          12.27
Improvement over prior self-supervised   +2.33 mIoU (23%)
Training time reduction                  40%
Parameter efficiency vs voxel methods    3%
Training duration                        ~12 hours (8x A800 GPUs)
Spatial coverage                         80 m x 80 m x 6.4 m at 0.4 m resolution
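
As a quick check, the quoted coverage implies the standard Occ3D-nuScenes grid shape (assuming an ego-centered range):

```python
extent = (80.0, 80.0, 6.4)                  # meters (x, y, z)
res = 0.4                                   # meters per voxel
print(tuple(int(e / res) for e in extent))  # (200, 200, 16) -> 640,000 voxels
```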

Foundation model ablation:
  • Talk2DINO: 12.27 mIoU (best)
  • FeatUp: 11.70 mIoU

Category strengths: Vehicles, trucks, and man-made structures perform well; small objects and flat surfaces (e.g., sidewalks) are challenging due to occlusion and limited geometric signal.

Limitations

  • Zero-shot mIoU of 12.27 remains far below supervised methods (GaussianFormer-2 achieves 20.82); the self-supervised advantage is annotation-free training, not raw performance
  • Dependent on quality of 2D foundation models; performance varies significantly with VFM choice (Talk2DINO vs FeatUp)
  • Small objects and flat surfaces remain challenging -- the Gaussian representation struggles with thin/flat geometry
  • No temporal modeling; single-frame prediction without history aggregation
  • Open-vocabulary capability limited by the foundation model's vocabulary and feature quality

Connections