OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
Read on arXiv
Overview
OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc, OccFormer, BEVFormer-based pipelines) train feed-forward networks to directly regress voxel labels from multi-camera images, which treats each voxel independently and fails to capture the strong structural priors of real-world 3D scenes -- walls are planar, roads are flat, vehicles have characteristic shapes. OccGen argues that modeling the joint distribution of 3D occupancy with a generative process can exploit these scene-level priors to produce more coherent and complete predictions, especially in occluded or ambiguous regions.
The core idea is to use a diffusion model that operates on a 3D occupancy representation. A multi-modal condition network encodes multi-camera images (and optionally LiDAR) into spatial conditioning features. During inference, OccGen iteratively denoises a randomly initialized 3D occupancy volume using DDIM, progressively refining it toward a plausible scene configuration consistent with the camera observations. A progressive refinement decoder with stacked deformable attention layers produces the final semantic occupancy predictions.
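The inference procedure above can be sketched in a few lines. The snippet below is a toy illustration of conditional DDIM sampling under a cosine noise schedule, not OccGen's actual implementation: the denoiser, the conditioning volume, and the step count are placeholder assumptions.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level under a cosine noise schedule."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def ddim_sample(denoise_fn, cond, shape, steps=4, seed=0, T=1000):
    """Deterministic DDIM reverse process starting from pure Gaussian noise.

    denoise_fn(x, cond, t) predicts the clean volume x0 from the noisy
    sample x and the conditioning features (a stand-in for OccGen's
    progressive refinement decoder).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # initial noise volume
    ts = np.linspace(T, 0, steps + 1)         # evenly spaced timesteps
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        ab_cur = cosine_alpha_bar(t_cur, T)
        ab_next = cosine_alpha_bar(t_next, T)
        x0 = denoise_fn(x, cond, t_cur)       # predicted clean volume
        eps = (x - np.sqrt(ab_cur) * x0) / np.sqrt(1 - ab_cur)
        x = np.sqrt(ab_next) * x0 + np.sqrt(1 - ab_next) * eps
    return x

# Toy denoiser: pull the sample halfway toward the conditioning volume.
cond = np.ones((2, 2, 2))
toy_denoiser = lambda x, c, t: 0.5 * x + 0.5 * c
out = ddim_sample(toy_denoiser, cond, cond.shape)
```

Because the last timestep is t = 0 (where the cumulative signal level is 1), the final iterate equals the last x0 prediction, which is the usual DDIM endpoint.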
OccGen achieves strong results on the nuScenes-Occupancy benchmark (22.0% mIoU multi-modal, 14.5% camera-only, 16.8% LiDAR-only), demonstrating that the generative formulation improves over discriminative baselines -- particularly for rare and geometrically complex classes where structural priors matter most. The paper also shows that the diffusion framework gracefully integrates multi-modal conditioning (cameras, and optionally LiDAR), and that the iterative refinement produces visually more coherent occupancy maps compared to single-pass methods. A distinctive property of the generative approach is the ability to provide uncertainty estimates alongside predictions.
Key Contributions
- Generative occupancy formulation: First paper to formulate 3D occupancy prediction as a conditional generative task using score-based diffusion, demonstrating that modeling the joint distribution of voxels captures scene-level structural priors that discriminative methods miss
- Multi-modal condition network: A flexible BEV-based conditioning architecture that encodes multi-camera images (and optionally LiDAR) into a spatial conditioning signal for the diffusion process, enabling plug-and-play sensor fusion
- Progressive refinement decoder: Six stacked refinement layers with 3D deformable cross-attention and self-attention, using DDIM with a cosine noise schedule to iteratively denoise the occupancy volume; the denoising process inherently models coarse-to-fine refinement without a separate upsampling stage
- Progressive denoising for occupancy: Demonstrates that iterative refinement through the diffusion reverse process produces more spatially coherent predictions than single-pass feed-forward methods, with particular gains on geometrically complex and rare classes
- Uncertainty estimation: As a generative model, OccGen natively provides uncertainty estimates alongside occupancy predictions, a capability single-pass discriminative methods do not naturally offer
- Strong nuScenes-Occupancy results: 22.0% mIoU (multi-modal), 14.5% (camera-only), 16.8% (LiDAR-only), with relative mIoU improvements of 9.5%, 13.3%, and 6.3% respectively over the prior state-of-the-art
Architecture / Method
┌────────────────────────────────────────────────────────────┐
│              Conditional Encoder (two-stream)              │
│                                                            │
│  Camera: 2D Backbone + FPN ──► Gumbel-Softmax hard         │
│                                2D-to-3D view transform     │
│                                        │                   │
│  LiDAR: VoxelNet + 3D sparse conv ─────┤                   │
│                                geometry mask fuse          │
└──────────────────────────┬─────────────────────────────────┘
                           │ multi-modal conditioning features
                           │
┌──────────────────────────┴─────────────────────────────────┐
│           Progressive Refinement Decoder (DDIM)            │
│                                                            │
│  Training: GT occ ──► add noise (cosine schedule)          │
│                                                            │
│  Inference: Gaussian noise                                 │
│     ──► [Layer 1: 3D deformable cross-attn +               │
│          self-attn + time embed]                           │
│     ──► [Layer 2] ──► ... ──► [Layer 6]                    │
│          ▲ conditioned on multi-modal features             │
│                                                            │
│  Loss: cross-entropy + Lovász-softmax + affinity + depth   │
└──────────────────────────┬─────────────────────────────────┘
                           │
                           ▼
               Semantic Occupancy Volume
OccGen's pipeline consists of three main components:
1. Multi-modal Condition Network (Conditional Encoder). A two-stream architecture encodes multi-modal inputs. The camera stream processes images through a pre-trained 2D backbone with Feature Pyramid Network and projects them into 3D using a novel hard 2D-to-3D view transformation based on Gumbel-Softmax for deterministic depth assignment (not the LSS soft-distribution approach). The LiDAR stream uses VoxelNet with 3D sparse convolutions. When both modalities are available, a geometry mask derived from LiDAR is used to refine camera features, enabling cross-modal fusion.
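The hard depth assignment idea can be sketched as follows, assuming a per-pixel logit distribution over discrete depth bins. This is the forward pass only; the straight-through gradient trick and the actual lifting into the voxel grid are omitted, and all names are illustrative:

```python
import numpy as np

def gumbel_softmax_hard(logits, tau=1.0, seed=0):
    """Hard (one-hot) Gumbel-Softmax over the depth-bin axis (last axis).

    A real implementation would keep the soft probabilities for the
    backward pass (straight-through estimator); here we only show the
    discrete forward selection.
    """
    rng = np.random.default_rng(seed)
    # Gumbel(0, 1) noise makes the argmax a sample from the softmax.
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))
    y = (logits + g) / tau
    soft = np.exp(y - y.max(-1, keepdims=True))
    soft /= soft.sum(-1, keepdims=True)
    # One-hot selection: each pixel commits to exactly one depth bin.
    hard = (soft == soft.max(-1, keepdims=True)).astype(float)
    return hard

# Toy per-pixel depth logits: H x W pixels, D depth bins.
depth_logits = np.random.default_rng(1).normal(size=(4, 6, 8))
onehot = gumbel_softmax_hard(depth_logits)
```

The contrast with LSS is that each pixel's feature lands at a single 3D location rather than being smeared along the ray as a soft depth distribution.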
2. Progressive Refinement Decoder. The core generative component applies diffusion denoising using the multi-modal encoded features as conditions. Rather than a 3D U-Net, it uses six stacked refinement layers with 3D deformable cross-attention and self-attention mechanisms that operate on multi-scale noise maps with time embeddings from the diffusion module. DDIM (Denoising Diffusion Implicit Models) with a cosine noise schedule is used for both training and inference. The training loss combines cross-entropy, Lovász-softmax, affinity (geometric and semantic), and depth supervision.
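The loss combination can be sketched as a weighted sum. In the sketch below a simple soft-IoU surrogate stands in for the Lovász-softmax term, the affinity and depth terms are omitted, and the weights are illustrative assumptions rather than the paper's values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis, keepdims=True))
    return e / e.sum(axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean per-voxel cross-entropy over a (..., C) logit grid."""
    p = softmax(logits).reshape(-1, logits.shape[-1])
    n = np.arange(labels.size)
    return -np.log(p[n, labels.ravel()] + 1e-9).mean()

def soft_iou_loss(logits, labels, num_classes):
    """Differentiable IoU surrogate (stand-in for Lovász-softmax)."""
    p = softmax(logits).reshape(-1, num_classes)
    onehot = np.eye(num_classes)[labels.ravel()]
    inter = (p * onehot).sum(0)
    union = (p + onehot - p * onehot).sum(0)
    return (1 - inter / (union + 1e-9)).mean()

# Toy 4x4x4 voxel grid with 5 semantic classes.
logits = np.random.default_rng(0).normal(size=(4, 4, 4, 5))
labels = np.random.default_rng(1).integers(0, 5, size=(4, 4, 4))
total = 1.0 * cross_entropy(logits, labels) + 1.0 * soft_iou_loss(logits, labels, 5)
```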
3. Inference. OccGen starts from Gaussian noise and runs DDIM reverse diffusion (with asymmetric time intervals, td=1), conditioned on the encoded multi-modal observations, progressively denoising to produce a semantic occupancy volume. The denoising process inherently models coarse-to-fine refinement of the dense 3D occupancy map.
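At training time, the reverse process is supervised by corrupting the ground-truth occupancy to a random diffusion step under the cosine schedule. The sketch below assumes a scaled one-hot ("analog bits"-style) encoding of the discrete labels, which is an illustrative choice rather than OccGen's documented encoding:

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level under a cosine noise schedule."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def noisy_training_target(labels, num_classes, t, T=1000, scale=0.1, seed=0):
    """Encode discrete GT occupancy continuously and corrupt it to step t.

    x0 lives in [-scale, scale] per class channel; the decoder is then
    trained to recover the clean labels from (x_t, t, conditioning).
    """
    x0 = scale * (2 * np.eye(num_classes)[labels] - 1)
    ab = cosine_alpha_bar(t, T)
    eps = np.random.default_rng(seed).standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

# Toy 4x4x4 GT occupancy grid with 5 classes, corrupted to mid-schedule.
labels = np.random.default_rng(1).integers(0, 5, size=(4, 4, 4))
xt = noisy_training_target(labels, 5, t=500)
```

At t = 0 the cumulative signal level is exactly 1, so the function returns the clean encoding unchanged; at t = T it returns (nearly) pure noise.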
Results
OccGen was evaluated primarily on the nuScenes-Occupancy benchmark:
| Method | Setting | mIoU | Type |
|---|---|---|---|
| OpenOccupancy (Baseline) | Multi-modal | 15.1 | Discriminative |
| CONet | Multi-modal | 20.1 | Discriminative |
| OccGen | Multi-modal | 22.0 | Generative (diffusion) |
| C-CONet | Camera-only | 12.8 | Discriminative |
| C-OccGen | Camera-only | 14.5 | Generative (diffusion) |
| L-CONet | LiDAR-only | 15.8 | Discriminative |
| L-OccGen | LiDAR-only | 16.8 | Generative (diffusion) |
OccGen also achieves 13.74% mIoU on SemanticKITTI (vs OccFormer 13.46%).
Key findings:
- OccGen relatively improves mIoU by 9.5% (multi-modal), 13.3% (camera-only), and 6.3% (LiDAR-only) on the nuScenes-Occupancy benchmark vs. the prior state-of-the-art
- The generative formulation provides the largest improvements on rare and geometrically complex classes, where structural priors matter most
- Multi-modal conditioning (camera + LiDAR) yields additional gains over unimodal inputs, demonstrating the flexibility of the conditioning framework
- As a generative model, OccGen can produce uncertainty estimates alongside predictions, a capability not naturally available in discriminative baselines
- Inference latency (~357-400ms) is comparable to single-forward discriminative methods despite iterative denoising
- Qualitative results show visually more coherent and complete occupancy maps, with better completion of occluded regions
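The uncertainty estimates come from the stochasticity of the sampler: drawing several occupancy samples with different noise seeds and measuring per-voxel disagreement. A minimal sketch, with a random stub sampler standing in for the full DDIM pipeline:

```python
import numpy as np

def sample_occupancy(seed, shape=(4, 4, 4), num_classes=5):
    """Stub for one full diffusion sampling run (random stand-in)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=shape)

def voxel_uncertainty(samples, num_classes=5):
    """Per-voxel entropy of the empirical class distribution.

    High entropy = the sampler disagrees with itself at that voxel,
    e.g. in occluded or ambiguous regions.
    """
    freqs = np.stack(
        [(np.stack(samples) == c).mean(0) for c in range(num_classes)], -1
    )
    return -(freqs * np.log(freqs + 1e-9)).sum(-1)

# Draw several samples and map their disagreement.
samples = [sample_occupancy(s) for s in range(8)]
unc = voxel_uncertainty(samples)
```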
Limitations & Open Questions
- Inference speed: Despite using DDIM and achieving comparable latency to some discriminative baselines (~357-400ms), the iterative denoising process still exceeds real-time requirements for safety-critical autonomous driving deployment
- Computational cost: The stacked 3D deformable attention refinement layers operating on volumetric representations are memory-intensive, compounding the existing cost of occupancy prediction
- Limited temporal modeling: OccGen operates on single-frame observations without explicit temporal aggregation; combining diffusion-based occupancy with temporal world models (OccWorld, Drive-OccWorld) is an open direction
- Scaling to higher resolutions: The coarse-to-fine strategy helps, but scaling diffusion to very fine voxel resolutions (e.g., 0.1m) remains challenging
- How many denoising steps are truly needed? Recent work on truncated diffusion (DiffusionDrive) and single-step flow matching (GoalFlow) suggests that full multi-step denoising may be unnecessary -- could occupancy prediction benefit from similar truncation?
Connections
Related papers in the wiki:
- Surroundocc Multi Camera 3D Occupancy Prediction For Autonomous Driving -- foundational discriminative occupancy method that OccGen builds upon and compares against
- Occformer Dual Path Transformer For Vision Based 3D Semantic Occupancy Prediction -- efficient dual-path transformer baseline for discriminative occupancy
- Gaussianformer Scene As Gaussians For Vision Based 3D Semantic Occupancy Prediction -- alternative efficient occupancy representation using sparse Gaussians (ECCV 2024 peer)
- Occworld Learning A 3D Occupancy World Model For Autonomous Driving -- generative occupancy world model using VQ-VAE + GPT (ECCV 2024 peer); generative but for forecasting rather than perception
- Drive Occworld Driving In The Occupancy World -- 4D occupancy forecasting world model extending OccWorld ideas
- Bevformer Learning Birds Eye View Representation From Multi Camera Images Via Spatiotemporal Transformers -- BEV feature extraction backbone used in OccGen's condition network
- Lift Splat Shoot Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D -- LSS soft-depth approach that OccGen contrasts against; OccGen replaces it with a Gumbel-Softmax hard 2D-to-3D view transform for deterministic depth assignment
- Denoising Diffusion Probabilistic Models -- foundational DDPM framework that OccGen adapts to 3D occupancy
- Bevdiffuser Plug And Play Diffusion Model For Bev Denoising -- related use of diffusion for BEV feature denoising (training-only), complementary approach
- Flashocc Fast And Memory Efficient Occupancy Prediction Via Channel To Height Plugin -- efficiency-focused occupancy method, highlighting the speed gap OccGen needs to close
- Occmamba Semantic Occupancy Prediction With State Space Models -- Mamba-based occupancy with linear complexity, another efficiency contrast