ESC

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

๐Ÿ“„ Read on arXiv

Overview

OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc, OccFormer, BEVFormer-based pipelines) train feed-forward networks to directly regress voxel labels from multi-camera images, which treats each voxel independently and fails to capture the strong structural priors of real-world 3D scenes -- walls are planar, roads are flat, vehicles have characteristic shapes. OccGen argues that modeling the joint distribution of 3D occupancy with a generative process can exploit these scene-level priors to produce more coherent and complete predictions, especially in occluded or ambiguous regions.

The core idea is to use a diffusion model that operates on a 3D occupancy representation. A multi-modal condition network encodes multi-camera images (and optionally LiDAR) into spatial conditioning features. During inference, OccGen iteratively denoises a randomly initialized 3D occupancy volume using DDIM, progressively refining it toward a plausible scene configuration consistent with the camera observations. A progressive refinement decoder with stacked deformable attention layers produces the final semantic occupancy predictions.

OccGen achieves strong results on the nuScenes-Occupancy benchmark (22.0% mIoU multi-modal, 14.5% camera-only, 16.8% LiDAR-only), demonstrating that the generative formulation improves over discriminative baselines -- particularly for rare and geometrically complex classes where structural priors matter most. The paper also shows that the diffusion framework gracefully integrates multi-modal conditioning (cameras, and optionally LiDAR), and that the iterative refinement produces visually more coherent occupancy maps compared to single-pass methods. A distinctive property of the generative approach is the ability to provide uncertainty estimates alongside predictions.

Key Contributions

  • Generative occupancy formulation: First paper to formulate 3D occupancy prediction as a conditional generative task using score-based diffusion, demonstrating that modeling the joint distribution of voxels captures scene-level structural priors that discriminative methods miss
  • Multi-modal condition network: A flexible BEV-based conditioning architecture that encodes multi-camera images (and optionally LiDAR) into a spatial conditioning signal for the diffusion process, enabling plug-and-play sensor fusion
  • Progressive refinement decoder: Six stacked refinement layers with 3D deformable cross-attention and self-attention, using DDIM with a cosine noise schedule to iteratively denoise the occupancy volume; the denoising process inherently models coarse-to-fine refinement without a separate upsampling stage
  • Progressive denoising for occupancy: Demonstrates that iterative refinement through the diffusion reverse process produces more spatially coherent predictions than single-pass feed-forward methods, with particular gains on geometrically complex and rare classes
  • Uncertainty estimation: As a generative model, OccGen natively provides uncertainty estimates alongside occupancy predictions, a capability discriminative methods cannot offer
  • Strong nuScenes-Occupancy results: 22.0% mIoU (multi-modal), 14.5% (camera-only), 16.8% (LiDAR-only), with relative mIoU improvements of 9.5%, 13.3%, and 6.3% respectively over the prior state-of-the-art

Architecture / Method

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚            Conditional Encoder (two-stream)                 โ”‚
โ”‚                                                             โ”‚
โ”‚  Camera: 2D Backbone+FPN โ”€โ”€โ–บ Gumbel-Softmax hard           โ”‚
โ”‚                               2D-to-3D view transform       โ”‚
โ”‚                                              โ”‚              โ”‚
โ”‚  LiDAR:  VoxelNet + 3D sparse conv โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค              โ”‚
โ”‚                                  geometry mask fuse         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                          โ”‚ multi-modal conditioning features
                          โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         Progressive Refinement Decoder (DDIM)               โ”‚
โ”‚                                                             โ”‚
โ”‚  Training:  GT occ โ”€โ”€โ–บ add noise (cosine schedule)         โ”‚
โ”‚                                                             โ”‚
โ”‚  Inference: Gaussian noise                                  โ”‚
โ”‚             โ”€โ”€โ–บ [Layer 1: 3D deformable cross-attn +        โ”‚
โ”‚                           self-attn + time embed]           โ”‚
โ”‚             โ”€โ”€โ–บ [Layer 2] โ”€โ”€โ–บ ... โ”€โ”€โ–บ [Layer 6]            โ”‚
โ”‚                 โ–ฒ conditioned on multi-modal features        โ”‚
โ”‚                                                             โ”‚
โ”‚  Loss: cross-entropy + lovรกsz-softmax + affinity + depth    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                          โ”‚
                          โ–ผ
              Semantic Occupancy Volume

OccGen's pipeline consists of three main components:

1. Multi-modal Condition Network (Conditional Encoder). A two-stream architecture encodes multi-modal inputs. The camera stream processes images through a pre-trained 2D backbone with Feature Pyramid Network and projects them into 3D using a novel hard 2D-to-3D view transformation based on Gumbel-Softmax for deterministic depth assignment (not the LSS soft-distribution approach). The LiDAR stream uses VoxelNet with 3D sparse convolutions. When both modalities are available, a geometry mask derived from LiDAR is used to refine camera features, enabling cross-modal fusion.

2. Progressive Refinement Decoder. The core generative component applies diffusion denoising using the multi-modal encoded features as conditions. Rather than a 3D U-Net, it uses six stacked refinement layers with 3D deformable cross-attention and self-attention mechanisms that operate on multi-scale noise maps with time embeddings from the diffusion module. DDIM (Denoising Diffusion Implicit Models) with a cosine noise schedule is used for both training and inference. The training loss combines cross-entropy, lovรกsz-softmax, affinity (geometric and semantic), and depth supervision.

3. Inference. OccGen starts from Gaussian noise and runs DDIM reverse diffusion (with asymmetric time intervals, td=1), conditioned on the encoded multi-modal observations, progressively denoising to produce a semantic occupancy volume. The denoising process inherently models coarse-to-fine refinement of the dense 3D occupancy map.

Results

OccGen was evaluated primarily on the nuScenes-Occupancy benchmark:

Method Setting mIoU Type
OpenOccupancy (Baseline) Multi-modal 15.1 Discriminative
CONet Multi-modal 20.1 Discriminative
OccGen Multi-modal 22.0 Generative (diffusion)
C-CONet Camera-only 12.8 Discriminative
C-OccGen Camera-only 14.5 Generative (diffusion)
L-CONet LiDAR-only 15.8 Discriminative
L-OccGen LiDAR-only 16.8 Generative (diffusion)

OccGen also achieves 13.74% mIoU on SemanticKITTI (vs OccFormer 13.46%).

Key findings: - OccGen relatively improves mIoU by 9.5% (multi-modal), 13.3% (camera-only), and 6.3% (LiDAR-only) on the nuScenes-Occupancy benchmark vs. the prior state-of-the-art - The generative formulation provides the largest improvements on rare and geometrically complex classes where structural priors matter most - Multi-modal conditioning (camera + LiDAR) yields additional gains over unimodal, demonstrating the flexibility of the conditioning framework - As a generative model, OccGen can produce uncertainty estimates alongside predictions -- a capability unavailable in discriminative baselines - Comparable inference latency (~357-400ms) to single-forward discriminative methods despite iterative denoising - Qualitative results show visually more coherent and complete occupancy maps, with better completion of occluded regions

Limitations & Open Questions

  • Inference speed: Despite using DDIM and achieving comparable latency to some discriminative baselines (~357-400ms), the iterative denoising process still exceeds real-time requirements for safety-critical autonomous driving deployment
  • Computational cost: The stacked 3D deformable attention refinement layers operating on volumetric representations are memory-intensive, compounding the existing cost of occupancy prediction
  • Limited temporal modeling: OccGen operates on single-frame observations without explicit temporal aggregation; combining diffusion-based occupancy with temporal world models (OccWorld, Drive-OccWorld) is an open direction
  • Scaling to higher resolutions: The coarse-to-fine strategy helps, but scaling diffusion to very fine voxel resolutions (e.g., 0.1m) remains challenging
  • How many denoising steps are truly needed? Recent work on truncated diffusion (DiffusionDrive) and single-step flow matching (GoalFlow) suggests that full multi-step denoising may be unnecessary -- could occupancy prediction benefit from similar truncation?

Connections

Related papers in the wiki: