Overview

OccWorld introduces a generative world model that operates in 3D semantic occupancy space, jointly forecasting future scene evolution and ego vehicle trajectories. The key idea is to tokenize 3D occupancy volumes using a VQ-VAE, then train a GPT-like spatial-temporal transformer to autoregressively predict future scene tokens and ego motion. This enables the model to "imagine" future 3D scenes conditioned on ego actions, providing both occupancy forecasting for perception and trajectory evaluation for planning. Unlike detection-based approaches, occupancy representations naturally handle irregular objects (debris, construction equipment) that bounding boxes cannot model. OccWorld achieves planning performance competitive with UniAD while requiring significantly less supervision -- no HD maps or instance-level annotations are needed.

Note: This is the ORIGINAL OccWorld paper (ECCV 2024, arXiv 2311.16038). A different paper, Drive-OccWorld (AAAI 2025, arXiv 2408.14197), extends this work and is documented separately at Drive Occworld Driving In The Occupancy World.

Key Contributions

  • 3D occupancy as world model substrate: Uses semantic occupancy voxels as the representation for generative world modeling, enabling expressiveness across sensor modalities without instance-level supervision
  • Scene tokenization via VQ-VAE: Converts continuous 3D occupancy volumes to BEV, then encodes and quantizes them into discrete tokens using a learned codebook
  • Spatial-temporal GPT transformer: Hierarchical autoregressive model with spatial aggregation within timesteps and temporal causal self-attention across timesteps for joint scene and ego prediction
  • Minimal supervision: Competitive planning performance without HD maps or instance annotations, unlike UniAD which requires both

Architecture / Method

                         OccWorld Pipeline
                         ─────────────────

  3D Semantic         ┌──────────────────────┐
  Occupancy    ──────►│  Height Compression  │
  (H x W x Z)         │  (3D ──► BEV)        │
                      └──────────┬───────────┘
                                 │
                                 ▼
                      ┌──────────────────────┐
                      │  CNN Encoder         │
                      │  (Downsample BEV)    │
                      └──────────┬───────────┘
                                 │
                                 ▼
                      ┌──────────────────────┐      ┌────────────┐
                      │  Vector Quantization │◄─────│  Codebook  │
                      │  z = argmin||f - c|| │      └────────────┘
                      └──────────┬───────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
     ┌─────────────────┐  ┌──────────────┐  ┌────────────────┐
     │  CNN Decoder    │  │  Scene       │  │  Ego Tokens    │
     │  (Reconstruct   │  │  Tokens      │  │  (Vehicle      │
     │   3D Occupancy) │  │              │  │   State)       │
     └─────────────────┘  └──────┬───────┘  └───────┬────────┘
                                 │                  │
                                 └────────┬─────────┘
                                          ▼
                              ┌───────────────────────┐
                              │  Spatial-Temporal GPT │
                              │  ┌─────────────────┐  │
                              │  │ Spatial Attn    │  │
                              │  │(within timestep)│  │
                              │  └────────┬────────┘  │
                              │           ▼           │
                              │  ┌─────────────────┐  │
                              │  │ Temporal Causal │  │
                              │  │ Attn (across t) │  │
                              │  └────────┬────────┘  │
                              └───────────┼───────────┘
                                 ┌────────┴────────┐
                                 ▼                 ▼
                             Future 4D      Ego Trajectory
                             Occupancy        Prediction
[Figure: OccWorld Overview]
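
To make the pipeline above concrete, here is a minimal PyTorch-style sketch of the autoregressive rollout. The module names (tokenizer, world_model) and their interfaces are illustrative assumptions, not the released OccWorld API.

    import torch

    @torch.no_grad()
    def rollout(tokenizer, world_model, occ_history, ego_history, horizon=6):
        """Imagine future frames autoregressively (6 x 0.5s = 3s horizon).

        Assumed interfaces (hypothetical, for illustration only):
          tokenizer.encode: occupancy volume (B, H, W, Z) -> scene tokens
          tokenizer.decode: scene tokens -> reconstructed 3D occupancy
          world_model: predicts next-frame scene tokens and ego token
                       from the causal history of world tokens.
        """
        scene_tokens = [tokenizer.encode(o) for o in occ_history]
        ego_tokens = list(ego_history)
        future_occ, future_ego = [], []
        for _ in range(horizon):
            next_scene, next_ego = world_model(scene_tokens, ego_tokens)
            scene_tokens.append(next_scene)   # feed predictions back in
            ego_tokens.append(next_ego)
            future_occ.append(tokenizer.decode(next_scene))  # 4D occupancy
            future_ego.append(next_ego)                      # trajectory
        return future_occ, future_ego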

Scene Tokenizer (VQ-VAE):

  • Transforms 3D semantic occupancy into a Bird's-Eye-View (BEV) representation via height compression
  • A CNN encoder extracts downsampled BEV features
  • Vector quantization maps each feature to its nearest codebook entry: z_ij = argmin_c ||f_ij - c||_2 (see the sketch after this list)
  • A CNN decoder reconstructs the full 3D occupancy from the quantized BEV tokens
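
A minimal PyTorch sketch of the quantization step; num_codes, dim, and tensor shapes are illustrative assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Nearest-neighbour codebook lookup for BEV features.

        num_codes and dim are illustrative, not the paper's config.
        """
        def __init__(self, num_codes=512, dim=128):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, f):                  # f: (B, h, w, dim) BEV features
            flat = f.reshape(-1, f.shape[-1])
            # z_ij = argmin_c ||f_ij - c||_2 over all codebook entries
            dists = torch.cdist(flat, self.codebook.weight)
            idx = dists.argmin(dim=1)          # discrete scene tokens
            quantized = self.codebook(idx).reshape(f.shape)
            # Straight-through estimator keeps the encoder trainable
            quantized = f + (quantized - f).detach()
            return quantized, idx.reshape(f.shape[:-1])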

[Figure: Scene Tokenizer]

Generative Transformer:

  • Processes "world tokens" comprising scene tokens (quantized occupancy) and ego tokens (vehicle state)
  • Hierarchical multi-scale spatial processing within each timestep captures local and global scene structure
  • Temporal causal self-attention across timesteps: z_hat_{T+1,j,i} = Attention(z_{T,j,i}, z_{T-1,j,i}, ..., z_{T-t,j,i}) (a simplified sketch follows this list)
  • Autoregressively generates future 4D occupancy and ego displacement
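
A simplified single-scale sketch of the two attention stages. The paper uses a hierarchical multi-scale design; the dimensions and use of standard nn.MultiheadAttention here are assumptions for illustration:

    import torch
    import torch.nn as nn

    class SpatialTemporalBlock(nn.Module):
        """Spatial attention within each frame, then causal temporal
        attention per BEV location. Single-scale simplification."""
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                  # x: (B, T, N, dim); N = h*w tokens
            B, T, N, D = x.shape
            # Spatial attention: tokens attend within their own timestep.
            s = x.reshape(B * T, N, D)
            s, _ = self.spatial(s, s, s)
            x = x + s.reshape(B, T, N, D)
            # Temporal causal attention: each BEV location attends to its
            # own past, i.e. z_hat_{T+1,j,i} depends on z_{T,j,i}, z_{T-1,j,i}, ...
            t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                         device=x.device), diagonal=1)
            t, _ = self.temporal(t, t, t, attn_mask=mask)
            x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
            return x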

[Figure: Transformer Architecture]

Results

4D Occupancy Forecasting (3-second horizon):

Variant                          mIoU     IoU      Supervision
──────────────────────────────────────────────────────────────
OccWorld-O (ground-truth occ)    17.14%   26.63%   Oracle
OccWorld-D (dense supervision)    8.62%   16.53%   Dense LiDAR
OccWorld-T (sparse LiDAR)         3.56%    8.34%   Sparse LiDAR
OccWorld-S (self-supervised)      0.26%    5.00%   None
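
For reference, a sketch of how IoU and mIoU are typically computed on voxel label grids. The exact evaluation protocol (class list, ignore masks) is an assumption here, not taken from the paper:

    import torch

    def occupancy_iou(pred, gt, num_classes, empty_class=0):
        """pred, gt: integer voxel-label tensors of equal shape.

        IoU compares occupied-vs-empty; mIoU averages per-class IoU
        over the semantic (non-empty) classes present in the union.
        """
        occ_p, occ_g = pred != empty_class, gt != empty_class
        iou = (occ_p & occ_g).sum() / (occ_p | occ_g).sum().clamp(min=1)
        per_class = []
        for c in range(num_classes):
            if c == empty_class:
                continue
            p, g = pred == c, gt == c
            union = (p | g).sum()
            if union > 0:
                per_class.append((p & g).sum() / union)
        miou = torch.stack(per_class).mean() if per_class else torch.tensor(0.)
        return iou.float(), miou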

Motion Planning:

Method        Avg L2 (m)   L2 @1s   L2 @3s   Supervision
─────────────────────────────────────────────────────────
UniAD         1.03         --       --       HD maps + instances
OccWorld-O    1.17         0.43     1.99     Occupancy only
OccWorld-D    1.34         --       --       Dense LiDAR only

OccWorld achieves competitive planning (1.17m vs UniAD 1.03m) despite requiring far less supervision.
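
A sketch of the L2 metric, assuming waypoints at 0.5 s intervals and averaging errors up to each horizon; conventions differ between papers, so treat this as one plausible reading rather than the paper's exact protocol:

    import torch

    def planning_l2(pred_traj, gt_traj, steps_per_sec=2):
        """pred_traj, gt_traj: (T, 2) ego BEV waypoints (assumed 0.5s apart).

        Returns L2 displacement error averaged up to 1s/2s/3s horizons.
        """
        dists = torch.linalg.norm(pred_traj - gt_traj, dim=-1)   # (T,)
        return {f"L2@{h}s": dists[: h * steps_per_sec].mean().item()
                for h in (1, 2, 3)}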

Ablation Findings:

  • Removing spatial attention: mIoU drops from 17.14% to 10.07%
  • Removing temporal attention: mIoU drops to 8.98%
  • Eliminating joint ego modeling: L2 error degrades to 5.89m (vs 1.17m), confirming joint ego-scene modeling is essential

Limitations

  • Self-supervised variant (OccWorld-S) performs poorly (0.26% mIoU), indicating occupancy tokenization still benefits substantially from supervision
  • Planning results (1.17m L2) trail UniAD -- the occupancy representation is not yet sufficient to close the gap with detection-based scene understanding
  • Evaluation is open-loop on nuScenes; no closed-loop validation
  • VQ-VAE codebook size and spatial resolution create information bottlenecks that may limit fine-grained prediction quality
  • Computational cost of 3D occupancy processing is higher than BEV-only methods

Connections