Overview
HERMES tackles a fundamental limitation in autonomous driving: existing systems treat 3D scene understanding and future scene generation as separate problems. Driving World Models excel at predicting future scenes but operate as black boxes that cannot explain their reasoning, while Large Language Models applied to driving offer rich interpretability but cannot predict future 3D environment states. This separation leaves a critical gap for systems that need comprehensive situational awareness.
HERMES introduces a unified framework built around three key components: a Bird's-Eye View (BEV) tokenizer, a Large Language Model backbone, and a novel "world queries" mechanism for knowledge transfer between understanding and generation tasks. The BEV tokenizer compresses multi-view camera features into a manageable spatial representation, while world queries enable the LLM to share knowledge between textual scene understanding and 3D future prediction within a single forward pass.
The system achieves a 32.4% reduction in Chamfer Distance compared to ViDAR for future point cloud prediction, while outperforming OmniDrive by 8.0% on CIDEr for scene captioning. Scaling experiments from 0.8B to 3.8B parameters show consistent improvements, suggesting further gains with larger models.
Key Contributions
- First unified framework that simultaneously performs 3D scene understanding (captioning, QA) and future scene generation (point cloud prediction) within a single LLM (InternVL2)
- Novel "world queries" mechanism enabling cross-task knowledge transfer between understanding and generation through LLM causal attention
- BEV tokenization approach that compresses multi-view camera features into LLM-compatible spatial representations using BEVFormer v2
- Differentiable volume rendering pipeline (SDF-based, similar to NeRF) for converting BEV features into predicted 3D point clouds
- Demonstrated that unification improves generation quality over separated approaches, validating true synergy between understanding and generation
Architecture / Method
                    HERMES ARCHITECTURE

        Multi-View Cameras
        ┌─────┐┌─────┐┌─────┐
        │Cam 1││Cam 2││Cam N│
        └──┬──┘└──┬──┘└──┬──┘
           └──────┼──────┘
                  ▼
        ┌──────────────────┐     ┌──────────────────────┐
        │ CLIP ConvNeXt-L  │────►│     BEVFormer v2     │
        │     (frozen)     │     │   (BEV Tokenizer)    │
        └──────────────────┘     └──────────┬───────────┘
                                            │ BEV features
                             ┌──────────────┴──────────────┐
                             ▼                             ▼
                     ┌────────────────┐           ┌─────────────────┐
                     │  Down-sample   │           │ Max-Pool ──►    │
                     │  + Project     │           │ World Queries   │
                     └───────┬────────┘           │ + Ego-Motion    │
                             │                    │ + Frame Embed   │
                             │                    └────────┬────────┘
                             ▼                             ▼
              ┌────────────────────────────────────────────────┐
              │            LLM Backbone (InternVL2)            │
              │      [BEV tokens | Text tokens | World Q]      │
              │    Causal Attention ──► Knowledge Transfer     │
              │         (understanding ──► generation)         │
              └───────────┬────────────────────┬───────────────┘
                          │                    │
                          ▼                    ▼
                 ┌────────────────┐ ┌─────────────────────┐
                 │  Text Output   │ │  Current-to-Future  │
                 │ (Captions, QA) │ │  Cross-Attention    │
                 └────────────────┘ │          ▼          │
                                    │  Volume Rendering   │
                                    │  (SDF-based, NeRF)  │
                                    │          ▼          │
                                    │  3D Point Cloud     │
                                    │  Prediction         │
                                    └─────────────────────┘

The architecture processes multi-view camera images through a pre-trained CLIP ConvNeXt-L encoder, aggregating features into a BEV representation via BEVFormer v2. BEV features are compressed through down-sampling and projection to fit within LLM context limits. The LLM backbone is InternVL2, evaluated at scales from 0.8B to 3.8B parameters.
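A minimal sketch of the down-sample-and-project step described above, assuming a PyTorch-style module; the channel width, pooling factor, and class name are illustrative choices rather than HERMES's published configuration.

```python
import torch
import torch.nn as nn

class BEVTokenizer(nn.Module):
    """Illustrative sketch: compress a dense BEV feature map from BEVFormer v2
    into a short sequence of LLM-compatible tokens (dimensions are assumptions)."""

    def __init__(self, bev_channels=256, llm_dim=2048, pool_factor=4):
        super().__init__()
        # Spatial down-sampling keeps the token count within LLM context limits.
        self.downsample = nn.AvgPool2d(kernel_size=pool_factor)
        # Linear projection maps BEV channels to the LLM embedding width.
        self.project = nn.Linear(bev_channels, llm_dim)

    def forward(self, bev_feats):            # (B, C, H, W)
        x = self.downsample(bev_feats)       # (B, C, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)     # (B, N_tokens, C)
        return self.project(x)               # (B, N_tokens, llm_dim)
```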
World Queries Mechanism (sketched below):
- Initialized by max-pooling raw BEV features to capture salient scene information
- Enriched with ego-motion conditions and frame embeddings for controllable future generation
- Appended to the LLM input alongside BEV features and text tokens
- LLM causal attention lets the queries access world knowledge surfaced during textual understanding
- A "current-to-future link" module uses cross-attention to inject the enriched knowledge into future BEV representations
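A minimal sketch of how the query initialization and the current-to-future link could be wired together; every module name, dimension, and the ego-motion encoding below are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class WorldQueries(nn.Module):
    """Illustrative sketch: build world queries from BEV features, condition them on
    ego-motion and a future-frame index, and (after the LLM pass) inject the enriched
    queries into future BEV tokens via cross-attention. Shapes/dims are assumptions."""

    def __init__(self, bev_channels=256, llm_dim=2048, num_queries=64, num_frames=3, ego_dim=6):
        super().__init__()
        self.query_proj = nn.Linear(bev_channels, llm_dim)
        self.ego_proj = nn.Linear(ego_dim, llm_dim)           # assumed ego-motion encoding width
        self.frame_embed = nn.Embedding(num_frames, llm_dim)  # one embedding per future frame
        # "Current-to-future link": future BEV tokens attend to the LLM-enriched queries.
        self.current_to_future = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.grid = int(num_queries ** 0.5)

    def init_queries(self, bev_feats, ego_motion, frame_idx):
        # Max-pool the BEV map down to a small grid of salient cells, then project.
        pooled = nn.functional.adaptive_max_pool2d(bev_feats, (self.grid, self.grid))
        queries = self.query_proj(pooled.flatten(2).transpose(1, 2))    # (B, num_queries, llm_dim)
        cond = self.ego_proj(ego_motion) + self.frame_embed(frame_idx)  # (B, llm_dim)
        # Appended to [BEV tokens | text tokens] so causal attention can enrich them.
        return queries + cond.unsqueeze(1)

    def link(self, future_bev_tokens, enriched_queries):
        # Inject understanding-side knowledge into the future BEV representation
        # that is later rendered into a point cloud.
        out, _ = self.current_to_future(future_bev_tokens, enriched_queries, enriched_queries)
        return out
```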
Volume Rendering: HERMES converts BEV features into 3D point clouds using differentiable volume rendering. It predicts Signed Distance Function (SDF) values along sampled rays, similar to NeRF, producing dense 3D scene predictions.
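A compact sketch of NeuS-style SDF volume rendering of depth along camera rays, one common way to realize the step described above; the sigmoid-based opacity conversion and the sharpness parameter are assumptions, not necessarily HERMES's exact formulation.

```python
import torch

def render_depth_from_sdf(sdf, t_vals, s=10.0):
    """Illustrative sketch: composite an expected depth per ray from SDF samples.
    sdf:    (num_rays, num_samples) signed distance at each sampled point
    t_vals: (num_rays, num_samples) distance of each sample along its ray
    s:      sharpness of the SDF-to-opacity conversion (assumed hyperparameter)
    """
    # NeuS-style opacity: alpha peaks where the SDF crosses zero from outside
    # (positive) to inside (negative) along the ray, i.e. at surfaces.
    phi = torch.sigmoid(s * sdf)
    alpha = torch.clamp((phi[:, :-1] - phi[:, 1:]) / (phi[:, :-1] + 1e-6), min=0.0)

    # Standard volume-rendering transmittance and per-sample weights.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-6], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                              # (num_rays, num_samples - 1)

    # Expected depth per ray; back-projecting along ray directions yields 3D points,
    # and an L1 loss against LiDAR depth supervises the prediction.
    return (weights * t_vals[:, :-1]).sum(dim=-1)
```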
Training: Multi-stage approach with loss $L = L_N + 10L_D$ where $L_N$ is next-token prediction loss and $L_D$ is L1 depth loss against ground-truth LiDAR. Stages progress through tokenizer pre-training, BEV-text alignment, and final unification.
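The stated objective maps directly onto a small helper; the function name and tensor shapes below are placeholders for illustration.

```python
import torch

# L = L_N + 10 * L_D: next-token prediction plus an L1 depth term
# supervised by ground-truth LiDAR depth, weighted by 10.
def hermes_loss(text_logits, text_targets, pred_depth, lidar_depth):
    l_n = torch.nn.functional.cross_entropy(
        text_logits.flatten(0, -2), text_targets.flatten())      # L_N
    l_d = torch.nn.functional.l1_loss(pred_depth, lidar_depth)   # L_D
    return l_n + 10.0 * l_d
```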
Results

| Task | Metric | Baseline | HERMES vs. Baseline |
|---|---|---|---|
| Future point cloud prediction | Chamfer Distance (lower is better) | ViDAR | -32.4% |
| Scene captioning | CIDEr (higher is better) | OmniDrive | +8.0% |
Ablation Insights:
- World queries reduce Chamfer Distance by 10% for 3-second predictions
- Processing queries through the LLM (vs. bypassing it) further improves generation, confirming knowledge transfer
- The unified approach outperforms separated baselines in which understanding and generation share features but use separate models
- Scaling from 0.8B to 3.8B parameters yields consistent improvements
The system demonstrates controllable generation, conditioning future predictions on different ego-motion commands ("stop", "turn right"), and provides contextually rich scene descriptions alongside 3D predictions.
Limitations & Open Questions
- Relies on BEVFormer v2 as a frozen perception backbone, inheriting its failure modes in adverse conditions
- Volume rendering quality is limited by BEV resolution; fine-grained details (small objects, distant regions) may be lost
- Whether the unified approach scales to longer prediction horizons (beyond 3 seconds) remains untested
Connections
- BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers -- BEVFormer provides the spatial feature backbone
- WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model -- Complementary BEV world model approach focused on trajectory evaluation
- Learning Transferable Visual Models from Natural Language Supervision -- CLIP encoder used for visual feature extraction
- Planning-Oriented Autonomous Driving -- UniAD's joint-training philosophy extended to world modeling
- nuScenes: A Multimodal Dataset for Autonomous Driving -- Primary evaluation dataset