
Pseudo-Simulation for Autonomous Driving (NAVSIM v2)

:page_facing_up: Read on arXiv

Overview

Pseudo-Simulation by Cao, Hallgarten et al. (Tübingen / Shanghai AI Lab / NVIDIA / Stanford, CoRL 2025) introduces a novel evaluation paradigm for autonomous driving that bridges the gap between open-loop evaluation (fast but unreliable) and closed-loop simulation (accurate but expensive). The key insight is that you can pre-generate diverse synthetic observations from real driving data using 3D Gaussian Splatting, creating a bank of plausible future states that approximate what the ego vehicle would observe under different actions -- without requiring a full online simulator.

NAVSIM v2 implements this pseudo-simulation framework as the de facto standard benchmark for end-to-end autonomous driving evaluation. The benchmark reveals that pseudo-simulation correlates much better with closed-loop simulation (R^2=0.8) than the best existing open-loop metric (R^2=0.7), while being orders of magnitude cheaper than full simulation. It also uncovers previously unknown failure modes in popular AV algorithms that open-loop metrics miss entirely.

Key Contributions

  • Pseudo-simulation paradigm: A new evaluation approach that operates on real datasets but augments them with pre-generated synthetic observations via 3D Gaussian Splatting, combining the realism of open-loop data with the feedback sensitivity of closed-loop evaluation
  • 3D Gaussian Splatting for driving scenes: Specializes Gaussian Splatting for outdoor driving by pre-generating diverse observations varying in position, heading, and speed from initial real-world observations
  • Proximity-based importance weighting: Assigns higher importance to synthetic observations that best match the AV's likely future behavior, approximating the closed-loop compounding error effect
  • NAVSIM v2 benchmark: Public leaderboard with challenging driving scenarios from nuPlan (navhard subset: 450 Stage 1 + 5462 Stage 2 observations), establishing a community standard for E2E driving evaluation
  • Failure mode discovery: Reveals previously unknown failure modes in popular methods that open-loop evaluation misses

Architecture / Method

                    Phase 1: Offline Generation

┌──────────────────┐    3D Gaussian       ┌──────────────────────┐
│ Real Driving     │    Splatting         │ Observation Bank     │
│ Observation      │─────────────────────►│ (varied position,    │
│ (cameras + ego)  │    Re-render from    │  heading, speed)     │
└──────────────────┘    novel viewpoints  └──────────┬───────────┘
                                                     │ cached
                    Phase 2: Online Evaluation       │
                                                     │
┌──────────────────┐                                 │
│ Current          │◄────────────────────────────────┘
│ Observation      │         proximity-based
└────────┬─────────┘         selection
         │
         ▼
┌──────────────────┐    select closest    ┌──────────────────────┐
│ AV Model         │    pre-generated     │ Next Observation     │
│ predicts action  │─────────────────────►│ (from bank)          │
└──────────────────┘    observation       └──────────┬───────────┘
                                                     │
                                          ┌──────────▼───────────┐
                                          │ Repeat for T steps   │
                                          │ (captures compound-  │
                                          │  ing errors)         │
                                          └──────────────────────┘

The pseudo-simulation pipeline operates in two phases:

Phase 1 -- Observation Generation (Offline): From each real-world driving observation (multi-camera images + ego state), the system generates a bank of synthetic observations using 3D Gaussian Splatting. The Gaussians are fit to the real driving scene and then re-rendered from novel viewpoints corresponding to different ego positions, headings, and speeds. This produces a set of plausible future observations the ego might encounter under different actions. This phase is done once and cached.
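
As a sketch of what the offline phase produces, the snippet below enumerates a grid of perturbed ego viewpoints and re-renders each one. `scene.render` is a hypothetical stand-in for a fitted 3D Gaussian Splatting model, and the grid values are illustrative, not the paper's:

```python
# Sketch of Phase 1: enumerating perturbed ego viewpoints for the observation
# bank. `scene.render` is a hypothetical stand-in for a fitted 3D Gaussian
# Splatting model; the perturbation grids below are illustrative values.
import itertools
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EgoPose:
    x: float        # longitudinal position (m)
    y: float        # lateral position (m)
    heading: float  # heading (rad)
    speed: float    # speed (m/s)

def generate_observation_bank(scene, base: EgoPose):
    """Re-render the reconstructed scene from a grid of perturbed ego states."""
    lateral = [-2.0, -1.0, 0.0, 1.0, 2.0]        # lateral offsets (m)
    headings = [-0.2, 0.0, 0.2]                  # heading offsets (rad)
    speeds = [0.8, 1.0, 1.2]                     # speed scale factors
    bank = []
    for dy, dh, s in itertools.product(lateral, headings, speeds):
        pose = replace(base, y=base.y + dy,
                       heading=base.heading + dh, speed=base.speed * s)
        bank.append((pose, scene.render(pose)))  # rendered once, then cached
    return bank
```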

Phase 2 -- Evaluation (Online): The AV model is evaluated in a loop (a minimal sketch follows the list):

  1. Given an observation, the AV predicts an action (a trajectory)
  2. The system selects the pre-generated synthetic observation that best matches where the AV's action would take it, using proximity-based weighting
  3. This synthetic observation becomes the input for the next step
  4. The process repeats, capturing the compounding errors that open-loop evaluation misses
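
The sketch below assumes a hypothetical `model.predict` interface returning an (N, 2) array of trajectory waypoints, and a bank stored as ((x, y), observation) pairs. A hard nearest-neighbor pick stands in for the paper's soft proximity weighting, described next:

```python
# Minimal sketch of the Phase 2 evaluation loop (hypothetical interfaces).
import numpy as np

def pseudo_simulate(model, obs, bank, T):
    """Roll the AV model through T steps of pre-generated observations.

    bank: list of ((x, y) ego position, rendered observation) pairs.
    """
    for _ in range(T):
        traj = np.asarray(model.predict(obs))        # (N, 2) waypoints
        endpoint = traj[-1]                          # where the action would land
        # Select the cached synthetic observation nearest to that endpoint.
        dists = [np.linalg.norm(np.asarray(pos) - endpoint) for pos, _ in bank]
        _, obs = bank[int(np.argmin(dists))]         # feeds back as next input
    return obs
```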

The proximity-based weighting scheme is critical: rather than requiring an exact trajectory match (which would need infinitely many pre-generated observations), the system computes a Gaussian-weighted average over the bank, w⁽ⁱ⁾ = exp(−‖x⁽ⁱ⁾ − x̂‖² / 2σ²), where x̂ is the endpoint of the AV's Stage 1 trajectory, x⁽ⁱ⁾ is the position of the i-th synthetic observation, and the kernel variance σ² controls the softness of the match (σ² = 0.1 was found optimal). This prioritizes the synthetic observations closest to where the AV would actually end up, yielding a soft approximation of the true closed-loop dynamics.
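
A minimal NumPy sketch of this weighting, applied to per-candidate Stage 2 scores; the candidate positions and scores are made-up numbers, and the aggregation shown is illustrative rather than the paper's full EPDMS computation:

```python
# Gaussian proximity weights, w_i = exp(-||x_i - x_hat||^2 / 2σ²), applied to
# per-candidate Stage 2 scores. Positions and scores are hypothetical values.
import numpy as np

def proximity_weights(candidates, endpoint, sigma2=0.1):
    """candidates: (N, 2) synthetic-observation positions; endpoint: (2,) Stage 1 endpoint."""
    sq_dists = np.sum((np.asarray(candidates) - np.asarray(endpoint)) ** 2, axis=1)
    w = np.exp(-sq_dists / (2.0 * sigma2))
    return w / w.sum()                        # normalize into a weighted average

candidates = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]])
stage2_scores = np.array([0.9, 0.7, 0.4])     # hypothetical per-candidate scores
w = proximity_weights(candidates, endpoint=[0.2, 0.0])
print(float(w @ stage2_scores))               # soft, proximity-weighted score
```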

NAVSIM v2 Benchmark Design: The benchmark uses a curated subset of nuPlan called "navhard" -- challenging scenarios evaluated over two stages: Stage 1 scores the AV from the real initial observation, and Stage 2 continues from the pre-generated synthetic observations, weighted by their proximity to the Stage 1 endpoint. The primary leaderboard metric is EPDMS (Extended PDM Score), adapted to pseudo-simulation scoring.

Results

Correlation with Closed-Loop Simulation

Evaluation Method                   R^2 with Closed-Loop
----------------------------------  --------------------
Standard open-loop (L2 distance)    0.3-0.5
Best open-loop metric (PDM-Open)    0.7
Pseudo-simulation (NAVSIM v2)       0.8

The v2 benchmark uses EPDMS (not PDMS), a different metric reflecting pseudo-simulation-based scoring. Scores are not directly comparable to NAVSIM v1 PDMS values.

The paper establishes a public leaderboard for the community. For current rankings and submitted method scores, see the online leaderboard (not reproduced here to avoid stale snapshot data).

Notably, several methods that score well on open-loop metrics fail under pseudo-simulation due to compounding errors in dynamic scenarios.

Limitations

  • Pre-generated observations cannot capture all possible future states; rare or extreme deviations from the observation bank may not be well represented
  • 3D Gaussian Splatting rendering quality degrades at large viewpoint changes from the original observation
  • Does not model reactive agents (other vehicles, pedestrians do not respond to the ego vehicle's actions)
  • Requires high-quality 3D reconstruction of driving scenes, which depends on sensor quality and scene complexity
  • R^2=0.8 correlation with closed-loop is strong but not perfect; safety-critical edge cases may still be missed

Connections