Pseudo-Simulation for Autonomous Driving (NAVSIM v2)
Overview
Pseudo-Simulation by Cao, Hallgarten et al. (Tübingen / Shanghai AI Lab / NVIDIA / Stanford, CoRL 2025) introduces an evaluation paradigm for autonomous driving that bridges the gap between open-loop evaluation (fast but unreliable) and closed-loop simulation (accurate but expensive). The key insight is that diverse synthetic observations can be pre-generated from real driving data using 3D Gaussian Splatting. This creates a bank of plausible future states that approximate what the ego vehicle would observe under different actions, without requiring a full online simulator.
NAVSIM v2 implements this pseudo-simulation framework as a benchmark for end-to-end autonomous driving evaluation. Pseudo-simulation correlates better with closed-loop simulation (R^2=0.8) than the best existing open-loop metric (R^2=0.7), while being considerably cheaper to run than full simulation. It also uncovers failure modes in popular AV algorithms that open-loop metrics miss entirely.
Key Contributions
- Pseudo-simulation paradigm: A new evaluation approach that operates on real datasets but augments them with pre-generated synthetic observations via 3D Gaussian Splatting, combining the realism of open-loop data with the feedback sensitivity of closed-loop evaluation
- 3D Gaussian Splatting for driving scenes: Specializes Gaussian Splatting for outdoor driving by pre-generating diverse observations varying in position, heading, and speed from initial real-world observations
- Proximity-based importance weighting: Assigns higher importance to synthetic observations that best match the AV's likely future behavior, approximating the closed-loop compounding error effect
- NAVSIM v2 benchmark: Public leaderboard with challenging driving scenarios from nuPlan (navhard subset: 450 Stage 1 + 5462 Stage 2 observations), establishing a community standard for E2E driving evaluation
- Failure mode discovery: Reveals previously unknown failure modes in popular methods that open-loop evaluation misses
Architecture / Method
Phase 1: Offline Generation
┌──────────────────┐   3D Gaussian      ┌──────────────────────┐
│ Real Driving     │   Splatting        │ Observation Bank     │
│ Observation      │───────────────────►│ (varied position,    │
│ (cameras + ego)  │   Re-render from   │ heading, speed)      │
└──────────────────┘   novel viewpoints └──────────┬───────────┘
                                                   │ cached
Phase 2: Online Evaluation                         │
┌──────────────────┐                               │
│ Current          │◄──────────────────────────────┘
│ Observation      │   proximity-based
└────────┬─────────┘   selection
         │
         ▼
┌──────────────────┐   select closest   ┌──────────────────────┐
│ AV Model         │   pre-generated    │ Next Observation     │
│ predicts action  │───────────────────►│ (from bank)          │
└──────────────────┘   observation      └──────────┬───────────┘
                                                   │
                                        ┌──────────▼───────────┐
                                        │ Repeat for T steps   │
                                        │ (captures            │
                                        │ compounding errors)  │
                                        └──────────────────────┘
The pseudo-simulation pipeline operates in two phases:
Phase 1 -- Observation Generation (Offline): From each real-world driving observation (multi-camera images + ego state), the system generates a bank of synthetic observations using 3D Gaussian Splatting. The Gaussians are fit to the real driving scene and then re-rendered from novel viewpoints corresponding to different ego positions, headings, and speeds. This produces a set of plausible future observations the ego might encounter under different actions. This phase is done once and cached.
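To make the "different ego positions, headings, and speeds" concrete, the following is a minimal sketch of how a set of perturbed ego poses could be enumerated around one recorded pose. It covers only the pose sampling, not the Gaussian Splatting fit or rendering. The `EgoPose` type, the `candidate_poses` helper, and all offset values are hypothetical and are not taken from the paper or the NAVSIM codebase.

```python
import itertools
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EgoPose:
    x: float        # metres, world frame
    y: float        # metres, world frame
    heading: float  # radians, world frame
    speed: float    # metres per second


def candidate_poses(base, lateral_offsets, heading_offsets, speed_scales):
    """Enumerate perturbed poses around a recorded ego pose (illustrative only)."""
    poses = []
    for dy, dh, scale in itertools.product(
        lateral_offsets, heading_offsets, speed_scales
    ):
        # Shift sideways, perpendicular to the recorded heading.
        x = base.x - dy * math.sin(base.heading)
        y = base.y + dy * math.cos(base.heading)
        poses.append(EgoPose(x, y, base.heading + dh, base.speed * scale))
    return poses


# Assumed values: 3 lateral shifts x 3 heading offsets x 3 speed scales.
base = EgoPose(x=0.0, y=0.0, heading=0.0, speed=8.0)
bank = candidate_poses(base, [-1.0, 0.0, 1.0], [-0.1, 0.0, 0.1], [0.5, 1.0, 1.5])
print(len(bank))  # 27 candidate poses, one of which is the unperturbed base pose
```

In a real pipeline each candidate pose would be handed to the renderer once and the resulting images cached, which is what makes the offline phase a one-time cost.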
Phase 2 -- Evaluation (Online): The AV model is evaluated in a loop:
1. Given an observation, the AV predicts an action (trajectory)
2. The system selects the pre-generated synthetic observation that best matches where the AV's action would take it, using proximity-based weighting
3. This synthetic observation becomes the input for the next step
4. The process repeats, capturing compounding errors that open-loop evaluation misses
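The evaluation loop described above can be sketched with a toy observation bank. This is an illustration under simplifying assumptions: observations are reduced to an id and a 2D ego position, and the policy returns only the position it wants to reach. The names `nearest_observation`, `pseudo_rollout`, and `drifting_policy` are hypothetical and do not correspond to the NAVSIM API.

```python
import math


def nearest_observation(bank, target_xy):
    """Pick the cached observation whose ego position is closest to target_xy."""
    return min(bank, key=lambda obs: math.dist(obs["xy"], target_xy))


def pseudo_rollout(policy, first_obs, bank, steps):
    """Feed the policy observations that reflect its own previous actions."""
    obs = first_obs
    visited = [obs["id"]]
    for _ in range(steps):
        target_xy = policy(obs)                      # step 1: predict an action
        obs = nearest_observation(bank, target_xy)   # steps 2-3: select and feed back
        visited.append(obs["id"])                    # step 4: repeat
    return visited


# Toy bank: two cached views at x = 5 m and two at x = 10 m.
bank = [
    {"id": "a", "xy": (5.0, 0.0)},
    {"id": "b", "xy": (5.0, 1.0)},
    {"id": "c", "xy": (10.0, 0.0)},
    {"id": "d", "xy": (10.0, 2.0)},
]
real_obs = {"id": "real", "xy": (0.0, 0.0)}


def drifting_policy(obs):
    """A deliberately flawed policy that drifts 0.8 m to the left every step."""
    x, y = obs["xy"]
    return (x + 5.0, y + 0.8)


path = pseudo_rollout(drifting_policy, real_obs, bank, steps=2)
print(path)  # ['real', 'b', 'd']
```

The drift accumulates across steps (the rollout ends at the view 2 m off-centre), which is the compounding effect that a single open-loop comparison against the logged trajectory would not expose.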
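The proximity-based weighting scheme is critical. Exact trajectory matching would require infinitely many pre-generated observations, so the system instead forms a Gaussian-weighted average over the synthetic observations. Observation $i$, generated at position $x^{(i)}$, receives weight $w^{(i)} = \exp\left(-\lVert x^{(i)} - \hat{x} \rVert^2 / (2\sigma^2)\right)$, where $\hat{x}$ is the Stage 1 endpoint and $\sigma^2$ is the kernel variance (reported as optimal at $\sigma^2 = 0.1$). Observations closest to the Stage 1 endpoint therefore dominate, giving a soft approximation of the true closed-loop dynamics.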
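As a rough illustration of the weighting formula, the sketch below computes Gaussian kernel weights for a few synthetic observation positions and uses them to average per-observation scores. Normalizing the weights so they sum to one is an assumption made here so that the result is a weighted average; the function names, positions, and scores are hypothetical.

```python
import math


def proximity_weights(synthetic_xy, endpoint_xy, sigma_sq=0.1):
    """Normalized Gaussian kernel weights, largest for positions near endpoint_xy."""
    raw = [
        math.exp(-math.dist(p, endpoint_xy) ** 2 / (2.0 * sigma_sq))
        for p in synthetic_xy
    ]
    total = sum(raw)
    if total == 0.0:
        # Every weight underflowed (all positions are far away): fall back to uniform.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def weighted_score(scores, weights):
    """Weighted average of per-observation scores."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical positions (metres) of three synthetic observations.
positions = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]
endpoint = (0.1, 0.0)            # assumed Stage 1 endpoint
stage2_scores = [1.0, 0.5, 0.0]  # made-up per-observation scores

weights = proximity_weights(positions, endpoint)
score = weighted_score(stage2_scores, weights)
```

With a variance of 0.1, the observation 2 m away contributes almost nothing, so the aggregate is driven by the two positions within half a metre of the endpoint.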
NAVSIM v2 Benchmark Design: The benchmark uses a curated subset of nuPlan called "navhard" -- challenging scenarios evaluated over two stages. The primary leaderboard metric is EPDMS (Extended PDM Score), adapted to pseudo-simulation scoring.
Results
Correlation with Closed-Loop Simulation
| Evaluation Method | R^2 with Closed-Loop |
|---|---|
| Standard open-loop (L2 distance) | 0.3-0.5 |
| Best open-loop metric (PDM-Open) | 0.7 |
| Pseudo-simulation (NAVSIM v2) | 0.8 |
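For readers who want to reproduce this kind of comparison, a minimal sketch of an R^2 computation follows. It assumes R^2 is the squared Pearson correlation between a candidate metric and the closed-loop score across a set of planners, which equals the coefficient of determination of a simple linear fit. The summary above does not specify the exact procedure used, and the numbers below are invented for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up scores for four planners: a candidate metric vs. closed-loop results.
candidate_metric = [1.0, 2.0, 3.0, 4.0]
closed_loop = [1.0, 3.0, 2.0, 4.0]
print(r_squared(candidate_metric, closed_loop))  # 0.64
```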
NAVSIM v2 Leaderboard
The v2 benchmark uses EPDMS (not PDMS), a different metric reflecting pseudo-simulation-based scoring. Scores are not directly comparable to NAVSIM v1 PDMS values.
The paper establishes a public leaderboard for the community. For current rankings and submitted method scores, see the online leaderboard (not reproduced here to avoid stale snapshot data).
Pseudo-simulation reveals that several methods that score well on open-loop metrics fail in pseudo-simulation due to compounding errors in dynamic scenarios.
Limitations
- Pre-generated observations cannot capture all possible future states; rare or extreme deviations from the observation bank may not be well represented
- 3D Gaussian Splatting rendering quality degrades at large viewpoint changes from the original observation
- Does not model reactive agents (other vehicles, pedestrians do not respond to the ego vehicle's actions)
- Requires high-quality 3D reconstruction of driving scenes, which depends on sensor quality and scene complexity
- R^2=0.8 correlation with closed-loop is strong but not perfect; safety-critical edge cases may still be missed
Connections
- Supersedes the original NAVSIM (2406.15349, NeurIPS 2024) as the standard E2E driving benchmark
- SparseDriveV2: End-to-End Autonomous Driving via Sparse Scene Representation reports 92.0 PDMS on NAVSIM v1
- CARLA: An Open Urban Driving Simulator provides full closed-loop simulation that NAVSIM v2 approximates more cheaply
- Builds on the nuPlan dataset, which extends nuScenes: A Multimodal Dataset for Autonomous Driving
- Evaluates methods like Planning-Oriented Autonomous Driving (UniAD) and TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving (TransFuser)
- Gaussian Splatting rendering connects to the perception methods in GaussianOcc: Fully Self-Supervised 3D Occupancy Estimation with Gaussian Splatting and GaussRender: Learning 3D Occupancy with Gaussian Rendering