Autonomous Driving

Autonomous driving is the central application domain of this wiki. The field has undergone three distinct eras of architectural philosophy, each reflecting broader shifts in how ML is applied to safety-critical control.

The traditional stack

The canonical decomposition splits driving into Perception, Prediction, and Planning, with mapping, localization, control, and safety monitoring as overlays. Each module is developed and evaluated independently, with hand-designed interfaces between them. This modularity aids debugging and certification but creates information bottlenecks and error propagation.
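
The decomposition above can be sketched as a chain of narrow interfaces. Every type and function here is an illustrative stand-in, not any production stack, but it shows how each stage consumes only the previous stage's output -- the information bottleneck through which errors propagate:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:          # perception output: object position (x, y)
    x: float
    y: float

@dataclass
class Forecast:           # prediction output: future positions of one object
    waypoints: List[Tuple[float, float]]

def perceive(sensor_frame) -> List[Detection]:
    # Stand-in for a detector; a missed object here is unrecoverable downstream.
    return [Detection(x=10.0, y=0.0)]

def predict(dets: List[Detection]) -> List[Forecast]:
    # Constant-velocity stand-in for a learned predictor.
    return [Forecast([(d.x + t, d.y) for t in range(1, 4)]) for d in dets]

def plan(forecasts: List[Forecast]) -> List[Tuple[float, float]]:
    # Toy planner: slow down if any forecast enters the ego lane (|y| < 1).
    blocked = any(abs(y) < 1.0 for f in forecasts for (_, y) in f.waypoints)
    speed = 2.0 if blocked else 10.0
    return [(speed * t, 0.0) for t in range(1, 4)]

# Each stage sees only the previous stage's output.
ego_traj = plan(predict(perceive(sensor_frame=None)))
print(ego_traj)
```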

Era 1: Modular pipelines (pre-2020)

Early learning-based driving focused on individual modules. Nuscenes A Multimodal Dataset For Autonomous Driving provided the benchmark that drove perception research. Systems used separate detection, tracking, prediction, and planning components. The key limitation: optimizing each module independently does not optimize the full driving task. Errors in perception propagate through prediction into planning with no mechanism for recovery.

Era 2: Hybrid and end-to-end learning (2020--2023)

The second era introduced joint training across modules while preserving interpretable intermediate representations. Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving fused camera and LiDAR features through transformers for direct waypoint prediction. Planning Oriented Autonomous Driving (UniAD) demonstrated that jointly training perception, prediction, and planning with a planning-centric loss yields large gains. Vad Vectorized Scene Representation For Efficient Autonomous Driving showed that vectorized scene representations enable efficient end-to-end driving without dense rasterized maps.
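
UniAD's planning-centric objective can be caricatured as a weighted sum of per-module losses in which the planning term dominates the gradient through shared features. The weights and loss values below are illustrative assumptions, not the paper's actual configuration:

```python
import math

def l2_waypoint_loss(pred, target):
    # Average Euclidean distance between predicted and ground-truth waypoints.
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def joint_loss(l_percep, l_pred, l_plan, weights=(0.1, 0.2, 1.0)):
    # Planning-centric weighting: the planning term dominates the shared gradient,
    # so upstream modules are optimized for the downstream driving task.
    wp, wf, wl = weights
    return wp * l_percep + wf * l_pred + wl * l_plan

l_plan = l2_waypoint_loss([(1.0, 0.0), (2.0, 0.0)], [(1.0, 0.0), (2.0, 1.0)])
total = joint_loss(l_percep=2.0, l_pred=1.5, l_plan=l_plan)
print(round(total, 2))
```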

Imitation learning matured through this era. Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst introduced data augmentation for imitation robustness. Learning By Cheating established the privileged-agent distillation paradigm: train an expert with ground-truth access, then distill it into a sensorimotor student. Carla An Open Urban Driving Simulator became the standard closed-loop simulation testbed.
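
The privileged-agent distillation recipe can be sketched as a two-stage loop: the expert acts on ground truth, and the student regresses the expert's action from sensor input alone. The expert, encoder, and linear student below are toy stand-ins, under the assumption of an L2 imitation loss:

```python
# Sketch of privileged-agent distillation (Learning-By-Cheating style).
# All functions are illustrative stand-ins for learned components.
def expert_policy(ground_truth_state):
    # Privileged agent: acts on the exact obstacle distance.
    dist = ground_truth_state["obstacle_dist"]
    return 0.0 if dist < 5.0 else 1.0   # throttle command

def student_features(image):
    # Stand-in for a sensorimotor encoder; the "image" encodes distance noisily.
    return image["noisy_dist"]

def distill_step(student_w, sample, lr=0.1):
    # Student: linear throttle = w * feature, fit to the expert's label (L2 loss).
    feat = student_features(sample["image"])
    target = expert_policy(sample["state"])
    pred = student_w * feat
    grad = 2 * (pred - target) * feat
    return student_w - lr * grad

w = 0.0
data = [{"state": {"obstacle_dist": 20.0}, "image": {"noisy_dist": 1.0}}] * 50
for sample in data:
    w = distill_step(w, sample)
print(round(w, 3))   # student converges toward the expert's throttle of 1.0
```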

Era 3: Foundation models and VLA systems (2023+)

The current era applies large vision-language models directly to driving. Emma End To End Multimodal Model For Autonomous Driving (EMMA) treats all driving outputs as language tokens, including trajectories. Senna Bridging Large Vision Language Models And End To End Autonomous Driving decouples VLM reasoning from continuous planning. Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation integrates vision-language understanding with action generation in closed-loop. Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving achieves real-time deployment with RL-enhanced reasoning.
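
The everything-as-tokens idea reduces to serializing continuous waypoints into text that a language model can emit and that a parser can decode back into a trajectory. The tag format below is a made-up illustration, not EMMA's actual vocabulary:

```python
import re

# Sketch of an "everything as tokens" trajectory interface: continuous
# waypoints become text so a language model can emit them alongside
# other driving outputs. The <x..><y..> format is an assumption.
def traj_to_tokens(waypoints, precision=1):
    return " ".join(f"<x{round(x, precision)}><y{round(y, precision)}>"
                    for x, y in waypoints)

def tokens_to_traj(text):
    pairs = re.findall(r"<x(-?[\d.]+)><y(-?[\d.]+)>", text)
    return [(float(x), float(y)) for x, y in pairs]

traj = [(1.23, 0.0), (2.47, -0.31)]
text = traj_to_tokens(traj)
print(text)
print(tokens_to_traj(text))
```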

This era is also marked by the introduction of RL beyond imitation: Alphadrive Unleashing The Power Of Vlms In Autonomous Driving applies GRPO-style RL to driving VLMs, while Drivemoe Mixture Of Experts For Vision Language Action In Autonomous Driving uses mixture-of-experts to handle the multimodal nature of driving decisions.
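
The core of GRPO-style training is a group-relative advantage: sample several candidate answers per scene, score them, and normalize each reward against its own group's mean and spread, so no learned value function is needed. A minimal sketch, with illustrative reward values:

```python
import statistics

# Group-relative advantage as used in GRPO-style RL: each sampled
# answer is reinforced in proportion to how much it beats its own
# group's average. Reward values below are illustrative.
def group_advantages(rewards, eps=1e-8):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. scores for 4 sampled plans in one scene (collision-free, progress, ...)
advs = group_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in advs])
```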

AutoVLA (Autovla Vision Language Action Model For End To End Autonomous Driving, 2025) introduces dual-process adaptive reasoning -- dynamically switching between fast direct action and slow chain-of-thought reasoning based on scenario complexity -- with RL fine-tuning on a compact Qwen2.5-VL-3B backbone.
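
Dual-process routing reduces to a cheap complexity gate in front of two inference paths. The scoring heuristic and threshold below are invented for illustration, not AutoVLA's actual gating mechanism:

```python
# Sketch of dual-process routing: a cheap complexity score gates between
# a fast direct-action head and a slow chain-of-thought path.
def scene_complexity(num_agents, at_intersection):
    # Illustrative heuristic; a real system would learn this signal.
    return num_agents / 10.0 + (0.5 if at_intersection else 0.0)

def drive(num_agents, at_intersection, threshold=0.6):
    if scene_complexity(num_agents, at_intersection) < threshold:
        return ("fast", "direct_action")      # System-1: emit action tokens
    return ("slow", "reason_then_act")        # System-2: CoT before acting

print(drive(num_agents=2, at_intersection=False))  # simple scene
print(drive(num_agents=7, at_intersection=True))   # complex scene
```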

DriveTransformer (Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving, 2025) rethinks the E2E architecture itself: parallel task processing with sparse queries replaces the sequential dense-BEV pipeline. It achieves state-of-the-art results on Bench2Drive, and its scaling study finds that scaling the decoder matters more than scaling the backbone.

Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model demonstrates that open-source VLAs with hierarchical 3D queries can match larger models at 0.5B scale. Dima Distilling Multi Modal Large Language Models For Autonomous Driving shows MLLM reasoning can be distilled into efficient vision planners, resolving the efficiency-vs-reasoning tradeoff with 80% collision reduction and zero inference overhead.

World models have also emerged as a key paradigm. Hermes A Unified Self Driving World Model For Simultaneous 3D Scene Understanding And Generation unifies 3D scene understanding and future generation in a single LLM framework. Gaussianworld Gaussian World Model For Streaming 3D Occupancy Prediction reformulates occupancy prediction as world modeling using 3D Gaussians. Momad Momentum Aware Planning In End To End Autonomous Driving addresses temporal inconsistency in E2E planning through momentum-aware trajectory selection.
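
An occupancy world model rolls forward autoregressively: each predicted grid is appended to the history and fed back in as input for the next step. The sketch below replaces the learned tokenizer-plus-transformer model with a toy 1D shift, to show only the rollout loop itself:

```python
# Sketch of an autoregressive occupancy world model rollout. The learned
# model is replaced by a toy shift that mimics oncoming traffic moving
# through a 1D occupancy grid.
def toy_world_model(history):
    # Stand-in for a learned next-occupancy predictor.
    last = history[-1]
    return last[1:] + [0]

def rollout(initial_history, horizon):
    # Feed the model's own predictions back in to roll into the future.
    history = list(initial_history)
    for _ in range(horizon):
        history.append(toy_world_model(history))
    return history[len(initial_history):]

future = rollout([[0, 0, 1, 0, 0]], horizon=3)
print(future)
```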

Benchmarks and evaluation

Evaluation spans open-loop benchmarks such as nuScenes, closed-loop simulation in CARLA and Bench2Drive, and pseudo-simulation with NAVSIM. The gap between open-loop and closed-loop scores is taken up under open problems below.

What makes driving distinct

  • Safety-critical operation at high speed with no tolerance for exploration failures
  • Severe long-tail distribution: rare events dominate real-world risk
  • Multi-agent interaction with partially observable, adversarial participants
  • Large train/deploy distribution gap across geographies, weather, and infrastructure

Present state and open problems

  • Closed-loop gap: Many state-of-the-art systems still rely primarily on open-loop evaluation. Bridging the open-loop/closed-loop performance gap is the field's most urgent methodological problem.
  • Sim-to-real transfer: CARLA results do not reliably predict real-world performance. Better simulators and domain adaptation remain critical.
  • Safety certification: No consensus framework exists for certifying learned driving systems.
  • Data scaling: Whether scaling driving data follows the same power laws as language modeling is unresolved.
  • Interpretability: Regulators and users demand explanations for driving decisions, but most end-to-end systems operate as black boxes.

Key papers

Paper -- Contribution
Planning Oriented Autonomous Driving -- UniAD: joint training with planning-centric objective
Vad Vectorized Scene Representation For Efficient Autonomous Driving -- Vectorized E2E driving without rasterized maps
Emma End To End Multimodal Model For Autonomous Driving -- Everything-as-tokens multimodal driving
Learning By Cheating -- Privileged-agent distillation paradigm
Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving -- Transformer-based sensor fusion for driving
Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst -- Robust imitation via data augmentation
Nuscenes A Multimodal Dataset For Autonomous Driving -- Multimodal driving dataset and benchmarks
Carla An Open Urban Driving Simulator -- Open urban driving simulator
Autovla Vision Language Action Model For End To End Autonomous Driving -- Adaptive dual-process VLA with RL
Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving -- Parallel-task sparse transformer for E2E driving
S4 Driver Scalable Self Supervised Driving Mllm With Spatio Temporal Visual Representation -- Self-supervised MLLM for scalable annotation-free driving
Bridgead Bridging Past And Future End To End Autonomous Driving With Historical Prediction -- History-enhanced E2E driving with multi-step temporal queries
Gaussianlss Toward Real World Bev Perception With Depth Uncertainty Via Gaussian Splatting -- Efficient uncertainty-aware BEV via Gaussian Splatting
Drive Occworld Driving In The Occupancy World -- 4D occupancy world model for planning
Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model -- Open-source VLA with hierarchical 3D scene queries
Hermes A Unified Self Driving World Model For Simultaneous 3D Scene Understanding And Generation -- Unified world model for 3D understanding + generation
Momad Momentum Aware Planning In End To End Autonomous Driving -- Momentum-aware temporal consistency for planning
Gaussianworld Gaussian World Model For Streaming 3D Occupancy Prediction -- Gaussian world model for streaming occupancy
Dima Distilling Multi Modal Large Language Models For Autonomous Driving -- MLLM-to-planner distillation
Asyncdriver Asynchronous Large Language Model Enhanced Planner For Autonomous Driving -- Asynchronous LLM-planner decoupling, ~40% cost reduction
Gaussianformer Scene As Gaussians For Vision Based 3D Semantic Occupancy Prediction -- Sparse Gaussian occupancy with 5-6x memory reduction
Driving Gaussian Composite Gaussian Splatting For Surrounding Dynamic Driving Scenes -- Gaussian splatting for dynamic driving scene reconstruction
Occworld Learning A 3D Occupancy World Model For Autonomous Driving -- Original 3D occupancy world model with VQ-VAE + GPT
Gaussianocc Fully Self Supervised 3D Occupancy Estimation With Gaussian Splatting -- Fully self-supervised 3D occupancy via Gaussian splatting
Gaussianflowocc Sparse Occupancy With Gaussian Splatting And Temporal Flow -- Sparse Gaussian occupancy + temporal flow, 50x faster
Gaussrender Learning 3D Occupancy With Gaussian Rendering -- Plug-and-play Gaussian rendering loss for occupancy
Racformer Query Based Radar Camera Fusion For 3D Object Detection -- Radar-camera fusion surpassing LiDAR-only (CVPR 2025)
Sparsedrive End To End Autonomous Driving Via Sparse Scene Representation -- Fully sparse E2E driving, parallel prediction-planning
Sparsedrivev2 End To End Autonomous Driving Via Sparse Scene Representation -- Factorized trajectory vocabulary, 92.0 PDMS NAVSIM SOTA
Navsim V2 Pseudo Simulation For Autonomous Driving -- Pseudo-simulation benchmark for E2E driving (CoRL 2025)