Autonomous Driving
Autonomous driving is the central application domain of this wiki. The field has undergone three distinct eras of architectural philosophy, each reflecting broader shifts in how ML is applied to safety-critical control.
The traditional stack
The canonical decomposition splits driving into Perception, Prediction, and Planning, with mapping, localization, control, and safety monitoring as overlays. Each module is developed and evaluated independently, with hand-designed interfaces between them. This modularity aids debugging and certification but creates information bottlenecks and error propagation.
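The decomposition above can be sketched as code. This is a deliberately toy illustration, not any production stack: the module names, dataclasses, and the constant-velocity predictor are all invented here to show how each stage consumes only the previous stage's output, which is exactly where the information bottleneck arises.

```python
from dataclasses import dataclass

# Hypothetical hand-designed interfaces between modules. Information dropped
# at a boundary (e.g. detection uncertainty) cannot be recovered downstream.

@dataclass
class Detection:
    position: tuple  # ego-frame (x, y) in meters
    velocity: tuple  # (vx, vy) in m/s

@dataclass
class Forecast:
    agent: Detection
    future_xy: list  # predicted (x, y) waypoints

def perceive(sensor_frame: dict) -> list:
    """Perception: raw sensors -> object list (placeholder logic)."""
    return [Detection(tuple(o["xy"]), tuple(o["v"])) for o in sensor_frame["objects"]]

def predict(detections: list, horizon: int = 3) -> list:
    """Prediction: constant-velocity rollout, one forecast per agent."""
    forecasts = []
    for d in detections:
        pts = [(d.position[0] + d.velocity[0] * t,
                d.position[1] + d.velocity[1] * t) for t in range(1, horizon + 1)]
        forecasts.append(Forecast(d, pts))
    return forecasts

def plan(forecasts: list, cruise_speed: float = 5.0) -> float:
    """Planning: brake if any forecast enters the ego corridor ahead."""
    for f in forecasts:
        if any(abs(y) < 2.0 and 0.0 < x < 20.0 for x, y in f.future_xy):
            return 0.0  # brake
    return cruise_speed

# A pedestrian at (10, 6) walking toward the ego lane at -1.5 m/s laterally.
frame = {"objects": [{"xy": (10.0, 6.0), "v": (0.0, -1.5)}]}
speed = plan(predict(perceive(frame)))  # forecast reaches y=1.5 -> brake
```

Each `def` boundary here is a hand-designed interface in the sense of the text: the planner never sees raw sensor data, only what perception and prediction chose to pass along.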
Era 1: Modular pipelines (pre-2020)
Early learning-based driving focused on individual modules. Nuscenes A Multimodal Dataset For Autonomous Driving provided the benchmark that drove perception research. Systems used separate detection, tracking, prediction, and planning components. The key limitation: optimizing each module independently does not optimize the full driving task. Errors in perception propagate through prediction into planning with no mechanism for recovery.
Era 2: Hybrid and end-to-end learning (2020--2023)
The second era introduced joint training across modules while preserving interpretable intermediate representations. Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving fused camera and LiDAR features through transformers for direct waypoint prediction. Planning Oriented Autonomous Driving (UniAD) demonstrated that jointly training perception, prediction, and planning with a planning-centric loss yields large gains. Vad Vectorized Scene Representation For Efficient Autonomous Driving showed that vectorized scene representations enable efficient end-to-end driving without dense rasterized maps.
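The planning-centric training idea can be shown in miniature. The weights and loss values below are illustrative only, not UniAD's actual losses: the point is simply that all modules contribute to one objective, with planning weighted so that upstream representations are shaped by what the planner needs.

```python
# Hypothetical weighted multi-task objective; numbers are illustrative.
def joint_loss(losses: dict, weights: dict) -> float:
    """Single scalar objective over all modules; gradients flow end to end."""
    return sum(weights[name] * losses[name] for name in losses)

module_losses = {"detection": 1.2, "tracking": 0.8,
                 "prediction": 0.9, "planning": 0.5}
# Planning-centric weighting: the planning term dominates.
planning_centric = {"detection": 0.2, "tracking": 0.2,
                    "prediction": 0.5, "planning": 1.0}
total = joint_loss(module_losses, planning_centric)
```

Contrast this with Era 1, where each entry of `module_losses` would be minimized in isolation with no shared gradient signal.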
Imitation learning matured through this era. Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst introduced data augmentation for imitation robustness. Learning By Cheating established the privileged-agent distillation paradigm: train an expert with ground-truth access, then distill into a sensorimotor student. Simulation benchmarks like Carla An Open Urban Driving Simulator became the standard testbed.
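The privileged-agent distillation paradigm can be sketched in a few lines. Everything below is a toy stand-in, assuming a scalar steering task and a linear student; the actual Learning By Cheating agents are neural networks trained on images, but the two-stage structure is the same: a teacher acts on ground truth, and a student with only sensor access regresses the teacher's actions.

```python
import random

def privileged_teacher(true_obstacle_x: float) -> float:
    """Teacher with ground-truth state: steer away proportionally (toy policy)."""
    return -0.5 * true_obstacle_x

class SensorStudent:
    """Student sees only a noisy sensor estimate; imitates the teacher."""
    def __init__(self) -> None:
        self.w = 0.0

    def act(self, sensed_x: float) -> float:
        return self.w * sensed_x

    def distill_step(self, sensed_x: float, teacher_action: float,
                     lr: float = 0.05) -> None:
        # Gradient step on the squared imitation error (w*x - a)^2.
        err = self.act(sensed_x) - teacher_action
        self.w -= lr * err * sensed_x

random.seed(0)
student = SensorStudent()
for _ in range(500):
    x = random.uniform(-5.0, 5.0)          # ground-truth obstacle position
    sensed = x + random.gauss(0.0, 0.1)    # student's imperfect observation
    student.distill_step(sensed, privileged_teacher(x))
# student.w converges near the teacher's gain of -0.5
```

The design point is that supervision comes from the teacher's actions, not from human demonstrations, so the student can be trained on states the expert handles perfectly even when sensing is hard.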
Era 3: Foundation models and VLA systems (2023+)
The current era applies large vision-language models directly to driving. Emma End To End Multimodal Model For Autonomous Driving (EMMA) treats all driving outputs as language tokens, including trajectories. Senna Bridging Large Vision Language Models And End To End Autonomous Driving decouples VLM reasoning from continuous planning. Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation integrates vision-language understanding with action generation in closed-loop. Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving achieves real-time deployment with RL-enhanced reasoning.
This era is also marked by the introduction of RL beyond imitation: Alphadrive Unleashing The Power Of Vlms In Autonomous Driving applies GRPO-style RL to driving VLMs, while Drivemoe Mixture Of Experts For Vision Language Action In Autonomous Driving uses mixture-of-experts to handle the multimodal nature of driving decisions.
AutoVLA (Autovla Vision Language Action Model For End To End Autonomous Driving, 2025) introduces dual-process adaptive reasoning -- dynamically switching between fast direct action and slow chain-of-thought reasoning based on scenario complexity -- with RL fine-tuning on a compact Qwen2.5-VL-3B backbone.
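The routing idea behind dual-process reasoning can be sketched as follows. This is a hedged illustration in the spirit of the approach, not AutoVLA's mechanism: the complexity score, its features, and the threshold are all invented here, whereas the paper learns when to switch (with RL fine-tuning) rather than using a hand-set rule.

```python
# Hypothetical complexity score over a few scene features.
def scene_complexity(n_agents: int, occlusion: float,
                     at_intersection: bool) -> float:
    return 0.1 * n_agents + 0.5 * occlusion + (0.4 if at_intersection else 0.0)

def drive(n_agents: int, occlusion: float, at_intersection: bool) -> str:
    """Route easy scenes to the fast path, hard scenes to the slow path."""
    if scene_complexity(n_agents, occlusion, at_intersection) < 0.6:
        return "fast: direct action tokens"
    return "slow: chain-of-thought, then action tokens"

easy = drive(n_agents=2, occlusion=0.1, at_intersection=False)  # score 0.25
hard = drive(n_agents=5, occlusion=0.4, at_intersection=True)   # score 1.10
```

The payoff of such a split is latency: the expensive chain-of-thought path is paid for only on the minority of scenarios that need it.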
DriveTransformer (Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving, 2025) rethinks the E2E architecture itself: parallel task processing with sparse queries replaces the sequential dense-BEV pipeline, achieving state-of-the-art results on Bench2Drive, with scaling studies indicating that enlarging the decoder matters more than enlarging the backbone.
Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model demonstrates that open-source VLAs with hierarchical 3D queries can match larger models at 0.5B scale. Dima Distilling Multi Modal Large Language Models For Autonomous Driving shows MLLM reasoning can be distilled into efficient vision planners, resolving the efficiency-vs-reasoning tradeoff with 80% collision reduction and zero inference overhead.
World models have also emerged as a key paradigm. Hermes A Unified Self Driving World Model For Simultaneous 3D Scene Understanding And Generation unifies 3D scene understanding and future generation in a single LLM framework. Gaussianworld Gaussian World Model For Streaming 3D Occupancy Prediction reformulates occupancy prediction as world modeling using 3D Gaussians. Momad Momentum Aware Planning In End To End Autonomous Driving addresses temporal inconsistency in E2E planning through momentum-aware trajectory selection.
Benchmarks and evaluation
- nuScenes (Nuscenes A Multimodal Dataset For Autonomous Driving): de facto standard for perception and open-loop planning evaluation.
- CARLA (Carla An Open Urban Driving Simulator): primary closed-loop simulation benchmark. Leaderboard versions (Town05, Longest6, Bench2Drive) test increasingly difficult scenarios.
- NAVSIM (Navsim Data Driven Non Reactive Autonomous Vehicle Simulation): non-reactive simulation benchmark that bridges open-loop and closed-loop evaluation. Its PDM Score achieves 0.7--0.8 correlation with closed-loop metrics while being computationally tractable. Successor Navsim V2 Pseudo Simulation For Autonomous Driving extends this with pseudo-simulation via 3D Gaussian Splatting.
- Open-loop vs closed-loop: A recurring tension. Open-loop metrics (L2 displacement, collision rate on replayed logs) often fail to predict closed-loop competence. The field is converging on closed-loop evaluation as the minimum standard.
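The open-loop metrics mentioned above are simple to compute, which is part of their appeal and their danger. A minimal sketch, with made-up trajectories: average and final L2 displacement between a planned trajectory and the logged human trajectory. Because the plan is never executed, a policy can score well here yet still fail in closed loop once its own actions shift the distribution of states it sees.

```python
import math

def l2_displacement(planned: list, logged: list) -> list:
    """Per-waypoint Euclidean distance between planned and logged positions."""
    return [math.dist(p, g) for p, g in zip(planned, logged)]

# Illustrative 3-waypoint horizon (x, y) in meters.
planned = [(1.0, 0.0), (2.0, 0.1), (3.0, 0.3)]
logged  = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

errs = l2_displacement(planned, logged)
avg_l2 = sum(errs) / len(errs)   # average displacement over the horizon
final_l2 = errs[-1]              # displacement at the final waypoint
```

Closed-loop evaluation replaces this replay comparison with rollouts in a simulator (or pseudo-simulation, as in NAVSIM), so the policy must recover from its own mistakes.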
What makes driving distinct
- Safety-critical operation at high speed with no tolerance for exploration failures
- Severe long-tail distribution: rare events dominate real-world risk
- Multi-agent interaction with partially observable, adversarial participants
- Large train/deploy distribution gap across geographies, weather, and infrastructure
Present state and open problems
- Closed-loop gap: Many state-of-the-art systems still rely primarily on open-loop evaluation. Bridging the open-loop/closed-loop performance gap is the field's most urgent methodological problem.
- Sim-to-real transfer: CARLA results do not reliably predict real-world performance. Better simulators and domain adaptation remain critical.
- Safety certification: No consensus framework exists for certifying learned driving systems.
- Data scaling: Whether scaling driving data follows the same power laws as language modeling is unresolved.
- Interpretability: Regulators and users demand explanations for driving decisions, but most end-to-end systems operate as black boxes.