
Planning

Planning converts scene understanding and future predictions into driving actions. It is the module closest to the physical world and the one where errors are most consequential. The field has evolved from rule-based state machines through learned imitation planners to vision-language-action (VLA) systems that fold planning into a single multimodal model.

Planning hierarchy

Driving planning traditionally operates at multiple levels:

  • Route planning: High-level navigation graph search (A* on road network). Largely solved.
  • Behavior planning: Discrete decisions (lane change, yield, merge, stop). Traditionally rule-based or finite state machines.
  • Trajectory planning: Continuous path generation satisfying kinematic constraints, comfort, and safety. The focus of most learned planning research.
  • Control: Low-level actuation (steering, throttle, brake). Often a PID or MPC controller tracking the planned trajectory.
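The route-planning level is essentially textbook graph search. A minimal sketch, assuming a hypothetical toy road network (nodes, coordinates, and edge weights below are illustrative, not from any real map):

```python
import heapq
import math

# Hypothetical toy road network: nodes are intersections with (x, y)
# coordinates; edges carry road-segment lengths.
nodes = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
edges = {"A": [("B", 1.0)], "B": [("C", 1.0), ("D", 1.5)],
         "C": [("D", 1.0)], "D": []}

def a_star(start, goal):
    def h(n):  # admissible heuristic: straight-line distance to the goal
        (x1, y1), (x2, y2) = nodes[n], nodes[goal]
        return math.hypot(x2 - x1, y2 - y1)

    frontier = [(h(start), 0.0, start, [start])]  # (f, g, node, path)
    settled = {}
    while frontier:
        f, g, n, path = heapq.heappop(frontier)
        if n == goal:
            return path, g
        if settled.get(n, float("inf")) <= g:
            continue
        settled[n] = g
        for nb, w in edges[n]:
            heapq.heappush(frontier, (g + w + h(nb), g + w, nb, path + [nb]))
    return None, float("inf")

path, cost = a_star("A", "D")  # shortest route: A -> B -> D, length 2.5
```

The straight-line heuristic never overestimates road distance, which is what makes A* return the true shortest route here.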

The end-to-end trend collapses these levels. Modern systems often output trajectories or waypoints directly, bypassing explicit behavior planning entirely.
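Even when a learned model outputs waypoints directly, a low-level tracker still closes the loop. A minimal sketch of a PD-style lateral controller, with toy kinematics and assumed gains (not any production tuning):

```python
import math

# Toy model: the planned path is the line y = 0; the controller steers
# based on cross-track error (proportional) and heading error (damping).
def simulate(y0=1.0, kp=2.0, kd=1.0, v=5.0, dt=0.05, steps=400):
    y, heading = y0, 0.0  # lateral offset from the path, heading error
    for _ in range(steps):
        steer_rate = -kp * y - kd * math.sin(heading)
        heading += steer_rate * dt       # simplified: steer rate = heading rate
        y += v * math.sin(heading) * dt  # lateral motion at speed v
    return y

final_offset = simulate()  # decays toward the planned path
```

The damping term on heading is what prevents the controller from oscillating around the path indefinitely.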

Evolution of learned planning

ChauffeurNet and robust imitation (2019)

ChauffeurNet (Learning to Drive by Imitating the Best and Synthesizing the Worst) introduced key ideas for making imitation learning robust for planning: synthesizing perturbations during training, adding a trajectory perturbation loss, and using a mid-level bird's-eye-view representation. ChauffeurNet demonstrated that naive behavior cloning fails due to distributional drift and that data augmentation is essential.
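The perturbation idea can be sketched as follows; this is a simplified stand-in for the paper's actual synthesis, with assumed offsets and horizon. The logged history is shifted off the lane, and the training target eases back onto the logged future, so the model learns recovery behavior it never saw demonstrated:

```python
import random

def perturb_trajectory(past_xy, future_xy, max_offset=0.5, horizon=10):
    # Shift the ego's logged history laterally by a random offset, then
    # synthesize a recovery target that eases back onto the logged future.
    dy = random.uniform(-max_offset, max_offset)
    perturbed_past = [(x, y + dy) for x, y in past_xy]
    recovery = []
    for i, (x, y) in enumerate(future_xy[:horizon]):
        w = 1.0 - (i + 1) / horizon   # offset decays linearly to zero
        recovery.append((x, y + dy * w))
    return perturbed_past, recovery
```

By the end of the horizon the recovery target coincides with the logged trajectory, so the augmentation teaches correction without changing the long-term goal.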

Privileged distillation (2020)

Learning By Cheating formalized the two-stage approach: first train a privileged expert with access to ground-truth state, then distill into a sensorimotor student that operates from raw sensors. This paradigm dominates CARLA benchmarks and provides a principled way to separate the "what to do" problem from the "how to perceive" problem.
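A toy sketch of the two-stage idea, where a hand-written privileged "expert" (it sees the ground-truth gap and lead-vehicle speed) supervises a sensor-side student; the models and data are illustrative assumptions, not the paper's networks:

```python
# Stage 1 stand-in: a privileged expert with access to ground-truth state.
def expert_action(gt_state):
    lead_gap, lead_speed = gt_state
    return min(lead_speed, lead_gap / 2.0)   # crude safe speed command

# Stage 2: distill the expert into a student that sees only a sensor feature.
def train_student(scenes, lr=0.1, epochs=1000):
    w, b = 0.0, 0.0                          # student: linear map on the feature
    for _ in range(epochs):
        for sensor_feat, gt_state in scenes:
            target = expert_action(gt_state)  # distillation target, no human label
            err = w * sensor_feat + b - target
            w -= lr * err * sensor_feat       # squared-loss SGD step
            b -= lr * err
    return w, b

scenes = [(1.0, (4.0, 2.0)), (2.0, (8.0, 4.0))]
w, b = train_student(scenes)
```

The point of the split: stage 2 needs no extra human demonstrations, because the expert can label arbitrarily many scenes on demand.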

Joint planning in end-to-end systems (2023)

Planning-Oriented Autonomous Driving (UniAD) demonstrated that jointly training perception, prediction, and planning with a planning-centric objective yields significantly better planning than modular pipelines. The planner receives features from upstream modules and optimizes a trajectory that is both safe and comfortable. VAD (Vectorized Scene Representation for Efficient Autonomous Driving) extended this with vectorized representations, making the joint system more efficient.
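Mechanically, a planning-centric objective amounts to a weighted multi-task loss in which the planning term dominates, so shared features are shaped for planning rather than for detection benchmarks. The weights below are illustrative assumptions, not UniAD's published values:

```python
# Joint training: every task contributes, but planning carries the largest
# weight. (Weights are illustrative assumptions.)
def joint_loss(losses, weights=None):
    weights = weights or {"det": 0.5, "track": 0.5, "pred": 1.0, "plan": 2.0}
    return sum(weights[k] * v for k, v in losses.items())

total = joint_loss({"det": 0.8, "track": 0.6, "pred": 0.4, "plan": 0.2})
```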

VLA planners (2024–2025)

The current frontier applies vision-language models directly to planning:

  • EMMA represents trajectories as language tokens, so a single multimodal model can emit plans directly.
  • Senna decouples high-level VLM reasoning from a lightweight trajectory planner.
  • ORION generates actions from vision-language instructions within a holistic end-to-end stack.
  • OpenDriveVLA is an open-source VLA that grounds planning in hierarchical 3D scene queries.

RL for planning beyond imitation

A key 2025 development is the application of reinforcement learning to push planning beyond the imitation ceiling. AlphaDrive (Unleashing the Power of VLMs in Autonomous Driving) applies GRPO (Group Relative Policy Optimization) to driving VLMs, showing that RL fine-tuning improves planning in scenarios where the demonstration data contains suboptimal behavior. DriveMoE (Mixture-of-Experts for Vision-Language-Action in Autonomous Driving) uses a mixture-of-experts architecture to handle the multimodal nature of planning decisions, where averaging across modes produces dangerous trajectories.
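The core of GRPO is easy to sketch: sample a group of candidate plans per scene, score each with a driving reward (collisions, progress, comfort), and use the within-group standardized rewards as advantages, with no learned value function. The reward values below are illustrative:

```python
import statistics

# Group-relative advantage: standardize rewards within one group of rollouts
# sampled for the same scene; no critic network is needed.
def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero spread
    return [(r - mean) / std for r in rewards]

adv = group_relative_advantages([1.0, 0.5, 0.0, 0.5])  # one group of 4 rollouts
```

Trajectories better than their group mean get positive advantage and are reinforced; worse-than-average ones are suppressed, which is how RL can exceed the demonstrations.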

The open-loop vs. closed-loop debate

This is the field's most important evaluation question. Open-loop planning evaluation replays logged scenarios and measures trajectory displacement error against the human driver's actual trajectory. Closed-loop evaluation places the planner in simulation where its actions affect the scene.

The problem with open-loop evaluation: a planner that outputs the average trajectory (safe but passive) scores well on displacement metrics but drives terribly in closed-loop, where it must make decisive lane changes and assertive merges. Conversely, a planner that makes one aggressive-but-correct maneuver may score poorly on open-loop metrics because it deviates from the logged trajectory.
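A toy 1-D example makes the failure concrete (lateral positions over four future timesteps; numbers are illustrative):

```python
# Average displacement error (ADE) against the logged human trajectory.
def ade(pred, gt):
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

logged   = [0.0, 0.0, 0.0, 0.0]   # the human happened to stay in lane
decisive = [0.0, 0.5, 1.0, 1.5]   # correct, assertive lane change
averaged = [0.0, 0.1, 0.2, 0.3]   # blend of "change" and "stay" modes

passive_score = ade(averaged, logged)   # lower: looks better open-loop
decisive_score = ade(decisive, logged)  # higher: penalized for deviating
```

The mode-averaged plan wins on displacement even though, in closed-loop, indecisively hovering between lanes is the more dangerous behavior.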

CARLA (An Open Urban Driving Simulator) provides the primary closed-loop benchmark. Papers evaluated only on open-loop nuScenes metrics should be interpreted with significant caution. The field is converging on the position that closed-loop evaluation is a minimum requirement for planning claims.

Present state and open problems

  • Imitation ceiling: Behavior cloning cannot exceed the quality of demonstration data. RL offers a path beyond this ceiling but introduces instability and reward design challenges.
  • Safety guarantees: No learned planner provides formal safety guarantees. Combining learned planning with rule-based safety layers (responsibility-sensitive safety, control barrier functions) is an active area.
  • Comfort and naturalness: Planning metrics focus on safety and progress but rarely measure comfort, smoothness, or human-likeness, which matter for passenger acceptance.
  • Rare scenarios: Planners trained on normal driving fail on edge cases (emergency braking, construction zones, adversarial agents). How to ensure coverage of the long tail is unresolved.
  • Interpretability: Regulators increasingly demand that planning decisions be explainable. The decoupled approach (Senna) offers one path; whether it sacrifices performance is debated.
  • Multi-agent game theory: Planning in dense traffic is a multi-agent game. Most planners treat other agents as independent obstacles rather than strategic actors.
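The rule-based safety layer mentioned above can be sketched with a simple control barrier function. This is a minimal sketch under assumed point-mass dynamics and parameters; real systems typically solve a small QP for the closest safe command rather than falling back to full braking:

```python
# CBF h(x) = gap - v^2 / (2 * a_max): positive when the ego can still stop
# before a (static) obstacle. The learned acceleration passes through only
# if it keeps h from decaying faster than the barrier condition allows.
def safety_filter(gap, v, a_learned, a_max=6.0, alpha=1.0, dt=0.1):
    h = gap - v * v / (2.0 * a_max)

    def h_next(a):  # forward-simulate one step under command a
        v2 = max(0.0, v + a * dt)
        gap2 = gap - v * dt
        return gap2 - v2 * v2 / (2.0 * a_max)

    if h_next(a_learned) >= (1.0 - alpha * dt) * h:
        return a_learned   # learned command satisfies the barrier condition
    return -a_max          # otherwise override with maximum braking

safe_a = safety_filter(gap=5.0, v=10.0, a_learned=2.0)  # unsafe: overridden
```

The learned planner drives normally almost everywhere; the filter only intervenes on the boundary of the safe set, which is what makes the combination attractive to regulators.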

Key papers

  • ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst. Robust imitation learning with data augmentation.
  • Learning by Cheating. Privileged expert distillation for planning.
  • Planning-Oriented Autonomous Driving (UniAD). Planning-centric joint training.
  • VAD: Vectorized Scene Representation for Efficient Autonomous Driving. Efficient vectorized planning.
  • VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning. Probabilistic planning via a discrete action vocabulary.
  • ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation. Vision-language-instructed action generation.
  • Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving. Decoupled VLM reasoning plus lightweight planner.
  • EMMA: End-to-End Multimodal Model for Autonomous Driving. Trajectories as language tokens.
  • Alpamayo-R1: Bridging Reasoning and Action Prediction for Autonomous Driving. Real-time VLA planning with RL.
  • WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model. World-model trajectory verification.
  • AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving. GRPO-based RL for driving VLMs.
  • DriveMoE: Mixture-of-Experts for Vision-Language-Action in Autonomous Driving. MoE for multimodal planning.
  • MomAD: Momentum-Aware Planning in End-to-End Autonomous Driving. Momentum-aware temporal consistency for E2E planning.
  • OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision-Language-Action Model. Open-source VLA with hierarchical 3D scene queries.
  • DiMA: Distilling Multi-Modal Large Language Models for Autonomous Driving. MLLM-to-planner distillation for efficient planning.
  • AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving. Adaptive-reasoning VLA with RL fine-tuning.
  • DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving. Parallel-task planning with GMM multi-mode output.
  • S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation. Self-supervised MLLM planning without annotations.
  • BridgeAD (Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction). Multi-step temporal queries for history-enhanced planning.
  • Drive-OccWorld: Driving in the Occupancy World. Occupancy world model for planning trajectory evaluation.
  • CARLA: An Open Urban Driving Simulator. Primary closed-loop evaluation benchmark.
  • A Language Agent for Autonomous Driving. LLM cognitive agent with tool use, memory, and chain-of-thought for planning.
  • AsyncDriver: Asynchronous Large Language Model Enhanced Planner for Autonomous Driving. Asynchronous LLM-planner decoupling for real-time driving.
  • OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving. Original occupancy world model for joint scene-ego prediction.
  • SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation. Sparse parallel prediction-planning with safety-aware selection.
  • SparseDrive-V2: End-to-End Autonomous Driving via Sparse Scene Representation. Factorized trajectory-vocabulary scoring; 92.0 PDMS state of the art.
  • NAVSIM v2: Pseudo-Simulation for Autonomous Driving. Pseudo-simulation benchmark; R^2 = 0.8 agreement with closed-loop results.
  • Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving. Cascaded decoder with iterative trajectory refinement and scalable decoder depth.
  • DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving. Decoupled perception-planning via adapters; reuses a frozen privileged planner.
  • LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. LLMs cannot plan autonomously; the LLM-Modulo framework adds external verification for sound planning.