Planning

Planning converts scene understanding and future predictions into driving actions. It is the module closest to the physical world and the one where errors are most consequential. The field has evolved from rule-based state machines through learned imitation planners to VLA systems that fold planning into a single multimodal model.

Planning hierarchy

Driving planning traditionally operates at multiple levels:

Route planning: High-level navigation via graph search (for example, A* on the road network). Largely solved.
Behavior planning: Discrete decisions (lane change, yield, merge, stop). Traditionally rule-based or finite state machines.
Trajectory planning: Continuous path generation satisfying kinematic constraints, comfort, and safety. The focus of most learned planning research.
Control: Low-level actuation (steering, throttle, brake). Often a PID or MPC controller tracking the planned trajectory.
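
As a concrete picture of the route-planning level, the sketch below runs A* over a toy road graph. The node names, coordinates, and connectivity are invented for illustration; a production router would work on a real map with travel-time costs and turn restrictions.

```python
import heapq
import math

# Hypothetical toy road network: node -> (x, y) position in metres.
NODES = {"A": (0, 0), "B": (100, 0), "C": (100, 100), "D": (0, 150), "E": (200, 100)}
# Drivable connections, listed in both directions.
EDGES = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "D", "E"],
    "D": ["A", "C"],
    "E": ["C"],
}


def dist(a, b):
    """Straight-line distance between two nodes (edge cost and heuristic)."""
    (xa, ya), (xb, yb) = NODES[a], NODES[b]
    return math.hypot(xb - xa, yb - ya)


def a_star(start, goal):
    """Return (path, length) for the shortest route, or (None, inf) if unreachable."""
    # Frontier entries are (f = g + h, g, node, path so far).
    frontier = [(dist(start, goal), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, math.inf):
            continue  # stale entry: a cheaper way to this node was found later
        for nxt in EDGES[node]:
            g_next = g + dist(node, nxt)
            if g_next < best_g.get(nxt, math.inf):
                best_g[nxt] = g_next
                heapq.heappush(
                    frontier, (g_next + dist(nxt, goal), g_next, nxt, path + [nxt])
                )
    return None, math.inf


route, length = a_star("A", "E")
```

Because straight-line distance never overestimates the remaining road distance, the heuristic is admissible and the returned route is the shortest one in this graph.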

The end-to-end trend collapses these levels. Modern systems often output trajectories or waypoints directly, bypassing explicit behavior planning entirely.

Evolution of learned planning

ChauffeurNet and robust imitation (2019)

Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst introduced key ideas for making imitation learning robust for planning: synthesizing perturbed trajectories during training, adding auxiliary losses that penalize undesirable events such as collisions and off-road driving, and using a mid-level bird's-eye-view representation. ChauffeurNet demonstrated that naive behavior cloning fails under distribution shift, because small errors compound and carry the policy into states that never appear in the demonstrations, and that data augmentation is essential.
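
The perturbation idea can be sketched as follows: displace one waypoint of a logged trajectory sideways and blend the displacement smoothly back to zero, so the training example shows the vehicle off its nominal path and recovering. This is a simplified illustration in the spirit of the approach, not the paper's procedure; the function name, the cosine blend, and the assumption that the trajectory runs roughly along the x axis are all choices made for this example.

```python
import math


def perturb_trajectory(traj, k, lateral_offset):
    """Shift waypoint k sideways and taper the shift to zero at both ends.

    traj is a list of (x, y) waypoints assumed to run roughly along +x, so
    "lateral" is taken to mean the y axis (a simplification for this sketch).
    """
    n = len(traj)
    if not 0 < k < n - 1:
        raise ValueError("k must be an interior waypoint index")
    out = []
    for i, (x, y) in enumerate(traj):
        if i <= k:
            # Cosine ramp from 0 at the first waypoint up to 1 at waypoint k.
            w = 0.5 * (1 - math.cos(math.pi * i / k))
        else:
            # Cosine ramp from 1 at waypoint k back down to 0 at the last waypoint.
            w = 0.5 * (1 + math.cos(math.pi * (i - k) / (n - 1 - k)))
        out.append((x, y + lateral_offset * w))
    return out


logged = [(float(i), 0.0) for i in range(11)]  # straight 10 m segment
perturbed = perturb_trajectory(logged, k=5, lateral_offset=0.5)
```

The endpoints stay fixed, so the perturbed example still starts and ends where the human driver did while exposing the learner to a state the log never visited.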

Privileged distillation (2020)

Learning By Cheating formalized the two-stage approach: first train a privileged expert with access to ground-truth state, then distill into a sensorimotor student that operates from raw sensors. This paradigm dominates CARLA benchmarks and provides a principled way to separate the "what to do" problem from the "how to perceive" problem.
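
A deliberately tiny sketch of the two-stage structure is given below. Every detail is a stand-in: the expert is a hand-written rule rather than a trained network, the "sensor" is a made-up affine distortion of the true state, and the student is a one-dimensional least-squares fit. The point is only the data flow, in which the student never sees ground-truth state and learns from the expert's actions alone.

```python
def privileged_expert(lateral_offset_m):
    """Stage 1 stand-in: a policy that reads ground-truth state directly."""
    return -0.5 * lateral_offset_m  # steer back toward the lane centre


def sensor_reading(lateral_offset_m):
    """Hypothetical sensor: the student sees the state only through this distortion."""
    return 2.0 * lateral_offset_m + 1.0


def fit_student(observations, expert_actions):
    """Stage 2: least-squares fit of action = w * observation + b to expert labels."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(expert_actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, expert_actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b


# The simulator knows the true states; the expert labels them, the student sees sensors.
states = [-2.0, -1.0, 0.0, 1.0, 2.0]
observations = [sensor_reading(s) for s in states]
labels = [privileged_expert(s) for s in states]

w, b = fit_student(observations, labels)
student_action = w * sensor_reading(0.5) + b  # act from the sensor reading only
```

Because the expert can be queried in any state the simulator produces, the student can be supervised on its own mistakes, which is harder to arrange with a fixed set of human demonstrations.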

Joint planning in end-to-end systems (2023)

Planning Oriented Autonomous Driving (UniAD) demonstrated that jointly training perception, prediction, and planning with a planning-centric objective yields significantly better planning than modular pipelines. The planner receives features from upstream modules and optimizes a trajectory that is both safe and comfortable. Vad Vectorized Scene Representation For Efficient Autonomous Driving extended this with vectorized representations, making the joint system more efficient.

VLA planners (2024-2025)

The current frontier applies vision-language models directly to planning:

Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation integrates language understanding with action generation, using planning tokens that bridge VLM reasoning and continuous trajectory output. It is evaluated in closed loop.
Senna Bridging Large Vision Language Models And End To End Autonomous Driving decouples VLM reasoning from the planner: the VLM produces human-readable scene descriptions and driving rationale, which a separate lightweight planner converts to trajectories. This preserves interpretability while leveraging VLM knowledge.
Emma End To End Multimodal Model For Autonomous Driving takes the maximalist approach, representing trajectories as language tokens and training a single VLM to produce them directly.
Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving achieves real-time VLA planning (99 ms latency) with RL-enhanced reasoning, demonstrating that VLA planners can meet deployment constraints.
Wote End To End Driving With Online Trajectory Evaluation Via Bev World Model complements VLA planning with a BEV world model that evaluates candidate trajectories for physical plausibility and safety.
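
To make the "trajectories as language tokens" idea concrete, the sketch below serializes waypoints into plain text that a language model could emit and parses them back. The string format, the two-decimal precision, and the function names are assumptions made for illustration, not the format used by any of the systems above.

```python
import re


def encode_waypoints(waypoints, decimals=2):
    """Render (x, y) waypoints in metres as text, e.g. "(1.23, 0.00) (2.50, -0.13)"."""
    return " ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in waypoints)


# One "(x, y)" pair: optional sign, digits, optional fractional part.
_PAIR = re.compile(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)")


def decode_waypoints(text):
    """Parse every "(x, y)" pair out of generated text, ignoring anything else."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


plan = [(1.234, 0.0), (2.5, -0.126)]
text = encode_waypoints(plan)
recovered = decode_waypoints("The ego vehicle should follow: " + text)
```

The round trip is lossy by design: the text precision sets the spatial resolution of the plan, and more decimals cost more tokens per waypoint.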

RL for planning beyond imitation

A key 2025 development is the application of reinforcement learning to push planning beyond the imitation ceiling. Alphadrive Unleashing The Power Of Vlms In Autonomous Driving applies GRPO (Group Relative Policy Optimization) to driving VLMs, showing that RL fine-tuning improves planning in scenarios where the demonstration data contains suboptimal behavior. Drivemoe Mixture Of Experts For Vision Language Action In Autonomous Driving uses mixture-of-experts to handle the multimodal nature of planning decisions, where averaging across modes produces dangerous trajectories.
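
The group-relative step of GRPO can be illustrated in a few lines: sample several candidate plans for the same scene, score each one, and normalize the rewards within that group, so no learned value function is needed as a baseline. The reward values below are invented, and the use of the population standard deviation and the small epsilon are choices of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards the all-equal case
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled plans in one scene
# (for example, 1.0 = safe and makes progress, 0.0 = collision).
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Plans that beat their group's average get a positive advantage and are reinforced, even when no sample matches the logged human trajectory, which is what lets this kind of fine-tuning move past suboptimal demonstrations.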

The open-loop vs. closed-loop debate

This is the field's most important evaluation question. Open-loop planning evaluation replays logged scenarios and measures trajectory displacement error against the human driver's actual trajectory. Closed-loop evaluation places the planner in simulation where its actions affect the scene.

The problem with open-loop evaluation: a planner that outputs the average trajectory (safe but passive) scores well on displacement metrics but drives poorly in closed loop, where it must make decisive lane changes and assertive merges. Conversely, a planner that makes one aggressive-but-correct maneuver may score poorly on open-loop metrics because it deviates from the logged trajectory.
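
The displacement metrics in question are simple to compute, which is part of their appeal. The sketch below computes average and final displacement error (ADE and FDE) for two hypothetical plans against one logged trajectory; all waypoints are invented for illustration.

```python
import math


def displacement_errors(planned, logged):
    """Return (ADE, FDE): mean and final Euclidean distance to the logged waypoints."""
    if len(planned) != len(logged):
        raise ValueError("trajectories must have the same number of waypoints")
    dists = [math.dist(p, q) for p, q in zip(planned, logged)]
    return sum(dists) / len(dists), dists[-1]


# The human driver stayed behind a slow lead vehicle (x, y in metres).
logged = [(4.0, 0.0), (8.0, 0.0), (12.0, 0.0), (16.0, 0.0)]
# Plan 1 copies the log almost exactly.
follow = [(4.0, 0.0), (8.0, 0.0), (12.0, 0.0), (15.0, 0.0)]
# Plan 2 changes lanes and overtakes, which may be perfectly valid driving.
overtake = [(5.0, 1.5), (10.0, 3.5), (15.0, 3.5), (20.0, 3.5)]

ade_follow, fde_follow = displacement_errors(follow, logged)
ade_overtake, fde_overtake = displacement_errors(overtake, logged)
```

The overtaking plan is penalized purely for differing from what the human happened to do. Nothing in the metric checks whether either plan is safe or makes progress once other agents react to it.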

Carla An Open Urban Driving Simulator provides the primary closed-loop benchmark. Papers evaluated only on open-loop nuScenes metrics should be interpreted with significant caution. The field is converging on the position that closed-loop evaluation is a minimum requirement for planning claims.

Present state and open problems

Imitation ceiling: Behavior cloning cannot exceed the quality of demonstration data. RL offers a path beyond this ceiling but introduces instability and reward design challenges.
Safety guarantees: No learned planner provides formal safety guarantees. Combining learned planning with rule-based safety layers (responsibility-sensitive safety, control barrier functions) is an active area.
Comfort and naturalness: Planning metrics focus on safety and progress but rarely measure comfort, smoothness, or human-likeness, which matter for passenger acceptance.
Rare scenarios: Planners trained on normal driving fail on edge cases (emergency braking, construction zones, adversarial agents). How to ensure coverage of the long tail is unresolved.
Interpretability: Regulators increasingly demand that planning decisions be explainable. The decoupled approach (Senna) offers one path; whether it sacrifices performance is debated.
Multi-agent game theory: Planning in dense traffic is a multi-agent game. Most planners treat other agents as independent obstacles rather than strategic actors.
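
As an illustration of the rule-based safety layers mentioned under safety guarantees, the sketch below implements the responsibility-sensitive safety (RSS) minimum longitudinal gap for two vehicles driving in the same direction and uses it to override a planner's acceleration command. The parameter values and the `safety_filter` wrapper are assumptions of this example; a real layer would also cover lateral distance and other situations.

```python
def rss_min_gap(v_rear, v_front, rho, a_max_accel, a_min_brake, a_max_brake):
    """RSS minimum safe longitudinal gap in metres (speeds in m/s, accelerations in m/s^2).

    Worst case: the rear car accelerates at a_max_accel during its response time rho,
    then brakes at only a_min_brake, while the front car brakes at a_max_brake.
    """
    v_rear_after = v_rear + rho * a_max_accel
    gap = (
        v_rear * rho
        + 0.5 * a_max_accel * rho ** 2
        + v_rear_after ** 2 / (2 * a_min_brake)
        - v_front ** 2 / (2 * a_max_brake)
    )
    return max(gap, 0.0)


def safety_filter(planned_accel, gap, v_rear, v_front, **params):
    """Pass the learned planner's command through unless the RSS gap is violated."""
    if gap < rss_min_gap(v_rear, v_front, **params):
        return -params["a_min_brake"]  # override: brake at the assumed minimum rate
    return planned_accel


# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
PARAMS = dict(rho=1.0, a_max_accel=2.0, a_min_brake=4.0, a_max_brake=8.0)

required = rss_min_gap(20.0, 20.0, **PARAMS)
too_close = safety_filter(1.0, gap=30.0, v_rear=20.0, v_front=20.0, **PARAMS)
far_enough = safety_filter(1.0, gap=80.0, v_rear=20.0, v_front=20.0, **PARAMS)
```

With both vehicles at 20 m/s, the required gap under these parameters is 20 + 1 + 60.5 - 25 = 56.5 m, so the command is overridden at a 30 m gap and passed through at 80 m. The learned planner stays unconstrained inside the safe set, and the rule only takes over at the boundary.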

Key papers

Paper	Contribution
Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst	Robust imitation learning with data augmentation
Learning By Cheating	Privileged expert distillation for planning
Planning Oriented Autonomous Driving	UniAD: planning-centric joint training
Vad Vectorized Scene Representation For Efficient Autonomous Driving	Efficient vectorized planning
Vadv2 End To End Vectorized Autonomous Driving Via Probabilistic Planning	Probabilistic planning via discrete action vocabulary
Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation	Vision-language-instructed action generation
Senna Bridging Large Vision Language Models And End To End Autonomous Driving	Decoupled VLM reasoning + lightweight planner
Emma End To End Multimodal Model For Autonomous Driving	Trajectories as language tokens
Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving	Real-time VLA planning with RL
Wote End To End Driving With Online Trajectory Evaluation Via Bev World Model	World-model trajectory verification
Alphadrive Unleashing The Power Of Vlms In Autonomous Driving	GRPO-based RL for driving VLMs
Drivemoe Mixture Of Experts For Vision Language Action In Autonomous Driving	MoE for multimodal planning
Momad Momentum Aware Planning In End To End Autonomous Driving	Momentum-aware temporal consistency for E2E planning
Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model	Open-source VLA with hierarchical 3D scene queries
Dima Distilling Multi Modal Large Language Models For Autonomous Driving	MLLM-to-planner distillation for efficient planning
Autovala Vision Language Action Model For End To End Autonomous Driving	Adaptive reasoning VLA with RL fine-tuning
Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving	Parallel-task planning with GMM multi-mode
S4 Driver Scalable Self Supervised Driving Mllm With Spatio Temporal Visual Representation	Self-supervised MLLM planning without annotations
Bridgead Bridging Past And Future End To End Autonomous Driving With Historical Prediction	Multi-step temporal queries for history-enhanced planning
Drive Occworld Driving In The Occupancy World	Occupancy world model for planning trajectory evaluation
Carla An Open Urban Driving Simulator	Primary closed-loop evaluation benchmark
A Language Agent For Autonomous Driving	LLM cognitive agent with tool use, memory, and chain-of-thought for planning
Asyncdriver Asynchronous Large Language Model Enhanced Planner For Autonomous Driving	Asynchronous LLM-planner decoupling for real-time driving
Occworld Learning A 3D Occupancy World Model For Autonomous Driving	Original occupancy world model for joint scene-ego prediction
Sparsedrive End To End Autonomous Driving Via Sparse Scene Representation	Sparse parallel prediction-planning with safety-aware selection
Sparsedrivev2 End To End Autonomous Driving Via Sparse Scene Representation	Factorized trajectory vocabulary scoring, 92.0 PDMS SOTA
Navsim V2 Pseudo Simulation For Autonomous Driving	Pseudo-simulation benchmark, R^2=0.8 with closed-loop
Think Twice Before Driving Towards Scalable Decoders For End To End Autonomous Driving	Cascaded decoder with iterative trajectory refinement, scalable decoder depth
Driveadapter Breaking The Coupling Barrier Of Perception And Planning In End To End Autonomous Driving	Decoupled perception-planning via adapter, reuses frozen privileged planner
Llms Cant Plan But Can Help Planning In Llm Modulo Frameworks	LLMs cannot plan autonomously; LLM-Modulo framework with external verification for sound planning