Planning
Planning converts scene understanding and future predictions into driving actions. It is the module closest to the physical world and the one where errors are most consequential. The field has evolved from rule-based state machines through learned imitation planners to VLA systems that fold planning into a single multimodal model.
Planning hierarchy
Driving planning traditionally operates at multiple levels:
- Route planning: High-level navigation via graph search (for example, A* on the road network). Largely solved.
- Behavior planning: Discrete decisions (lane change, yield, merge, stop). Traditionally rule-based or finite state machines.
- Trajectory planning: Continuous path generation satisfying kinematic constraints, comfort, and safety. The focus of most learned planning research.
- Control: Low-level actuation (steering, throttle, brake). Often a PID or MPC controller tracking the planned trajectory.
The end-to-end trend collapses these levels. Modern systems often output trajectories or waypoints directly, bypassing explicit behavior planning entirely.
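As a concrete illustration of the top level of this hierarchy, the following is a minimal sketch of A* route search. The road network, node names, coordinates, and segment lengths are all hypothetical values invented for the example, and the straight-line distance heuristic is only admissible here because every segment is at least as long as the straight-line gap it spans.

```python
import heapq
import math

# Hypothetical toy road network: node -> (x, y) position in metres.
NODES = {
    "A": (0.0, 0.0),
    "B": (100.0, 0.0),
    "C": (100.0, 100.0),
    "D": (0.0, 100.0),
    "E": (200.0, 100.0),
}
# Directed road segments with driving distance in metres (illustrative values).
EDGES = {
    "A": [("B", 100.0), ("D", 100.0)],
    "B": [("C", 100.0)],
    "C": [("E", 100.0)],
    "D": [("C", 150.0)],  # a winding road, longer than the straight-line gap
    "E": [],
}


def straight_line(a, b):
    """Euclidean distance between two nodes, used as the A* heuristic."""
    (ax, ay), (bx, by) = NODES[a], NODES[b]
    return math.hypot(bx - ax, by - ay)


def a_star(start, goal):
    """Return (cost, path) for the cheapest route, or (inf, []) if unreachable."""
    # Heap entries: (cost so far + heuristic, cost so far, node, path taken).
    frontier = [(straight_line(start, goal), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, math.inf):
            continue  # stale heap entry; a cheaper way to this node was found
        for nxt, length in EDGES[node]:
            new_cost = cost + length
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                priority = new_cost + straight_line(nxt, goal)
                heapq.heappush(frontier, (priority, new_cost, nxt, path + [nxt]))
    return math.inf, []


print(a_star("A", "E"))  # (300.0, ['A', 'B', 'C', 'E'])
```

Production route planners work on far larger graphs and typically weight edges by travel time rather than distance; the search structure stays the same.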
Evolution of learned planning
ChauffeurNet and robust imitation (2019)
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst introduced key ideas for making imitation learning robust for planning: synthesizing perturbations of the expert's trajectory during training, augmenting the imitation loss with additional losses that penalize undesirable events such as collisions and leaving the road, and using a mid-level bird's-eye-view representation. ChauffeurNet demonstrated that naive behavior cloning fails under distribution shift (the learner drifts into states the expert never visited) and that this kind of data augmentation is essential.
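The perturbation idea can be sketched in a few lines. This is not the paper's implementation: the function name `perturb_trajectory`, the raised-cosine taper, and all parameter values are assumptions chosen for illustration. The sketch displaces one point of a logged trajectory sideways and blends the displacement back to zero, so the training example shows the vehicle off its nominal path and recovering toward it.

```python
import math


def perturb_trajectory(points, index, lateral_offset, window):
    """Shift points[index] sideways by lateral_offset metres and taper the shift
    to zero over `window` neighbours on each side.

    Assumes consecutive points are distinct, so a local heading is defined.
    """
    out = []
    last = len(points) - 1
    for i, (x, y) in enumerate(points):
        d = abs(i - index)
        if d >= window:
            out.append((x, y))  # outside the window: keep the expert point
            continue
        # Raised-cosine weight: 1 at the perturbed point, 0 at the window edge.
        w = 0.5 * (1.0 + math.cos(math.pi * d / window))
        # Local heading from neighbouring points (clamped at the ends).
        x0, y0 = points[max(i - 1, 0)]
        x1, y1 = points[min(i + 1, last)]
        norm = math.hypot(x1 - x0, y1 - y0)
        # Unit normal pointing to the left of the direction of travel.
        nx, ny = -(y1 - y0) / norm, (x1 - x0) / norm
        out.append((x + w * lateral_offset * nx, y + w * lateral_offset * ny))
    return out


expert = [(float(i), 0.0) for i in range(11)]  # straight drive along +x
perturbed = perturb_trajectory(expert, index=5, lateral_offset=1.0, window=4)
```

In this example the middle point moves one metre to the left while the points four or more steps away are unchanged, which gives the learner a recovery example that the logged data alone would not contain.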
Privileged distillation (2020)
Learning by Cheating formalized the two-stage approach: first train a privileged expert with access to ground-truth state, then distill into a sensorimotor student that operates from raw sensors. This paradigm dominates CARLA benchmarks and provides a principled way to separate the "what to do" problem from the "how to perceive" problem.
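A deliberately tiny sketch of the two-stage structure follows. Everything in it is a stand-in: `privileged_expert`, `sensor_reading`, and the linear student are hypothetical and bear no relation to the networks used in the paper. The point is only the data flow: the expert acts on ground-truth state, and the student is fit by supervised learning to reproduce the expert's actions from sensor observations alone.

```python
def privileged_expert(gap_m):
    """Hypothetical expert: target speed (m/s) from the true gap to the lead vehicle."""
    return 0.5 * gap_m


def sensor_reading(gap_m):
    """Hypothetical raw sensor: an uncalibrated reading (centimetres plus an offset)."""
    return 100.0 * gap_m + 30.0


def fit_student(observations, expert_actions):
    """Closed-form 1-D least squares: action ~= slope * observation + intercept."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(expert_actions) / n
    cov = sum((o - mean_o) * (a - mean_a)
              for o, a in zip(observations, expert_actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    slope = cov / var
    intercept = mean_a - slope * mean_o
    return lambda obs: slope * obs + intercept


# Stage 1 is assumed done: privileged_expert already "knows" how to act.
# Stage 2: label sensor observations with the expert's actions and fit the student.
gaps = [5.0 * k for k in range(1, 11)]
student = fit_student([sensor_reading(g) for g in gaps],
                      [privileged_expert(g) for g in gaps])

# The student never sees the true gap, only the sensor reading.
print(round(student(sensor_reading(20.0)), 6))  # 10.0
```

The student here recovers the expert exactly because both mappings are linear; with real sensors the student is a deep network and the match is only approximate.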
Joint planning in end-to-end systems (2023)
Planning-oriented Autonomous Driving (UniAD) demonstrated that jointly training perception, prediction, and planning with a planning-centric objective yields significantly better planning than modular pipelines. The planner receives features from upstream modules and optimizes a trajectory that is both safe and comfortable. VAD: Vectorized Scene Representation for Efficient Autonomous Driving extended this with vectorized representations, making the joint system more efficient.
VLA planners (2024--2025)
The current frontier applies vision-language models directly to planning:
- ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation integrates language understanding with action generation, using planning tokens that bridge VLM reasoning and continuous trajectory output. It is evaluated in closed loop.
- Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving decouples VLM reasoning from the planner: the VLM produces human-readable scene descriptions and driving rationale, which a separate lightweight planner converts to trajectories. This preserves interpretability while leveraging VLM knowledge.
- EMMA: End-to-End Multimodal Model for Autonomous Driving takes the maximalist approach, representing trajectories as language tokens and training a single VLM to produce them directly.
- Alpamayo-R1: Bridging Reasoning and Action Prediction for Autonomous Driving achieves real-time VLA planning (99 ms latency) with RL-enhanced reasoning, demonstrating that VLA planners can meet deployment constraints.
- WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model complements VLA planning with a BEV world model that evaluates candidate trajectories for physical plausibility and safety.
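The idea of evaluating candidate trajectories online, mentioned in the last item above, can be illustrated with a hand-written scoring function. This sketch is an assumption-laden stand-in: the systems above use learned models to predict and score outcomes, whereas `score` here is a hypothetical fixed rule (forward progress minus a penalty for passing close to an obstacle), and the candidate names, obstacle position, and thresholds are invented.

```python
import math


def score(trajectory, obstacles, safe_radius=2.0):
    """Higher is better: forward progress minus a large penalty per near-collision."""
    progress = trajectory[-1][0] - trajectory[0][0]
    penalty = 0.0
    for x, y in trajectory:
        for ox, oy in obstacles:
            if math.hypot(x - ox, y - oy) < safe_radius:
                penalty += 100.0
    return progress - penalty


def select_best(candidates, obstacles):
    """Return the name of the highest-scoring candidate trajectory."""
    return max(candidates, key=lambda name: score(candidates[name], obstacles))


candidates = {
    "keep_lane": [(float(x), 0.0) for x in range(0, 21, 5)],
    "nudge_left": [(0.0, 0.0), (5.0, 1.0), (10.0, 3.0), (15.0, 3.0), (20.0, 3.0)],
    "brake": [(0.0, 0.0), (3.0, 0.0), (5.0, 0.0), (6.0, 0.0), (6.5, 0.0)],
}
obstacles = [(10.0, 0.0)]  # a static obstacle in the ego lane

print(select_best(candidates, obstacles))  # nudge_left
```

Keeping the proposal step and the evaluation step separate means the evaluator can veto a physically implausible or unsafe candidate regardless of how it was generated.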
RL for planning beyond imitation
A key 2025 development is the application of reinforcement learning to push planning beyond the imitation ceiling. AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving applies GRPO (Group Relative Policy Optimization) to driving VLMs, showing that RL fine-tuning improves planning in scenarios where the demonstration data contains suboptimal behavior. DriveMoE: Mixture of Experts for Vision-Language-Action in Autonomous Driving uses mixture-of-experts to handle the multi-modal nature of planning decisions: a scene often admits several distinct valid maneuvers (for example, passing an obstacle on the left or on the right), and averaging across these modes produces dangerous trajectories.
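The "group relative" part of GRPO refers to how advantages are computed: several outputs are sampled for the same input, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group, so no learned value function is needed. The sketch below shows only that normalization step. The reward values are made up, and the use of the population standard deviation and the `eps` value are assumptions; implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four plans sampled for the same scene, scored by a hypothetical planning reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Samples that beat their group's average get a positive advantage and are reinforced; when every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient.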
The open-loop vs. closed-loop debate
This is the field's most important evaluation question. Open-loop planning evaluation replays logged scenarios and measures trajectory displacement error against the human driver's actual trajectory. Closed-loop evaluation places the planner in simulation where its actions affect the scene.
The problem with open-loop evaluation: a planner that outputs the average trajectory (safe but passive) scores well on displacement metrics but drives terribly in closed-loop, where it must make decisive lane changes and assertive merges. Conversely, a planner that makes one aggressive-but-correct maneuver may score poorly on open-loop metrics because it deviates from the logged trajectory.
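The displacement metrics in question are usually reported as average displacement error (ADE) and final displacement error (FDE). The following sketch computes both for two hypothetical plans against a hypothetical logged trajectory; all waypoints are invented for illustration.

```python
import math


def displacement_errors(predicted, logged):
    """Return (ADE, FDE): mean and final Euclidean distance between matched waypoints."""
    if len(predicted) != len(logged) or not logged:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.hypot(px - lx, py - ly)
             for (px, py), (lx, ly) in zip(predicted, logged)]
    return sum(dists) / len(dists), dists[-1]


# Logged human drive: straight ahead at constant speed.
logged = [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0), (20.0, 0.0)]
# A plan that stays in lane but drives slower than the human did.
slower = [(4.0, 0.0), (8.0, 0.0), (12.0, 0.0), (16.0, 0.0)]
# A plan that makes a lane change the human did not make.
lane_change = [(5.0, 0.0), (10.0, 1.5), (15.0, 3.0), (20.0, 3.0)]

print(displacement_errors(slower, logged))       # (2.5, 4.0)
print(displacement_errors(lane_change, logged))  # (1.875, 3.0)
```

Both plans are penalized purely for differing from the log. The metric carries no information about whether the lane change was safe or whether the slower plan would have blocked traffic, which is the gap closed-loop evaluation is meant to fill.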
CARLA: An Open Urban Driving Simulator provides the primary closed-loop benchmark. Papers evaluated only on open-loop nuScenes metrics should be interpreted with significant caution. The field is converging on the position that closed-loop evaluation is a minimum requirement for planning claims.
Present state and open problems
- Imitation ceiling: Behavior cloning cannot exceed the quality of demonstration data. RL offers a path beyond this ceiling but introduces instability and reward design challenges.
- Safety guarantees: No learned planner provides formal safety guarantees. Combining learned planning with rule-based safety layers (responsibility-sensitive safety, control barrier functions) is an active area.
- Comfort and naturalness: Planning metrics focus on safety and progress but rarely measure comfort, smoothness, or human-likeness, which matter for passenger acceptance.
- Rare scenarios: Planners trained on normal driving fail on edge cases (emergency braking, construction zones, adversarial agents). How to ensure coverage of the long tail is unresolved.
- Interpretability: Regulators increasingly demand that planning decisions be explainable. The decoupled approach (Senna) offers one path; whether it sacrifices performance is debated.
- Multi-agent game theory: Planning in dense traffic is a multi-agent game. Most planners treat other agents as independent obstacles rather than strategic actors.
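To make the rule-based safety layer from the safety guarantees item concrete, here is a sketch of the minimum longitudinal following distance from responsibility-sensitive safety (RSS), used as a check on a learned planner's output. The function and parameter names and the numeric values are assumptions chosen for illustration; a deployed layer would also need lateral rules, calibrated parameters, and a defined fallback action.

```python
def rss_min_longitudinal_gap(v_rear, v_front, response_time,
                             a_max_accel, b_min_brake, b_max_brake):
    """Minimum safe gap (m) between a rear and a front vehicle in the same lane.

    Worst case assumed: the rear car accelerates at a_max_accel for
    response_time seconds and then brakes at only b_min_brake, while the
    front car brakes at b_max_brake. Speeds in m/s, accelerations in m/s^2.
    """
    v_after_response = v_rear + response_time * a_max_accel
    gap = (v_rear * response_time
           + 0.5 * a_max_accel * response_time ** 2
           + v_after_response ** 2 / (2.0 * b_min_brake)
           - v_front ** 2 / (2.0 * b_max_brake))
    return max(gap, 0.0)


def plan_violates_rss(actual_gap, **rss_params):
    """True if the planned following gap is smaller than the RSS minimum."""
    return actual_gap < rss_min_longitudinal_gap(**rss_params)


params = dict(v_rear=20.0, v_front=20.0, response_time=1.0,
              a_max_accel=2.0, b_min_brake=4.0, b_max_brake=8.0)
print(rss_min_longitudinal_gap(**params))  # 56.5
print(plan_violates_rss(40.0, **params))   # True
```

A check of this kind can reject or override a learned trajectory without needing to understand how it was produced, which is why such layers are attractive as a complement to planners that offer no formal guarantees of their own.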