Open Questions: End-to-End Driving
Stream-specific open questions for the end-to-end autonomous driving pillar. See Open Questions for the full tree across all streams.
Architectural design
- Unified vs. decoupled VLA: Will EMMA's "everything as language tokens" or Senna's decoupled reasoning + planning prove more scalable and deployable? EMMA is simpler; Senna is more interpretable. Neither has been tested at full production scale with safety guarantees.
- Parallel vs. sequential task processing: DriveTransformer parallelizes perception-prediction-planning via shared attention, while UniAD processes them sequentially with joint training. Does parallel processing lose important causal structure, or is the efficiency gain worth it?
- Intermediate supervision necessity: Type 3 systems (UniAD, VAD) use explicit 3D detection/prediction supervision. Type 4 systems (EMMA) largely do not. Is intermediate supervision a necessary scaffold for safety-certifiable systems, or an unnecessary constraint that limits scaling?
Planning paradigm
- Generative vs. discriminative planning: DiffusionDrive (diffusion, 45 FPS) and GoalFlow (flow matching) represent multimodal futures natively. VADv2 uses a discrete action vocabulary. SparseDrive uses sparse factorized scoring. Which paradigm best handles the multimodality of real driving?
- RL vs. imitation ceiling: CarPlanner is the first RL planner to beat IL+rules on nuPlan. Does this signal a fundamental ceiling in imitation learning for driving, paralleling the LLM trajectory (pretraining → SFT → RLHF)?
- Scaling laws for driving: DriveGPT (Cruise) demonstrated that LLM-style scaling laws hold for driving behavior models. Do these laws continue to hold at larger scale, and what is the compute-optimal data-to-parameter ratio for driving?
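One way to make "do these laws continue to hold" testable is to fit a power law on small training runs and compare its extrapolation against a larger held-out run. The sketch below is a minimal illustration under assumptions: the functional form `loss ≈ a * compute**(-b)`, the function name `fit_power_law`, and all data points are synthetic and hypothetical, not numbers from DriveGPT or any other paper.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss ≈ a * compute**(-b), done as a line fit in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points generated from loss = 5.0 * C**-0.1 (made up for illustration).
compute = [1e15, 1e16, 1e17, 1e18, 1e19]
loss = [5.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate one order of magnitude beyond the fitted range.
predicted_loss = a * 1e20 ** -b
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e20 FLOPs: {predicted_loss:.4f}")
```

If the measured loss of the larger run falls well above the extrapolated value, the fitted law has stopped holding in that regime. Answering the data-to-parameter question would need separate fits over model size and dataset size, which this sketch does not attempt.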
Evaluation and deployment
- Benchmark adequacy: Are NAVSIM and Bench2Drive sufficient for evaluating 2025-era E2E systems? "Is Ego Status All You Need?" exposed fatal flaws in open-loop nuScenes evaluation. Do current closed-loop benchmarks have similar blind spots?
- Temporal consistency: MomAD shows E2E planners produce jittery trajectories. Is momentum-aware planning a sufficient fix, or is temporal inconsistency a deeper architectural problem?
- Real-time inference: Most VLA-based E2E systems exceed real-time latency budgets. DiMA's approach (discard LLM at inference) and DiffusionDrive's truncation (20→2 steps) are workarounds. Is there a principled architecture that is both powerful and real-time?
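To make the temporal-consistency question concrete, "jitter" can be measured as the disagreement between consecutive plans at the future timestamps they both cover. The sketch below is an illustration under assumptions, not MomAD's method: plans are lists of `(x, y)` waypoints in a shared world frame at a fixed time step, and the names `plan_jitter` and `blend_with_previous` are hypothetical. The blending function is a naive smoothing baseline, included only to show what a momentum-style fix does to the metric.

```python
import math

def plan_jitter(prev_plan, curr_plan, steps_elapsed=1):
    """Mean distance between waypoints two consecutive plans predict for the same future time."""
    overlap_prev = prev_plan[steps_elapsed:]
    overlap_curr = curr_plan[:len(overlap_prev)]
    dists = [math.dist(p, c) for p, c in zip(overlap_prev, overlap_curr)]
    return sum(dists) / len(dists)

def blend_with_previous(prev_plan, curr_plan, momentum=0.5, steps_elapsed=1):
    """Naive baseline: pull each overlapping waypoint toward the previous plan."""
    blended = []
    for i, (cx, cy) in enumerate(curr_plan):
        j = i + steps_elapsed
        if j < len(prev_plan):
            px, py = prev_plan[j]
            blended.append((momentum * px + (1 - momentum) * cx,
                            momentum * py + (1 - momentum) * cy))
        else:
            # No earlier prediction exists for this timestamp; keep the new waypoint.
            blended.append((cx, cy))
    return blended

# Made-up example: the new plan shifts 0.4 m sideways relative to the previous one.
prev_plan = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
curr_plan = [(2.0, 0.4), (3.0, 0.4), (4.0, 0.4), (5.0, 0.4)]

raw_jitter = plan_jitter(prev_plan, curr_plan)
smoothed_plan = blend_with_previous(prev_plan, curr_plan)
smoothed_jitter = plan_jitter(prev_plan, smoothed_plan)
```

Post-hoc blending lowers the metric but also delays reactions to genuinely new observations, which is one reason the question of whether inconsistency needs an architectural fix stays open.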
Partially answered
- Q4 (Generative planning): Evidence increasingly favors generative planning. DiffusionDrive, GoalFlow, and GenAD all show strong results. VADv2's vocabulary-based approach bridges generative and discriminative planning.
- Q5 (RL ceiling): CarPlanner and the LLM analogy (InstructGPT → DPO → R1) suggest SFT has a ceiling. But driving RL reward design remains much harder than language reward design.
- Q7 (Benchmarks): NAVSIM v2 addresses some limitations with pseudo-simulation, but reactive multi-agent evaluation at scale remains unsolved.
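As a rough illustration of why a vocabulary-based planner sits between the two paradigms in Q4: it scores a fixed set of candidates (discriminative), but normalizing those scores yields a distribution over futures that can be sampled (generative). The sketch below is a toy under assumptions; the three-entry vocabulary, the scores, and the names are hypothetical and far smaller than anything a real system would use, and it is not VADv2's implementation.

```python
import math
import random

def softmax(scores):
    """Turn raw candidate scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical fixed vocabulary of candidate trajectories (labels stand in for waypoints).
vocabulary = ["keep_lane", "nudge_left", "brake"]

# Hypothetical scores a learned model might assign to each candidate for one scene.
scores = [2.0, 0.5, 1.0]

probs = softmax(scores)

# Discriminative use: pick the single highest-scoring candidate.
best = vocabulary[probs.index(max(probs))]

# Generative use: sample a candidate in proportion to its probability.
rng = random.Random(0)
sampled = rng.choices(vocabulary, weights=probs, k=1)[0]
```

The trade-off is coverage: the distribution can only place mass on trajectories already in the vocabulary, whereas diffusion or flow-matching planners generate trajectories outside any fixed set.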
Key papers for this stream
| Paper | Relevance |
|---|---|
| Planning-Oriented Autonomous Driving | UniAD: joint modular E2E reference |
| DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving | Parallel-task E2E SOTA |
| EMMA: End-to-End Multimodal Model for Autonomous Driving | Everything-as-tokens at industry scale |
| Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | Decoupled reasoning + planning |
| DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving | Real-time diffusion planning |
| GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation | Flow matching for trajectories |
| CarPlanner: Consistent Autoregressive RL Planner for Autonomous Driving | First RL planner beating IL+rules |
| DriveGPT: Scaling Autoregressive Behavior Models for Driving | Scaling laws for driving |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation | Evaluation benchmark |