Open Questions: End-to-End Driving

Stream-specific open questions for the end-to-end autonomous driving pillar. See Open Questions for the full tree across all streams.

Architectural design

  1. Unified vs. decoupled VLA: Will EMMA's "everything as language tokens" or Senna's decoupled reasoning + planning prove more scalable and deployable? EMMA is simpler; Senna is more interpretable. Neither has been tested at full production scale with safety guarantees.

  2. Parallel vs. sequential task processing: DriveTransformer parallelizes perception-prediction-planning via shared attention, while UniAD processes them sequentially with joint training. Does parallel processing lose important causal structure, or is the efficiency gain worth it?

  3. Intermediate supervision necessity: Type 3 systems (UniAD, VAD) use explicit 3D detection/prediction supervision. Type 4 systems (EMMA) largely do not. Is intermediate supervision a necessary scaffold for safety-certifiable systems, or an unnecessary constraint that limits scaling?
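
To make "everything as language tokens" in the unified vs. decoupled question concrete: a unified model emits its planned trajectory as ordinary text, which downstream code must parse back into numbers. The sketch below round-trips waypoints through a text format. The format and the names waypoints_to_text and text_to_waypoints are hypothetical, chosen for illustration; this is not EMMA's actual serialization.

```python
import re

# Hypothetical text format: "(x,y) (x,y) ..." with two decimal places.
_PAIR = re.compile(r"\((-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)\)")


def waypoints_to_text(waypoints):
    """Serialize (x, y) waypoints in metres as a plain-text string."""
    return " ".join(f"({x:.2f},{y:.2f})" for x, y in waypoints)


def text_to_waypoints(text):
    """Parse the string back into floats, ignoring anything malformed."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


plan = [(1.0, 0.0), (2.5, -0.25)]
encoded = waypoints_to_text(plan)
decoded = text_to_waypoints(encoded)
```

The round trip quantizes to the chosen precision (here 1 cm), and a malformed generation silently drops waypoints. A decoupled design with a dedicated planning head avoids that failure mode by construction.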

Planning paradigm

  4. Generative vs. discriminative planning: DiffusionDrive (diffusion, 45 FPS) and GoalFlow (flow matching) represent multimodal futures natively. VADv2 uses a discrete action vocabulary. SparseDrive uses sparse factorized scoring. Which paradigm best handles the multimodality of real driving?

  5. RL vs. imitation ceiling: CarPlanner is the first RL planner to beat IL+rules on nuPlan. Does this signal a fundamental ceiling in imitation learning for driving, paralleling the LLM trajectory (pretraining → SFT → RLHF)?

  6. Scaling laws for driving: DriveGPT (Cruise) demonstrated that LLM-style scaling laws hold for driving behavior models. Do these laws continue to hold at larger scale, and what is the compute-optimal data-to-parameter ratio for driving?
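
One way to make "do these laws continue to hold" testable is to fit a power law to loss versus scale, then check whether a larger run lands on the extrapolated line. Below is a minimal sketch using a least-squares fit in log-log space. The data points are synthetic placeholders, not DriveGPT results, and fit_power_law and extrapolation_gap are illustrative names.

```python
import math


def fit_power_law(scales, losses):
    """Fit loss = a * scale**(-b) by ordinary least squares on log-log data."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def extrapolation_gap(a, b, scale, observed_loss):
    """Log-ratio of observed to predicted loss; > 0 means worse than the fit."""
    predicted = a * scale ** (-b)
    return math.log(observed_loss / predicted)


# Synthetic example: parameter counts and validation losses (made up).
scales = [1e6, 1e7, 1e8, 1e9]
losses = [3.0 * s ** -0.1 for s in scales]
a, b = fit_power_law(scales, losses)
gap = extrapolation_gap(a, b, 1e10, 3.0 * 1e10 ** -0.1)
```

A gap that stays positive as scale grows would suggest the fitted law is breaking down. Answering the compute-optimal question would need the same kind of fit done jointly over data and parameter count.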

Evaluation and deployment

  7. Benchmark adequacy: Are NAVSIM and Bench2Drive sufficient for evaluating 2025-era E2E systems? "Is Ego Status All You Need?" exposed fatal flaws in open-loop nuScenes evaluation. Do current closed-loop benchmarks have similar blind spots?

  8. Temporal consistency: MomAD shows E2E planners produce jittery trajectories. Is momentum-aware planning a sufficient fix, or is temporal inconsistency a deeper architectural problem?

  9. Real-time inference: Most VLA-based E2E systems exceed real-time latency budgets. DiMA's approach (discard LLM at inference) and DiffusionDrive's truncation (20→2 steps) are workarounds. Is there a principled architecture that is both powerful and real-time?
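
For the temporal consistency question, "jittery" can be quantified by comparing consecutive plans over the timesteps they share. The sketch below assumes both plans are in a common world frame with one waypoint per planning step, so the plan made at step t+1 overlaps the previous plan shifted by one waypoint. plan_inconsistency is a hypothetical metric for illustration, not the one used by MomAD.

```python
import math


def plan_inconsistency(prev_plan, next_plan):
    """Mean distance (m) between two consecutive plans on shared timesteps.

    prev_plan covers steps t+1..t+H and next_plan covers t+2..t+H+1, so
    prev_plan[1:] and next_plan[:-1] refer to the same future instants.
    """
    shared_prev = prev_plan[1:]
    shared_next = next_plan[:len(shared_prev)]
    if not shared_prev or len(shared_next) != len(shared_prev):
        raise ValueError("plans must share at least one timestep")
    total = sum(math.dist(p, q) for p, q in zip(shared_prev, shared_next))
    return total / len(shared_prev)


# Made-up example: a straight plan, then a consistent and a jittery replan.
plan_t = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
smooth_replan = [(2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
jittery_replan = [(2.0, 0.3), (3.0, 0.4), (4.0, 0.0)]

smooth_score = plan_inconsistency(plan_t, smooth_replan)
jittery_score = plan_inconsistency(plan_t, jittery_replan)
```

Some replanning shift is legitimate when the scene changes, so a score like this is only meaningful relative to a baseline or conditioned on scene dynamics.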

Partially answered

  • Q4 (Generative planning): Evidence increasingly favors generative planning. DiffusionDrive, GoalFlow, and GenAD all report strong results. VADv2's vocabulary-based approach sits between the generative and discriminative paradigms.
  • Q5 (RL ceiling): CarPlanner and the LLM analogy (InstructGPT → DPO → R1) suggest that supervised imitation, like SFT, has a ceiling. But driving RL reward design remains much harder than language reward design.
  • Q7 (Benchmarks): NAVSIM v2 addresses some limitations with pseudo-simulation, but reactive multi-agent evaluation at scale remains unsolved.

Key papers for this stream

Paper | Relevance
Planning-oriented Autonomous Driving | UniAD: joint modular E2E reference
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving | Parallel-task E2E SOTA
EMMA: End-to-End Multimodal Model for Autonomous Driving | Everything-as-tokens at industry scale
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | Decoupled reasoning + planning
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving | Real-time diffusion planning
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation | Flow matching for trajectories
CarPlanner: Consistent Autoregressive RL Planner for Autonomous Driving | First RL planner beating IL+rules
DriveGPT: Scaling Autoregressive Behavior Models for Driving | Scaling laws for driving
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation | Evaluation benchmark