📄 Read on arXiv

Overview

DiMA addresses the core tension in autonomous driving between vision-based planners (efficient but fragile on rare scenarios) and LLM-based approaches (strong reasoning but prohibitively expensive at inference). Rather than choosing one paradigm, DiMA bridges them through knowledge distillation: it jointly trains a vision-based planner alongside a multi-modal LLM, then discards the LLM at inference time, retaining only the efficient planner that has absorbed the LLM's reasoning capabilities.

The key innovation is joint training rather than post-hoc distillation. The vision-based scene encoder is trained simultaneously with the MLLM through four tasks: visual question answering, trajectory estimation, KL-divergence-based knowledge distillation, and three surrogate tasks (masked token reconstruction, future token prediction, and scene editing). The BEAM token representation structures the scene into BEV, ego-vehicle, agent, and map components, providing explicit modeling of driving scene elements.
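
As a rough illustration of the distillation objective, the sketch below treats the KL term as a divergence between the planner's and the MLLM's token-level features. The PyTorch interface, the learned projection into the teacher's feature space, and the softmax normalization are assumptions for illustration, not the paper's exact formulation.

import torch.nn.functional as F

def distillation_loss(planner_feats, mllm_feats, proj, temperature=1.0):
    """KL divergence aligning planner features with MLLM features.

    planner_feats: (B, N, D_p) hidden features from the planning transformer
    mllm_feats:    (B, N, D_m) hidden features from the MLLM branch
    proj:          nn.Linear(D_p, D_m) mapping planner features into the
                   MLLM feature space (an illustrative assumption)
    """
    student = proj(planner_feats) / temperature
    # Detaching the MLLM features is one design choice; in joint training,
    # gradients could also be allowed to flow into the MLLM branch.
    teacher = mllm_feats.detach() / temperature

    # Interpret each token's feature vector as logits and match the
    # resulting distributions -- one common way to apply KL to features.
    log_p = F.log_softmax(student, dim=-1)
    q = F.softmax(teacher, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")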

DiMA achieves 37% reduction in L2 trajectory error, 80% reduction in collision rate, and 44% improvement in long-tail scenarios compared to vision-only baselines on nuScenes. Critically, at inference the system runs as a lightweight vision planner with no LLM overhead.

Key Contributions

  • Joint training framework that distills MLLM reasoning into a vision planner during training, discarding the LLM at inference for zero additional cost
  • BEAM token structured representation (BEV + Ego + Agent + Map) providing explicit scene decomposition for distillation
  • Three surrogate tasks that enhance representation learning: masked token reconstruction, future token prediction, and counterfactual scene editing
  • 80% collision rate reduction and 44% improvement on long-tail scenarios (overtaking, three-point turns) vs vision-only baselines
  • Demonstrates that the efficiency-vs-reasoning tradeoff can be resolved through distillation rather than architectural compromise

Architecture / Method

┌────────────────────────────────────────────────────────────────────┐
│                      DiMA Training Framework                       │
│                                                                    │
│  Multi-view      ┌────────────────────────────────┐                │
│  Cameras ───────►│      Vision Scene Encoder      │                │
│                  │     (produces BEAM tokens)     │                │
│                  │  B=BEV  E=Ego  A=Agent  M=Map  │                │
│                  └─────────┬─────────────┬────────┘                │
│                            │             │                         │
│           ┌────────────────┘             └─────────┐               │
│           ▼                                        ▼               │
│  ┌──────────────────────┐  KL divergence ┌────────────────────┐    │
│  │ Planning Transformer │◄──────────────►│   LLaVA-v1.5 7B    │    │
│  │ (lightweight)        │  distillation  │  (MLLM Teacher)    │    │
│  │                      │                │                    │    │
│  │ Task: trajectory     │                │ Tasks:             │    │
│  │ estimation           │                │ - VQA              │    │
│  │                      │                │ - Trajectory est.  │    │
│  │                      │                │ - Surrogate:       │    │
│  │                      │                │   · Masked recon   │    │
│  │                      │                │   · Future pred    │    │
│  │                      │                │   · Scene edit     │    │
│  └──────────┬───────────┘                └────────────────────┘    │
│             │                                      ✕               │
│             ▼                                (discarded at         │
│     Planned Trajectory                        inference)           │
│  ┌─────────────────────┐                                           │
│  │  INFERENCE ONLY:    │                                           │
│  │  Encoder + Planner  │  ◄── No LLM overhead at deploy time       │
│  └─────────────────────┘                                           │
└────────────────────────────────────────────────────────────────────┘

Architecture

The framework has two components during training:

Vision-Based Planner (retained at inference):
  • Scene encoder producing structured BEAM tokens
  • Planning transformer for trajectory generation
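
A minimal sketch of this deployed path, assuming PyTorch and illustrative module interfaces (the class and shape annotations below are not from the released code):

import torch
import torch.nn as nn

class VisionPlanner(nn.Module):
    """Inference-time model: scene encoder + planning transformer, no LLM."""

    def __init__(self, encoder: nn.Module, planner: nn.Module):
        super().__init__()
        self.encoder = encoder  # multi-view images -> structured BEAM tokens
        self.planner = planner  # BEAM tokens -> planned ego trajectory

    @torch.no_grad()
    def forward(self, multi_view_images: torch.Tensor) -> torch.Tensor:
        beam_tokens = self.encoder(multi_view_images)   # (B, N_tokens, D)
        return self.planner(beam_tokens)                # (B, T_future, 2) waypoints

Because the MLLM branch exists only in the training graph, deployment latency is set by the encoder and planning transformer alone.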

Multi-Modal LLM (discarded at inference):
  • LLaVA-v1.5-7B processing structured BEAM representations
  • Handles four training objectives:

  1. Visual Question Answering: Scene understanding through language
  2. Trajectory Estimation: Direct planning supervision
  3. Knowledge Distillation: KL divergence minimization between MLLM and planner hidden features, aligning internal representations
  4. Surrogate Tasks:
     • Masked token reconstruction: Reconstruct randomly masked BEAM tokens (forces robust encoding)
     • Future token prediction: Forecast future scene states (temporal reasoning)
     • Scene editing: Add/remove agents to teach counterfactual reasoning about hypothetical scenarios
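
To make the joint objective concrete, here is a hedged sketch of how the four objectives above might be combined in a single training step. The module interfaces (mllm.answer, mllm.plan, mllm.reconstruct, planner.hidden), the loss weights, and the 30% masking ratio are assumptions for illustration rather than the paper's implementation.

import torch
import torch.nn.functional as F

def training_step(batch, encoder, planner, mllm, distill_loss, w):
    beam = encoder(batch["images"])                       # structured BEAM tokens

    # 1. VQA: the MLLM answers DriveLM-style questions about the scene
    vqa_logits = mllm.answer(beam, batch["question_ids"])
    l_vqa = F.cross_entropy(vqa_logits.flatten(0, 1), batch["answer_ids"].flatten())

    # 2. Trajectory estimation: both branches are supervised on ego waypoints
    l_traj = F.l1_loss(planner(beam), batch["gt_trajectory"]) \
           + F.l1_loss(mllm.plan(beam), batch["gt_trajectory"])

    # 3. Knowledge distillation: align planner and MLLM hidden features (KL term)
    l_distill = distill_loss(planner.hidden(beam), mllm.hidden(beam))

    # 4. Surrogate task example -- masked token reconstruction: hide a random
    #    subset of BEAM tokens and reconstruct them from the rest
    mask = torch.rand(beam.shape[:2], device=beam.device) < 0.3
    l_surrogate = F.mse_loss(mllm.reconstruct(beam, mask)[mask], beam[mask])

    return (w["vqa"] * l_vqa + w["traj"] * l_traj
            + w["distill"] * l_distill + w["surrogate"] * l_surrogate)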

BEAM Token Structure:
  • BEV tokens: map and spatial layout
  • Ego tokens: autonomous vehicle state
  • Agent tokens: surrounding dynamic objects
  • Map tokens: additional road elements
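
A minimal container for the four token groups might look as follows; the shapes are assumptions, not the paper's exact dimensions:

from dataclasses import dataclass
import torch

@dataclass
class BEAMTokens:
    """Structured scene representation fed to both the planner and the MLLM."""
    bev:   torch.Tensor   # (B, H*W, D)  bird's-eye-view grid features
    ego:   torch.Tensor   # (B, 1, D)    ego-vehicle state token
    agent: torch.Tensor   # (B, N_a, D)  one token per surrounding agent
    map:   torch.Tensor   # (B, N_m, D)  map / road-element tokens

    def as_sequence(self) -> torch.Tensor:
        """Concatenate the groups into one token sequence for a transformer."""
        return torch.cat([self.bev, self.ego, self.agent, self.map], dim=1)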

Training uses the AdamW optimizer on 8× NVIDIA A100 GPUs, with the nuScenes and DriveLM datasets.
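
Reusing the module names from the sketches above, the joint optimizer setup would look roughly like this (the learning rate and weight decay are placeholders, not the paper's values):

import torch

# All branches are optimized together; the MLLM participates in training
# even though it is discarded at inference.
params = (list(encoder.parameters()) + list(planner.parameters())
          + list(mllm.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)  # placeholder values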

Results

nuScenes Planning Benchmark

Model         Type         L2 Error (m)   Collision Rate (%)
UniAD         Vision E2E   --             --
VAD           Vision E2E   baseline       baseline
PARA-Drive    Vision E2E   --             --
TOKEN-Drive   LLM-based    --             --
DiMA          Distilled    -37% vs VAD    -80% vs VAD

Long-Tail Scenarios

Scenario            DiMA L2 (m)   VAD L2 (m)   Improvement
Overtaking          0.66          1.06         -37.7%
Three-point turn    1.05          1.57         -33.1%
Overall long-tail   --            --           -44%
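
The improvement column follows directly from the raw L2 values; a quick check:

# Relative L2 reduction vs. VAD for the long-tail rows above
for name, dima, vad in [("Overtaking", 0.66, 1.06), ("Three-point turn", 1.05, 1.57)]:
    print(f"{name}: {100 * (vad - dima) / vad:.1f}% lower L2")
# Overtaking: 37.7% lower L2
# Three-point turn: 33.1% lower L2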

DiMA also demonstrates VQA capabilities, enabling language-guided scene reasoning that improves interpretability and lets users query the system's decisions -- though this requires keeping the MLLM around, as an optional inference-time mode for debugging.

Limitations & Open Questions

  • The distillation framework requires training with a full 7B MLLM, increasing training cost even though inference is cheap -- this limits rapid iteration
  • Evaluated on nuScenes open-loop metrics only; closed-loop evaluation would better validate the long-tail improvements
  • Whether the BEAM token decomposition generalizes beyond nuScenes-style urban driving (e.g., highway, rural, construction zones) is untested

Connections