
Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation

:page_facing_up: Read on arXiv

Overview

Hydra-MDP addresses a fundamental limitation of imitation learning for autonomous driving: standard behavior cloning learns only to mimic human demonstrations, with no explicit optimization for safety-critical metrics like collision avoidance, drivable area compliance, or time-to-collision. The paper proposes a multi-target knowledge distillation framework where multiple "teacher" signals -- both human demonstrations and rule-based safety evaluators -- are distilled into a single student network through specialized prediction heads (the "Hydra" heads).

The core insight is that autonomous driving evaluation is inherently multi-dimensional (safety, compliance, comfort, progress), and collapsing these into a single scalar score before distillation loses critical information. Instead, Hydra-MDP trains the student to predict each metric independently via separate Hydra Prediction Heads, then selects the trajectory that optimizes a composite score at inference time. This preserves the multi-objective structure of the problem through the entire pipeline.
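
The per-metric scoring plus composite selection described above can be sketched in a few lines; the metric scores, weights, and the log-space aggregation below are illustrative assumptions, not the paper's exact values or formula:

```python
import numpy as np

# Hypothetical per-metric scores for 5 candidate trajectories
# (values and weights are made up for illustration).
scores = {
    "NC":  np.array([0.99, 0.40, 0.95, 0.90, 0.97]),  # no-collision prob
    "DAC": np.array([0.98, 0.99, 0.60, 0.95, 0.97]),  # drivable-area compliance
    "TTC": np.array([0.90, 0.95, 0.92, 0.50, 0.93]),  # time-to-collision margin
    "C":   np.array([0.80, 0.90, 0.85, 0.88, 0.95]),  # comfort
    "EP":  np.array([0.70, 0.95, 0.90, 0.85, 0.60]),  # ego progress
}
weights = {"NC": 1.0, "DAC": 1.0, "TTC": 1.0, "C": 0.5, "EP": 1.0}

# One possible composite: weighted sum of log-scores, so a near-zero
# safety metric effectively vetoes a trajectory (multiplicative in
# probability space). Each metric stays separately predicted until here.
composite = sum(w * np.log(scores[m] + 1e-9) for m, w in weights.items())
best = int(np.argmax(composite))  # index of the selected trajectory
```

Note how trajectory 1, despite the best progress score, is eliminated by its low NC score; collapsing the metrics into one scalar before training would hide exactly this structure.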

Hydra-MDP won first place in the NAVSIM challenge, achieving 86.5 PDM Score with the best single model and up to 91.0 with larger vision backbones. The results demonstrate strong scalability -- unlike prior work suggesting diminishing returns, performance consistently improves with larger backbones and richer planning vocabularies.

Key Contributions

  • Multi-target Hydra distillation: Separate prediction heads for each closed-loop metric (NC, DAC, TTC, Comfort, EP), avoiding information loss from score aggregation
  • Planning vocabulary via K-means clustering: Discretizes continuous trajectory space into 4,096–8,192 representative trajectory clusters (selected from 700,000 expert trajectories via K-means), converting planning into a scoring/selection problem
  • Offline simulation for teacher labels: Runs ground-truth-perception simulation for every trajectory candidate in the vocabulary, generating per-metric supervision without online simulation
  • Elimination of non-differentiable post-processing: The neural network directly learns the relationship between sensor observations and safety metrics, enabling end-to-end gradient flow
  • Scaling behavior: Demonstrates consistent improvements with larger vision backbones and planning vocabularies, contradicting prior claims of diminishing returns

Architecture / Method

┌──────────────────────────────────────────────────────────────────┐
│                    HYDRA-MDP PIPELINE                             │
│                                                                  │
│  ┌──────────┐  ┌──────────┐                                     │
│  │  Camera   │  │  LiDAR   │                                     │
│  └────┬─────┘  └────┬─────┘                                     │
│       │              │                                            │
│       ▼              ▼                                            │
│  ┌──────────────────────────────┐                                │
│  │   TransFuser Perception       │                                │
│  │   (image + LiDAR fusion)      │                                │
│  │   via transformer layers      │                                │
│  └──────────────┬───────────────┘                                │
│                 │ Environmental tokens                            │
│                 ▼                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │   Trajectory Scoring (over Planning Vocabulary)           │   │
│  │   4,096–8,192 candidates (K-means of 700K expert trajs)   │   │
│  │                                                           │   │
│  │   ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐│   │
│  │   │NC Head  │ │DAC Head│ │TTC Head│ │C Head  │ │EP Head ││   │
│  │   │(collisn)│ │(drivbl)│ │(time-  │ │(comfrt)│ │(progrs)││   │
│  │   │         │ │        │ │ to-col)│ │        │ │        ││   │
│  │   └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘│   │
│  │       │          │          │          │          │       │   │
│  │       └──────────┴──────┬───┴──────────┴──────────┘       │   │
│  │                         ▼                                 │   │
│  │              Composite Score per Trajectory                │   │
│  └──────────────────────┬───────────────────────────────────┘   │
│                         ▼                                        │
│              ┌──────────────────────┐                            │
│              │  Best Trajectory      │                            │
│              │  (argmax composite)   │                            │
│              └──────────────────────┘                            │
│                                                                  │
│  TEACHER SUPERVISION (training only):                            │
│  ┌──────────────────────────────────────────────────────┐       │
│  │  Offline Sim (GT perception) ──► per-metric labels    │       │
│  │  Human demos ──► expert trajectory targets            │       │
│  └──────────────────────────────────────────────────────┘       │
└──────────────────────────────────────────────────────────────────┘


Perception Network

Hydra-MDP builds on the TransFuser architecture for sensor fusion. The perception network processes front-view camera images and LiDAR point clouds through separate backbones (e.g., ResNet or larger vision models), fusing them via transformer layers to produce environmental tokens encoding rich semantic information about the driving scene.
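
A rough, self-contained sketch of transformer-style fusion between image and LiDAR tokens (a single attention step with no learned projections; token counts and dimensions are illustrative, not TransFuser's actual configuration):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no learned projections)."""
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
img_tokens   = rng.normal(size=(16, 64))   # stand-in for image-backbone features
lidar_tokens = rng.normal(size=(16, 64))   # stand-in for LiDAR-backbone features

# One fusion step: each modality attends to the other, then the
# results are concatenated into a single set of environmental tokens.
fused_img   = img_tokens   + attention(img_tokens, lidar_tokens, lidar_tokens)
fused_lidar = lidar_tokens + attention(lidar_tokens, img_tokens, img_tokens)
env_tokens  = np.concatenate([fused_img, fused_lidar], axis=0)
```

In the real architecture this fusion is repeated across several transformer layers with learned projections; the sketch only shows the cross-modal attention pattern.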

Planning Vocabulary

Rather than predicting continuous trajectories directly, Hydra-MDP discretizes the action space with a planning vocabulary: K-means clustering reduces ~700,000 expert trajectories from the training dataset to 4,096–8,192 representative trajectories (the V4096 and V8192 variants). Each trajectory in the vocabulary consists of 40 timesteps of (x, y, heading) coordinates over a 4-second planning horizon. At inference, the model scores every trajectory in the vocabulary and selects the highest-scoring one.
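
Vocabulary construction can be sketched with plain Lloyd's K-means on flattened trajectories; the synthetic data and toy sizes below stand in for the ~700K expert trajectories and 4,096–8,192 clusters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for expert trajectories: 500 trajectories of
# 40 (x, y, heading) waypoints, flattened to 120-dim vectors.
trajs = rng.normal(size=(500, 40 * 3))

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the cluster centers (the vocabulary)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each trajectory to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned trajectories.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

vocab = kmeans(trajs, k=8)  # toy planning vocabulary of 8 candidates
```

At scale, a library implementation (e.g. mini-batch K-means) would replace this loop, but the vocabulary it produces plays the same role: a fixed candidate set that turns planning into scoring and selection.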

Multi-Target Hydra Heads

The trajectory decoder uses a set of specialized prediction heads -- one per evaluation metric:

| Head | Metric | Description |
|------|--------|-------------|
| NC Head | No at-fault Collisions | Predicts collision-free probability |
| DAC Head | Drivable Area Compliance | Predicts road-boundary compliance |
| TTC Head | Time to Collision | Predicts time-to-collision safety margin |
| C Head | Comfort | Predicts jerk/acceleration comfort score |
| EP Head | Ego Progress | Predicts forward progress along the route |

Each head is trained with supervision from offline simulation: for every trajectory in the planning vocabulary, a rule-based simulator with ground-truth perception evaluates the trajectory against each metric, producing per-metric labels. The student model learns to predict these scores from raw sensor inputs.
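
The multi-target supervision can be sketched as a sum of per-metric binary cross-entropy terms, assuming the offline simulator emits a pass/fail label per metric for each vocabulary trajectory (the labels and predictions below are made up for illustration):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probabilities p and 0/1 labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

# Hypothetical simulator labels for 4 vocabulary trajectories:
# 1 = trajectory passes the metric under ground-truth perception.
sim_labels = {
    "NC":  np.array([1.0, 0.0, 1.0, 1.0]),
    "DAC": np.array([1.0, 1.0, 0.0, 1.0]),
    "TTC": np.array([1.0, 1.0, 1.0, 0.0]),
}

# Hypothetical student predictions (one Hydra head per metric).
preds = {
    "NC":  np.array([0.9, 0.2, 0.8, 0.7]),
    "DAC": np.array([0.8, 0.9, 0.3, 0.9]),
    "TTC": np.array([0.7, 0.8, 0.9, 0.4]),
}

# Multi-target distillation loss: one BCE term per metric head,
# rather than a single loss on an aggregated score.
loss = sum(bce(preds[m], sim_labels[m]) for m in sim_labels)
```

Because each head gets its own loss term, gradients for (say) drivable-area compliance never get washed out by a strong collision-avoidance signal, which is the failure mode of single-score distillation.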

Teacher-Student Paradigm

  • Human teacher: Provides expert trajectory demonstrations from log-replay data
  • Rule-based teachers: Offline simulation models that evaluate trajectory candidates against closed-loop metrics using ground truth perception

The key advantage of this paradigm is that teachers operate with perfect perception (ground truth), while the student must learn from noisy real sensor observations. This separation forces the student to develop robust perception-to-planning mappings that generalize to real-world conditions.

Results

Hydra-MDP achieved first place in the NAVSIM challenge with state-of-the-art performance across all metrics:

| Method | PDM Score | NC (%) | DAC (%) | TTC | Comfort | EP |
|--------|-----------|--------|---------|-----|---------|----|
| Hydra-MDP (best single) | 86.5 | 98.3 | 96.0 | -- | -- | -- |
| Hydra-MDP (large backbone) | 91.0 | -- | -- | -- | -- | -- |
| TransFuser baseline | 78.0 | 97.2 | 89.1 | -- | -- | 76.0 |
| Single aggregated score distillation | 80.2 | -- | -- | -- | -- | -- |

Key findings from ablations:

  • Multi-target vs. single-score distillation: Distilling a single aggregated score instead of per-metric scores causes significant performance degradation, confirming the multi-target approach is essential
  • Planning vocabulary size: Larger vocabularies consistently improve all metrics, highlighting the benefit of richer trajectory candidate sets
  • Backbone scaling: Performance scales strongly with larger vision backbones (up to 91.0 PDM), contradicting prior claims of diminishing returns in E2E driving

Limitations & Open Questions

  • Dependence on planning vocabulary: The discrete trajectory set is fixed at training time; novel maneuvers outside the vocabulary cannot be generated
  • Offline simulation fidelity: Teacher labels come from rule-based simulation with GT perception -- the quality of distillation depends on simulator fidelity
  • LiDAR dependency: The TransFuser backbone requires LiDAR input, limiting deployment to LiDAR-equipped vehicles
  • Closed-loop validation: While winning NAVSIM (a pseudo-simulation benchmark), full closed-loop and real-world deployment results are not reported
  • Vocabulary scalability: As vocabulary size grows, inference cost increases linearly; efficient retrieval or hierarchical scoring could improve scalability

Connections

Related papers in the wiki: