Overview

End-to-end autonomous driving systems suffer from a critical limitation: temporal inconsistency. Current systems operate in a "one-shot" manner, making trajectory predictions based solely on the current perception frame. This leads to vehicle trembling (unstable consecutive predictions), vulnerability to temporary occlusions, and safety concerns from abrupt trajectory changes. MomAD (Momentum-Aware Driving) addresses these challenges by incorporating two types of "momentum" into end-to-end planning.

The framework introduces Trajectory Momentum (ensuring temporal coherence by selecting candidate trajectories that align with previously executed paths) and Perception Momentum (enriching current planning with historical context to improve long-horizon understanding). Two novel technical components implement these concepts: Topological Trajectory Matching (TTM) uses Hausdorff distance to select temporally consistent trajectory candidates, while the Momentum Planning Interactor (MPI) uses LSTM-based historical query fusion with cross-attention to inject perception momentum into planning.

MomAD achieves state-of-the-art results on nuScenes (0.60m L2 error, 0.09% collision rate) and introduces a new Trajectory Prediction Consistency (TPC) metric measuring planning stability. In closed-loop evaluation on Bench2Drive, it improves success rate by 16.3% over VAD and 8.4% over SparseDrive, with 7.2% better trajectory smoothness.

Key Contributions

  • Identifies temporal inconsistency as a critical but overlooked failure mode in end-to-end driving planners
  • Introduces two forms of momentum: trajectory momentum (temporal coherence in trajectory selection) and perception momentum (historical context enrichment)
  • Topological Trajectory Matching (TTM) module using Hausdorff distance for temporally consistent multi-modal trajectory selection
  • Momentum Planning Interactor (MPI) with LSTM-based surrogate query and cross-attention for historical context fusion
  • New Trajectory Prediction Consistency (TPC) metric and Turning-nuScenes dataset for evaluating planning stability

Architecture / Method

┌─────────────────────────────────────────────────────────────────┐
│                       MomAD Architecture                        │
│                                                                 │
│  Multi-view ──► Sparse Perception ──► Instance Features         │
│  Images         Backbone              (agents + map)            │
│                                            │                    │
│                   ┌────────────────────────┴────┐               │
│                   │                             │               │
│                   ▼                             ▼               │
│        ┌─────────────────────┐          ┌───────────────┐       │
│        │ Trajectory Momentum │          │ Perception    │       │
│        │ (TTM)               │          │ Momentum      │       │
│        │                     │          │ (MPI)         │       │
│        │ K candidates ──►    │          │               │       │
│        │ Hausdorff dist.     │          │ Hist. query   │       │
│        │ vs. history ──►     │          │  ──► LSTM     │       │
│        │ Best-aligned        │          │  ──► Cross    │       │
│        │ selection           │          │      Attn.    │       │
│        └──────────┬──────────┘          └───────┬───────┘       │
│                   │                             │               │
│                   └──────────────┬──────────────┘               │
│                                  ▼                              │
│                       ┌─────────────────────┐                   │
│                       │ Refined Trajectory  │                   │
│                       │ (temporally smooth) │                   │
│                       └─────────────────────┘                   │
└─────────────────────────────────────────────────────────────────┘

Architecture

MomAD builds upon sparse perception backbones (similar to SparseDrive) with a momentum-aware planning module. Multi-view images are processed into instance features for road agents and map elements.

Topological Trajectory Matching (TTM):
  1. Generate K multi-modal candidate trajectories for the current timestep.
  2. Transform the candidates and the historical optimal trajectory into a common coordinate system via rotation and translation matrices.
  3. Apply the Hausdorff distance to measure alignment: it captures the maximum deviation between trajectory sets, ensuring both local and global alignment while being less sensitive to point-density variations.
  4. Select the candidate best aligned with the historical execution path.
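
The TTM steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (to_current_frame, hausdorff, select_candidate), the 2D waypoint representation, and the ego-motion parameters (dx, dy, dyaw) are all hypothetical and introduced here only for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_current_frame(points: Sequence[Point], dx: float, dy: float,
                     dyaw: float) -> List[Point]:
    """Map waypoints from the previous ego frame into the current ego frame.

    Assumption: (dx, dy, dyaw) is the pose of the current ego frame
    expressed in the previous ego frame, so p_cur = R(-dyaw) @ (p_prev - t).
    """
    cos_y, sin_y = math.cos(dyaw), math.sin(dyaw)
    out = []
    for x, y in points:
        tx, ty = x - dx, y - dy                      # undo the translation
        out.append((cos_y * tx + sin_y * ty,         # rotate by -dyaw
                    -sin_y * tx + cos_y * ty))
    return out


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def select_candidate(candidates: Sequence[Sequence[Point]],
                     history: Sequence[Point]) -> Tuple[int, List[float]]:
    """Return the index of the candidate closest to `history`, plus all distances."""
    distances = [hausdorff(c, history) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances


# Previous plan: straight ahead, expressed in the previous ego frame.
prev_plan = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
# Assumed ego motion since then: 1 m forward, no rotation.
history = to_current_frame(prev_plan, dx=1.0, dy=0.0, dyaw=0.0)

candidates = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],    # keep straight
    [(1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],    # veer left
    [(1.0, -0.5), (2.0, -1.5), (3.0, -3.0)], # veer right
]
best, distances = select_candidate(candidates, history)
print(best, distances)  # the straight candidate (index 0) has the smallest distance
```

In this toy case the straight candidate wins because it deviates least from the previously selected plan. A real system would likely compare only the overlapping time span of the two trajectories; that refinement is omitted here.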

Momentum Planning Interactor (MPI):
  1. Combine historical planning queries with their scores via element-wise multiplication.
  2. Process the result through an LSTM to create a surrogate multi-modal query capturing temporal evolution.
  3. Apply cross-attention: the current query attends to the historical surrogate query.
  4. Generate refined trajectories using the enriched query combined with current instance features.
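
One way to write the MPI steps as equations is shown below. The notation is an assumption made for this summary and is not taken from the paper: Q_{t-i} are the K planning queries at past step t-i, s_{t-i} their scores, T the history length, d the feature dimension, F_t the current instance features, and W_Q, W_K, W_V learned projections. Multi-head structure, residual connections, and normalization are omitted.

```latex
\begin{align}
\tilde{Q}_{t-i} &= s_{t-i} \odot Q_{t-i},
  \qquad Q_{t-i} \in \mathbb{R}^{K \times d},\; s_{t-i} \in \mathbb{R}^{K},\; i = 1, \dots, T
  && \text{(score-weighted history)} \\
\hat{Q}_{t} &= \mathrm{LSTM}\!\left(\tilde{Q}_{t-T}, \dots, \tilde{Q}_{t-1}\right)
  && \text{(surrogate query)} \\
Q'_{t} &= \mathrm{softmax}\!\left(\frac{(Q_{t} W_Q)(\hat{Q}_{t} W_K)^{\top}}{\sqrt{d}}\right) \hat{Q}_{t} W_V
  && \text{(cross-attention)} \\
\tau_{t} &= \mathrm{PlanHead}\!\left(Q'_{t},\, F_{t}\right)
  && \text{(refined trajectories)}
\end{align}
```

Here the score s_{t-i} is broadcast across the feature dimension, so higher-scoring historical modes contribute more strongly to the surrogate query.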

Robust Instance Denoising: Training-time controlled perturbations to instance features teach the model to filter perception noise.
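
The source does not specify the form of the perturbation. As a hedged illustration only, the sketch below assumes additive zero-mean Gaussian noise applied to each feature value during training; the function name perturb_instances and the noise_std parameter are hypothetical.

```python
import random
from typing import List, Optional


def perturb_instances(features: List[List[float]], noise_std: float = 0.1,
                      rng: Optional[random.Random] = None) -> List[List[float]]:
    """Return a noisy copy of per-instance feature vectors.

    Assumption: the perturbation is additive Gaussian noise with standard
    deviation `noise_std`; the actual scheme used by MomAD may differ.
    """
    if rng is None:
        rng = random.Random()
    return [[v + rng.gauss(0.0, noise_std) for v in feat] for feat in features]


# Training-time usage: the planner sees perturbed features, while the loss
# is still computed against the clean ground-truth trajectory.
clean = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
noisy = perturb_instances(clean, noise_std=0.1, rng=random.Random(0))
print(noisy)
```

At inference time the perturbation would simply be skipped, so the model receives the unmodified instance features.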

Results

nuScenes (Open-Loop)

Model         L2 Error (m)   Collision (%)   TPC (m)
UniAD         1.03           0.31            0.96
VAD           0.72           0.21            0.64
SparseDrive   0.61           0.08            0.57
MomAD         0.60           0.09            0.54

Turning-nuScenes (6-second horizon)

Metric           Improvement vs SparseDrive
Collision rate   -26%
TPC              -33.45% (better consistency)
L2 error         -25.30%

Bench2Drive (Closed-Loop)

Model         Success Rate                            Comfortness
VAD           baseline                                baseline
SparseDrive   --                                      --
MomAD         +16.3% vs VAD / +8.4% vs SparseDrive    +7.2%

Inference speed: 7.8 FPS on RTX 4090.

Limitations & Open Questions

  • The LSTM-based historical query fusion adds sequential dependency that may limit parallelization and real-time performance at higher frame rates
  • Evaluation on Bench2Drive closed-loop is encouraging but the gap between closed-loop simulation and real-world deployment remains unaddressed
  • Whether momentum-aware planning composes well with vision-language-action (VLA) systems, which have their own temporal reasoning through autoregressive generation, is unexplored

Connections