Overview
End-to-end autonomous driving systems suffer from a critical limitation: temporal inconsistency. Current systems operate in a "one-shot" manner, making trajectory predictions based solely on the current perception frame. This leads to vehicle trembling (unstable consecutive predictions), vulnerability to temporary occlusions, and safety concerns from abrupt trajectory changes. MomAD (Momentum-Aware Driving) addresses these challenges by incorporating two types of "momentum" into end-to-end planning.
The framework introduces Trajectory Momentum (ensuring temporal coherence by selecting candidate trajectories that align with previously executed paths) and Perception Momentum (enriching current planning with historical context to improve long-horizon understanding). Two novel technical components implement these concepts: Topological Trajectory Matching (TTM) uses Hausdorff distance to select temporally consistent trajectory candidates, while the Momentum Planning Interactor (MPI) uses LSTM-based historical query fusion with cross-attention to inject perception momentum into planning.
MomAD achieves state-of-the-art open-loop results on nuScenes (0.60 m L2 error, 0.54 m TPC; its 0.09% collision rate is on par with SparseDrive's 0.08%) and introduces a new Trajectory Prediction Consistency (TPC) metric that measures planning stability. In closed-loop evaluation on Bench2Drive, it improves success rate by 16.3% over VAD and 8.4% over SparseDrive, with 7.2% better trajectory smoothness (the Comfortness score).
Key Contributions
- Identifies temporal inconsistency as a critical but overlooked failure mode in end-to-end driving planners
- Introduces two forms of momentum: trajectory momentum (temporal coherence in trajectory selection) and perception momentum (historical context enrichment)
- Proposes the Topological Trajectory Matching (TTM) module, which uses Hausdorff distance for temporally consistent multi-modal trajectory selection
- Proposes the Momentum Planning Interactor (MPI), which fuses historical context through an LSTM-based surrogate query and cross-attention
- Introduces the Trajectory Prediction Consistency (TPC) metric and the Turning-nuScenes dataset for evaluating planning stability
Architecture / Method
┌───────────────────────────────────────────────────────────────┐
│                      MomAD Architecture                       │
│                                                               │
│  Multi-view ──► Sparse Perception ──► Instance Features       │
│  Images         Backbone              (agents + map)          │
│                                               │               │
│            ┌──────────────────────────────────┴──┐            │
│            │                                     │            │
│            ▼                                     ▼            │
│  ┌────────────────────┐                   ┌─────────────┐     │
│  │ Trajectory Momentum│                   │ Perception  │     │
│  │ (TTM)              │                   │ Momentum    │     │
│  │                    │                   │ (MPI)       │     │
│  │ K candidates ──►   │                   │             │     │
│  │ Hausdorff dist.    │                   │ Hist. query │     │
│  │ vs. history ──►    │                   │ ──► LSTM    │     │
│  │ Best-aligned       │                   │ ──► Cross   │     │
│  │ selection          │                   │     Attn.   │     │
│  └─────────┬──────────┘                   └──────┬──────┘     │
│            │                                     │            │
│            └──────────────────┬──────────────────┘            │
│                               ▼                               │
│                     ┌────────────────────┐                    │
│                     │ Refined Trajectory │                    │
│                     │ (temporally smooth)│                    │
│                     └────────────────────┘                    │
└───────────────────────────────────────────────────────────────┘

MomAD builds upon sparse perception backbones (similar to SparseDrive) with a momentum-aware planning module. Multi-view images are processed into instance features for road agents and map elements.
Topological Trajectory Matching (TTM):
1. Generate K multi-modal candidate trajectories for the current timestep.
2. Transform the candidates and the historical optimal trajectory into a common coordinate system via rotation and translation matrices.
3. Measure alignment with the Hausdorff distance, which captures the maximum deviation between two trajectory point sets, ensuring both local and global alignment while being less sensitive to point density variations.
4. Select the candidate best aligned with the historical execution path.
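The TTM steps can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names (`to_common_frame`, `select_candidate`), the 2D waypoint representation, and the toy ego motion (1 m forward, no rotation) are all assumptions made for illustration.

```python
import math

def to_common_frame(traj, yaw, translation):
    """Apply a 2D rigid transform p' = R(yaw) @ p + t to every waypoint."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty = translation
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in traj]

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two waypoint sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def select_candidate(candidates, history):
    """Return (index of the candidate closest to `history`, all distances)."""
    dists = [hausdorff(c, history) for c in candidates]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists

# Toy example (hypothetical numbers): the previous plan went straight ahead,
# expressed in the previous ego frame. Assume the ego has since moved 1 m
# forward without turning, so the plan shifts back by 1 m in the current frame.
prev_plan = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
history = to_common_frame(prev_plan, yaw=0.0, translation=(-1.0, 0.0))

candidates = [
    [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1), (3.0, 0.1)],     # keep straight
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],     # swerve left
    [(0.0, 0.0), (1.0, -0.5), (2.0, -1.5), (3.0, -3.0)],  # swerve right
]
best, dists = select_candidate(candidates, history)
print(best, [round(d, 2) for d in dists])
```

In this toy case the straight candidate is selected, because its worst-case deviation from the previous plan is the smallest. The brute-force distance is quadratic in the number of waypoints, which is acceptable for the short horizons used in planning.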
Momentum Planning Interactor (MPI):
1. Combine historical planning queries with their scores via element-wise multiplication.
2. Process the result through an LSTM to create a surrogate multi-modal query capturing temporal evolution.
3. Apply cross-attention: the current query attends to the historical surrogate query.
4. Generate refined trajectories using the enriched query combined with current instance features.
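Steps 1 to 3 of the MPI can be traced at the tensor-shape level with a forward-pass-only NumPy sketch. Everything concrete here is an assumption rather than the paper's configuration: the sizes `D`, `K`, `T`, the randomly initialized weights, the use of the LSTM's final hidden state as the surrogate query, single-head attention, and the residual addition at the end. Step 4 (trajectory decoding with instance features) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, T = 16, 6, 3  # hypothetical: feature dim, planning modes, history length

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(seq, W, U, b):
    """Run a single-layer LSTM over seq (T, K, D); return the final hidden state (K, D)."""
    h = np.zeros(seq.shape[1:])
    c = np.zeros_like(h)
    for x in seq:
        gates = x @ W + h @ U + b                 # (K, 4D)
        i, f, g, o = np.split(gates, 4, axis=-1)  # input, forget, cell, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def cross_attention(query, memory, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: query (K, D) attends to memory (K, D)."""
    Q, Km, V = query @ Wq, memory @ Wk, memory @ Wv
    logits = Q @ Km.T / np.sqrt(Q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Random stand-ins for learned weights and for real planning queries.
W = rng.normal(scale=0.1, size=(D, 4 * D))
U = rng.normal(scale=0.1, size=(D, 4 * D))
b = np.zeros(4 * D)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))

hist_queries = rng.normal(size=(T, K, D))  # planning queries from T past frames
hist_scores = rng.uniform(size=(T, K, 1))  # per-mode confidence scores
curr_query = rng.normal(size=(K, D))       # current planning query

weighted = hist_queries * hist_scores                 # step 1: score weighting
surrogate = lstm_last_hidden(weighted, W, U, b)       # step 2: temporal summary
context, attn = cross_attention(curr_query, surrogate, Wq, Wk, Wv)  # step 3
enriched = curr_query + context                       # residual add (assumption)
print(surrogate.shape, attn.shape, enriched.shape)
```

The loop over the `T` historical frames is inherently sequential, since each LSTM step depends on the previous hidden state.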
Robust Instance Denoising: Training-time controlled perturbations to instance features teach the model to filter perception noise.
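As a rough illustration of training-time perturbation, the sketch below adds zero-mean Gaussian noise to instance features during training and leaves them untouched at inference. The source does not specify the perturbation scheme, so the Gaussian noise model, the `noise_std` value, and the function name `perturb_features` are assumptions.

```python
import random

def perturb_features(features, noise_std=0.1, training=True, rng=None):
    """Return a copy of `features` (list of feature rows), noisy only in training."""
    if not training or noise_std == 0.0:
        return [row[:] for row in features]  # inference: features pass through
    rng = rng or random.Random()
    # Assumed noise model: independent zero-mean Gaussian noise per element.
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in features]

# Two hypothetical instance feature vectors.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
noisy = perturb_features(features, noise_std=0.1, rng=random.Random(0))
clean = perturb_features(features, training=False)
print(noisy)
print(clean == features)
```

The planner would then be trained on the perturbed features against the unperturbed targets, so that it learns to tolerate perception noise.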
Results

nuScenes (Open-Loop)
| Model | L2 Error (m) | Collision (%) | TPC (m) |
|---|---|---|---|
| UniAD | 1.03 | 0.31 | 0.96 |
| VAD | 0.72 | 0.21 | 0.64 |
| SparseDrive | 0.61 | 0.08 | 0.57 |
| MomAD | 0.60 | 0.09 | 0.54 |
Turning-nuScenes (6-second horizon)
| Metric | Improvement vs SparseDrive |
|---|---|
| Collision rate | -26% |
| TPC | -33.45% (better consistency) |
| L2 error | -25.30% |
Bench2Drive (Closed-Loop)
| Model | Success Rate | Comfortness |
|---|---|---|
| VAD | baseline | baseline |
| SparseDrive | -- | -- |
| MomAD | +16.3% vs VAD / +8.4% vs SparseDrive | +7.2% |
Inference speed: 7.8 FPS on RTX 4090.
Limitations & Open Questions
- The LSTM-based historical query fusion adds a sequential dependency that may limit parallelization and real-time performance at higher frame rates
- The closed-loop results on Bench2Drive are encouraging, but the gap between closed-loop simulation and real-world deployment remains unaddressed
- Whether momentum-aware planning composes well with VLA systems (which have their own temporal reasoning through autoregressive generation) is unexplored
Connections
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving -- VAD baseline; MomAD improves closed-loop success rate by 16.3%
- Planning-oriented Autonomous Driving -- UniAD baseline; MomAD addresses the temporal consistency gap in joint E2E systems
- Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving -- Complementary approach: Senna addresses reasoning, MomAD addresses temporal consistency
- WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model -- Both address trajectory quality verification, from different angles (world model vs momentum)
- CARLA: An Open Urban Driving Simulator -- Bench2Drive closed-loop evaluation environment