GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation

Overview

GoalFlow (Horizon Robotics / HKU, CVPR 2025) introduces a goal-driven flow matching framework for multimodal trajectory generation in autonomous driving. The method achieves 90.3 PDMS on the NAVSIM benchmark -- the highest score reported at the time of publication -- while requiring only a single denoising step at inference (10.4 ms), making it both accurate and fast.

Flow matching is a generative modeling framework that learns to transport samples from a simple prior distribution to the data distribution along straight-line paths, offering computational advantages over diffusion models. GoalFlow extends flow matching by conditioning the generation process on goal points, which provide high-level routing intent and help resolve the multimodality inherent in driving behavior. The combination of goal conditioning with flow matching enables the model to generate diverse, plausible trajectories aligned with navigation intent in a single forward pass.

Key Contributions

Goal-driven flow matching for driving: First application of conditional flow matching to trajectory planning with explicit goal conditioning, combining the efficiency of flow matching with interpretable goal-directed behavior
Single-step inference: Unlike diffusion-based planners that need multiple denoising steps, GoalFlow generates high-quality trajectories in a single step, taking 10.4 ms versus 177.8 ms for multi-step sampling
90.3 PDMS on NAVSIM: Establishes a new state of the art on the NAVSIM benchmark, surpassing DiffusionDrive (88.1) by 2.2 points and earlier baselines such as VAD (80.8) and UniAD (79.2) by wider margins
Goal-conditioned multimodality: A vocabulary of clustered training-data endpoints (4096–8192 points) scored via Distance + DAC dual scoring naturally decomposes the multimodal trajectory distribution into mode-specific subproblems, improving both diversity and accuracy

Architecture / Method

┌─────────────────────────────────────────────────────────────┐
│              GoalFlow: Goal-Driven Flow Matching             │
│                                                             │
│  Multi-Camera Images + LiDAR                                │
│       │                                                     │
│       ▼                                                     │
│  ┌──────────────────────────────┐                           │
│  │  Transfuser-based BEV Encoder│                           │
│  │  (camera + LiDAR fusion)     │ ──► HD map seg, 3D bbox   │
│  └──────────────┬───────────────┘     (aux supervision)     │
│                 │                                           │
│                 │          Goal Point Vocabulary            │
│                 │       (N=4096-8192 clustered GT EPs)      │
│                 │                 │                         │
│                 │         ┌───────┴──────────────┐          │
│                 │         │ Dual Scoring:         │          │
│                 │         │  Distance Score       │          │
│                 │         │  DAC Score (shadow    │          │
│                 │         │  vehicle check)       │          │
│                 │         └───────┬──────────────┘          │
│                 │                 │ best goal g*             │
│                 └────────┬────────┘                         │
│                          ▼                                  │
│  ┌─────────────────────────────────────────┐                │
│  │   Flow Matching Trajectory Generator    │                │
│  │                                         │                │
│  │   x_0 ~ N(0, σ=0.1)  (128-256 samples) │                │
│  │     │                                   │                │
│  │     ▼                                   │                │
│  │   v_θ(x_0, t=0, c)  ◄── BEV features  │                │
│  │     │                     + goal g*     │                │
│  │     ▼  (single Euler step, 10.4ms)      │                │
│  │   x_1 = x_0 + v_θ   ──► trajectory     │                │
│  │                                         │                │
│  │   (shadow trajectory fallback if        │                │
│  │    goal appears unreliable)             │                │
│  └─────────────────────────────────────────┘                │
│                                                             │
│  Key: Single denoising step ──► 10.4ms inference            │
└─────────────────────────────────────────────────────────────┘

GoalFlow consists of three main components:

Perception Module (Transfuser-based): Multi-view camera images and LiDAR data are fused into a Bird's Eye View (BEV) representation using a Transfuser-based architecture. Auxiliary supervision includes HD map segmentation and 3D bounding box detection tasks (both cross-entropy and L1 losses).
Goal Point Construction Module: A vocabulary of N candidate goal points (N=4096–8192) is built by clustering ground-truth trajectory endpoints from training data. Each candidate is scored via a dual scoring mechanism:
- Distance Score: softmax of the negative Euclidean distance from the candidate to the ground-truth endpoint, measuring proximity. The ground truth defines the training target; at inference the network's predicted score is used.
- DAC Score (Drivable Area Compliance): binary evaluation that places a "shadow vehicle" at each candidate and checks whether it stays within the drivable area polygon.
- Final selection uses a weighted aggregation of both scores. A shadow trajectory mechanism defaults to non-guided trajectories when the best goal point appears unreliable.
Flow Matching Trajectory Generator: Given the selected goal, a conditional flow matching model learns the velocity field v_θ(x_t, t, c) that transports a noise sample x_0 ~ N(0, σ=0.1) to the data distribution along straight-line paths, x_t = (1-t)x_0 + t·x_1, where x_1 is a ground-truth trajectory and c denotes the conditioning (goal point and BEV features). The network encodes x_t, the time step, the goal point, and BEV features through Transformer attention layers. During inference, 128–256 trajectory candidates are generated.
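The goal point construction described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, plain k-means is assumed as the clustering method (the source only says "clustering"), the DAC score is passed in as a precomputed 0/1 array rather than derived from a drivable area polygon, and the weighted aggregation is assumed to be a simple weighted sum with placeholder weights.

```python
import numpy as np


def build_goal_vocabulary(endpoints, n_goals, n_iters=20, seed=0):
    """Cluster (M, 2) trajectory endpoints into n_goals centers (plain k-means, assumed)."""
    rng = np.random.default_rng(seed)
    init_idx = rng.choice(len(endpoints), size=n_goals, replace=False)
    centers = endpoints[init_idx].copy()
    for _ in range(n_iters):
        # Assign every endpoint to its nearest center.
        dists = np.linalg.norm(endpoints[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_goals):
            members = endpoints[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers


def distance_score_target(vocabulary, gt_endpoint):
    """Softmax over the negative Euclidean distance to the ground-truth endpoint."""
    neg_dist = -np.linalg.norm(vocabulary - gt_endpoint, axis=-1)
    exp = np.exp(neg_dist - neg_dist.max())  # subtract the max for numerical stability
    return exp / exp.sum()


def select_goal(distance_scores, dac_scores, w_dis=1.0, w_dac=1.0):
    """Weighted sum of both scores (assumed form); returns the best goal index."""
    combined = w_dis * distance_scores + w_dac * dac_scores
    return int(np.argmax(combined))
```

The pairwise distance matrix in `build_goal_vocabulary` has shape (M, n_goals), so a vocabulary of several thousand goals over a large training set would need batching or a library clustering routine in practice.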

Training losses (three components):
- Perception Loss: cross-entropy for HD map segmentation and 3D bbox classification; L1 for locations.
- Goal Loss: cross-entropy supervising both the distance score and the DAC score predictions.
- Planner Loss: L1 minimizing the difference between predicted and ground-truth trajectory shifts.
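To make the planner loss concrete, the sketch below builds one flow matching training example from the straight-line path x_t = (1-t)x_0 + t·x_1 and computes an L1 loss on the shift x_1 - x_0. It is a hedged illustration: the function names are hypothetical, sampling t uniformly from [0, 1] is an assumption not stated in the source, and the noise is taken to be isotropic Gaussian with standard deviation 0.1.

```python
import numpy as np


def sample_training_pair(x1, rng, sigma=0.1):
    """Build one flow matching example from ground-truth trajectories x1 of shape (B, T, 2).

    Returns the interpolated point x_t, the time t, and the regression target,
    which is the constant shift x1 - x0 along the straight-line path.
    """
    x0 = sigma * rng.standard_normal(x1.shape)            # noise sample x_0
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1, 1))   # one time per sample (assumed uniform)
    x_t = (1.0 - t) * x0 + t * x1                         # straight-line interpolation
    target_shift = x1 - x0                                # derivative of x_t with respect to t
    return x_t, t, target_shift


def planner_l1_loss(predicted_shift, target_shift):
    """Mean absolute error between predicted and ground-truth shifts."""
    return float(np.mean(np.abs(predicted_shift - target_shift)))
```

In training, `predicted_shift` would come from the network v_θ evaluated at (x_t, t) with the goal point and BEV features as conditioning; that network is omitted here.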

Inference: GoalFlow evaluates the velocity field at t=0 and takes a single Euler step: x_1 = x_0 + v_θ(x_0, 0, c). Single-step inference runs in 10.4 ms (vs. 177.8 ms for multi-step), with only a 1.6% PDM score drop compared to the optimal multi-step result.
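The single Euler step can be compared with multi-step integration in a few lines. The sketch below is illustrative only: `euler_sample` and `ideal_velocity` are hypothetical names, and `ideal_velocity` is a toy stand-in for the learned v_θ, namely the exact straight-line field toward one fixed target.

```python
import numpy as np


def euler_sample(velocity_fn, x0, cond, n_steps=1):
    """Integrate dx/dt = v(x, t, c) from t=0 to t=1 with n_steps Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t never reaches 1, so the toy field below stays finite
        x = x + dt * velocity_fn(x, t, cond)
    return x


def ideal_velocity(x, t, target):
    """Toy stand-in for v_theta: the exact straight-line field toward one target."""
    return (target - x) / (1.0 - t)
```

With a perfectly straight field, one step and many steps land on the same point, which is the intuition behind single-step inference. A learned field need not be perfectly straight, so single-step and multi-step outputs can differ in practice.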

Trajectory Selection: The highest-scoring candidate (by the dual scoring mechanism) is selected; if the goal point appears unreliable the shadow trajectory mechanism falls back to non-guided generation.

Results

Method	PDMS (NAVSIM)	Steps	FPS
GoalFlow	90.3	1	~60
DiffusionDrive	88.1	2	45
GenAD	83.5	20	~3
VAD	80.8	-	~10
UniAD	79.2	-	~5

90.3 PDMS on NAVSIM, +2.2 over DiffusionDrive and +9.5 over VAD
Single-step inference at 10.4 ms (about 6% of the 177.8 ms multi-step baseline), with only a 1.6% PDM score drop
Ablation progression: M0 base flow matching = 85.6, +Distance Score = 88.5, +DAC Score = 89.4, +Trajectory Scorer = 90.3
Oracle goal points (GoalFlow†) reach 92.1 PDMS, approaching the human driving ceiling of 94.8
Goal vocabulary uses N=4096–8192 clustered endpoints; larger vocabularies and stronger image backbones consistently improve performance
Flow matching outperforms diffusion for the single-step regime -- the straight-line interpolation paths enable better single-step approximation than DDIM/DDPM shortcuts
Multimodal trajectory diversity is maintained through the goal point vocabulary: different selected goals produce distinct trajectory modes

Limitations & Open Questions

Open-loop evaluation on NAVSIM only; closed-loop simulation and real-world testing remain to be demonstrated
The goal proposal module relies on accurate route information; degraded route inputs could significantly impact performance
Single-step flow matching may sacrifice fine-grained trajectory quality compared to multi-step methods in complex scenarios
The method's advantage over DiffusionDrive may partly stem from the goal conditioning rather than flow matching itself; ablations separating these factors would be valuable

Connections

Autonomous Driving -- end-to-end trajectory planning
Planning -- generative trajectory planning with flow matching
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving -- complementary diffusion-based approach on NAVSIM
VAD: Vectorized Scene Representation for Efficient Autonomous Driving -- vectorized planning baseline
Planning-Oriented Autonomous Driving -- UniAD modular E2E baseline
Denoising Diffusion Probabilistic Models -- diffusion models, contrasted with flow matching