ESC

📄 Read on arXiv

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation

Overview

VectorNet (Gao et al., Waymo/Google, CVPR 2020) is a foundational paper that moved motion prediction and map encoding away from rasterized image-based representations toward vectorized, graph-based structures. Prior to VectorNet, the dominant approach for encoding HD maps and agent dynamics was to rasterize everything into multi-channel bird's-eye-view images and process them with CNNs. This was computationally expensive, threw away structural information (a lane boundary is naturally a polyline, not a pixel grid), and scaled poorly with map size.

VectorNet represents all scene elements -- lane boundaries, crosswalks, traffic signals, agent trajectories -- as sets of polylines, where each polyline is a sequence of vectors (directed line segments). A hierarchical graph neural network first encodes local structure within each polyline (a subgraph of vectors) and then models global interactions between polylines (a graph of polyline-level nodes). This unified representation treats maps and agents identically, using the same vector-based encoding for both static geometry and dynamic trajectories.

The paper's influence was substantial: VectorNet shifted the motion prediction community toward vectorized representations, directly inspiring subsequent work including LaneGCN, TNT, DenseTNT, and eventually the VAD planning paper. The key insight -- that structured, sparse representations preserve geometric and topological information while being more efficient than dense rasterization -- became a guiding principle for autonomous driving perception and prediction architectures.

Key Contributions

  • Unified vectorized representation: Represents all scene elements (map lanes, crosswalks, traffic lights, agent trajectories) as sets of polylines composed of directed vectors, providing a unified encoding for both static and dynamic elements
  • Hierarchical graph neural network: Two-level GNN architecture where the subgraph network encodes local structure within each polyline and the global interaction graph models relationships between polylines
  • Elimination of rasterized rendering: Removes the need to render HD maps and agent histories into multi-channel BEV images, avoiding information loss from discretization and reducing computational cost
  • Self-supervised auxiliary task: Uses a node completion pre-training objective (predict a masked polyline node from context) to learn better representations, analogous to masked language modeling in NLP
  • State-of-the-art motion prediction: Achieves top results on the Argoverse motion forecasting benchmark at the time of publication

Architecture

┌──────────────────────────────────────────────────────────┐
                  VectorNet Architecture                    
                                                           
  Scene Elements (unified vector representation):          
  ┌───────────┐ ┌───────────┐ ┌───────────┐               
   Lane Bdry   Crosswalk  Agent Traj    ...          
   v1v2v3    v1v2v3    v1v2v3                 
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘               
                                                        
                                                        
  ┌───────────────────────────────────────────┐            
     Polyline Subgraph Networks (local GNN)              
     3 layers message passing + max-pool                 
                                                         
     v1,v2,v3 ──► p_j (polyline node)                    
  └──────┬────────────┬───────────┬───────────┘            
                                                        
                                                        
       [p_1]        [p_2]       [p_3] ...  [p_N]           
                                                       
         └────────────┴───────────┴──────────┘             
                                                          
                                                          
  ┌───────────────────────────────────────────┐            
     Global Interaction Graph                            
     (fully-connected, multi-head self-attn)             
     Each polyline attends to all others                 
  └──────────────────┬────────────────────────┘            
                                                         
                                                         
  ┌───────────────────────────────┐                        
     MLP Prediction Head                                
     Single future trajectory                           
     (L2 regression loss)                               
  └───────────────────────────────┘                        
└──────────────────────────────────────────────────────────┘

Architecture / Method

Rasterized representation vs VectorNet's vectorized representation

VectorNet hierarchical architecture: polyline subgraphs and global interaction graph

VectorNet processes the driving scene in two stages. Stage 1 -- Polyline Subgraph Encoding: Each scene element (a lane segment, an agent trajectory, a crosswalk boundary) is represented as an ordered sequence of vectors v_i = (d_x, d_y, attributes), where d_x, d_y are the displacement from the previous point and attributes include semantic type, timestamp (for trajectories), and other features. Within each polyline, a local subgraph GNN (3 layers of message passing with max-pool aggregation) encodes the vectors into a single polyline-level feature node p_j. This captures the internal structure of each element (lane curvature, trajectory shape).

Stage 2 -- Global Interaction Graph: The polyline-level nodes {p_j} form a fully-connected global graph. A global interaction network (multi-head self-attention, similar to a transformer) processes these nodes, allowing each polyline to attend to all others. This captures long-range spatial relationships: an agent's trajectory attends to nearby lane boundaries, traffic signals, and other agents' trajectories.

Prediction Head: The output node for the target agent is passed through an MLP to predict a single future trajectory. VectorNet does not include a multimodal decoder with multiple hypotheses or winner-take-all loss — that capability came in later work (TNT, DenseTNT).

Self-supervised Pre-training: A node completion objective randomly masks a polyline node and trains the network to predict its features from the remaining context, similar to BERT's masked language modeling. This pre-training improves downstream prediction performance.

Results

VectorNet predictions vs ground truth trajectories in driving scenarios

Resource VectorNet ResNet-18 Baseline Reduction
Parameters 72K 246K 70%
FLOPs 0.041 GFLOPs 10.56 GFLOPs 99.6%
  • State-of-the-art on Argoverse motion forecasting: Achieves top minADE and minFDE metrics at the time of publication, outperforming CNN-based rasterized approaches
  • Computational efficiency: Over 200x fewer FLOPs than the best rasterized CNN baseline for a single agent (10.56G vs. 0.041G, about 99.6% fewer) while achieving better prediction accuracy, because vectorized representations scale with the number of scene elements rather than spatial resolution
  • Self-supervised pre-training improves prediction: The node completion pre-training objective improves final prediction metrics by 5-8%, demonstrating that the graph structure supports effective self-supervised learning
  • Unified encoding validated: Using the same vector representation for both maps and agents outperforms architectures that encode them with different modules, confirming the value of a unified representation
  • Scalability to large scenes: Performance remains stable as the number of map elements increases, unlike rasterized approaches where computational cost grows with map area at fixed resolution
  • Ablations validate hierarchy: Both the subgraph (local) and global interaction networks are necessary; removing either degrades performance significantly

Limitations & Open Questions

  • The fully-connected global graph has O(N^2) complexity in the number of polylines, which can become expensive in dense urban scenes with hundreds of lane segments and agents
  • The vector representation loses some fine-grained geometric information (road surface texture, elevation changes) that rasterized representations can capture through image channels
  • The prediction head is a simple MLP predicting a single trajectory; multimodal prediction (multiple hypotheses with confidence scores) is not part of this paper and was addressed in subsequent work (TNT, DenseTNT)
  • No explicit incorporation of traffic rules or semantic constraints beyond what the GNN learns implicitly from data

Connections