Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Overview

Driving with LLMs (Wayve, ICRA 2024) is one of the first concrete demonstrations of using a large language model as the decision-making "brain" for autonomous driving. The paper addresses the "black box" problem in E2E driving by fusing object-level vector representations with a frozen LLaMA-7B model, enabling both driving action prediction and natural language explanations of driving decisions.

The system introduces a novel three-component architecture (Vector Encoder, Vector Former, and LLM with LoRA) trained in two stages: vector representation pretraining followed by driving QA finetuning. A key innovation is the data generation pipeline using a PPO-trained RL agent in a custom 2D simulator, with a lanGen function translating vector state into natural language, and GPT-3.5 generating 160k QA pairs for training. The system demonstrates improved perception and action prediction while providing interpretable explanations.

Key Contributions

  • First concrete LLM-for-driving system: Among the earliest papers to demonstrate a working LLM integrated into an autonomous driving pipeline with measurable driving performance
  • Object-level vector modality: Novel approach of representing driving scenes as vectorized object-level features (positions, velocities, types) rather than raw images, enabling efficient LLM integration
  • Two-stage training pipeline: Vector representation pretraining captures driving-relevant features before QA finetuning aligns the model with language-based driving reasoning
  • Scalable data generation: Uses PPO-trained RL agent + GPT-3.5 to generate 160k driving QA pairs, establishing a template for LLM driving data creation
  • Explainable decision-making: The LLM can articulate reasons for its driving decisions in natural language, addressing the black-box problem

Architecture / Method

System overview

Driving with LLMs: Vector-to-Language Pipeline

Driving Scene (from custom 2D simulator)
┌──────────────────────────────────────┐
│ Object Vectors: [pos, vel, heading,  │
│ dims, type] for each agent + road    │
│ elements + ego state                 │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌──────────────────────────────┐
│        Vector Encoder        │
│  (per-object feature embed)  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│        Vector Former         │
│ (Transformer self-attn       │
│  across all object tokens)   │
│ ──► Fixed-size scene embed   │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────────────────┐
│ Frozen LLaMA-7B + LoRA                   │
│ Input: [scene embed] + [text query]      │
│ Output: action tokens + explanation      │
└──────────────────────────────────────────┘

Two-Stage Training:
┌──────────────────┐      ┌──────────────────────────┐
│ Stage 1:         │      │ Stage 2:                 │
│ Vector           │─────►│ Driving QA Finetuning    │
│ Pretraining      │      │ (160k QA pairs from      │
│ (perception)     │      │  PPO agent + GPT-3.5)    │
└──────────────────┘      └──────────────────────────┘

The architecture consists of three main components:

1. Vector Encoder

Encodes the driving scene as a set of object-level vectors:

  • Each traffic participant is represented by a feature vector: position (x, y, z), velocity (vx, vy), heading, dimensions, and object type
  • Road elements (lanes, traffic lights, signs) are similarly vectorized
  • Ego vehicle state (velocity, acceleration, heading) is included
  • Cross-attention layers map the inputs into a latent space; ego vehicle features are added to each learned latent vector to maintain consistent self-awareness
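
As a concrete illustration, here is a minimal PyTorch sketch of such a per-object encoder with learned latents. The dimensions, layer counts, and class name are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class VectorEncoder(nn.Module):
    """Illustrative sketch: learned latents cross-attend to per-object feature
    vectors, and the projected ego state is added to every latent.
    Sizes and layer counts are assumptions, not the paper's exact values."""

    def __init__(self, obj_dim=10, ego_dim=6, d_model=256, n_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.obj_proj = nn.Linear(obj_dim, d_model)
        self.ego_proj = nn.Linear(ego_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, objects, ego):
        # objects: (B, N, obj_dim) -- [pos, vel, heading, dims, type] per object
        # ego:     (B, ego_dim)    -- ego velocity, acceleration, heading
        B = objects.size(0)
        obj_emb = self.obj_proj(objects)                    # (B, N, d_model)
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)   # (B, L, d_model)
        lat, _ = self.cross_attn(lat, obj_emb, obj_emb)     # latents attend to objects
        lat = lat + self.ego_proj(ego).unsqueeze(1)         # ego features added to each latent
        return lat                                          # (B, L, d_model)
```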

2. Vector Former

Architecture details

A transformer module that processes the vectorized scene:

  • Self-attention and cross-attention layers let the latent scene representation and the question tokens interact
  • Generates 64 queries that act as the fixed-size bridge between the numeric vector representation and the LLM's token space
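
A rough sketch of such a block under the same assumptions as above (dimensions and the exact attention wiring are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class VectorFormer(nn.Module):
    """Illustrative sketch: self-attention over the scene latents, then a fixed
    set of 64 learned queries cross-attends to the latents and question tokens,
    yielding a fixed-size scene embedding for the LLM."""

    def __init__(self, d_model=256, n_heads=8, n_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, scene_latents, question_emb):
        # scene_latents: (B, L, d_model) from the Vector Encoder
        # question_emb:  (B, T, d_model) embedded question tokens
        B = scene_latents.size(0)
        lat, _ = self.self_attn(scene_latents, scene_latents, scene_latents)
        ctx = torch.cat([lat, question_emb], dim=1)        # scene + question context
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, 64, d_model)
        q, _ = self.cross_attn(q, ctx, ctx)                # queries attend to context
        return q + self.ffn(q)                             # fixed-size scene embedding
```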

3. Frozen LLaMA-7B with LoRA

The scene embedding from the Vector Former is projected into the LLM's input space:

  • LLaMA-7B is kept frozen; only LoRA adapters are trained
  • Input: scene embedding tokens + text query tokens
  • Output: driving action tokens + natural language explanation
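
One way this wiring could look with Hugging Face transformers and peft; the checkpoint path, LoRA hyperparameters, and 256-dim scene embedding are assumptions, so treat this as a sketch rather than the paper's implementation.

```python
import torch
import torch.nn as nn
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

llm = LlamaForCausalLM.from_pretrained("path/to/llama-7b")   # hypothetical local checkpoint
tok = LlamaTokenizer.from_pretrained("path/to/llama-7b")
hidden = llm.config.hidden_size
embed_tokens = llm.get_input_embeddings()

for p in llm.parameters():                 # base LLaMA-7B stays frozen
    p.requires_grad = False

lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_cfg)        # only the LoRA adapters are trainable

scene_proj = nn.Linear(256, hidden)        # project Vector Former output into LLM space

def driving_qa_forward(scene_queries, question):
    # scene_queries: (1, 64, 256) fixed-size scene embedding from the Vector Former
    scene_emb = scene_proj(scene_queries)                     # (1, 64, hidden)
    ids = tok(question, return_tensors="pt").input_ids        # text query tokens
    text_emb = embed_tokens(ids)                              # (1, T, hidden)
    inputs_embeds = torch.cat([scene_emb, text_emb], dim=1)   # [scene embed] + [text query]
    return llm(inputs_embeds=inputs_embeds)                   # logits for action/explanation tokens
```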

Training Pipeline

Stage 1 -- Vector Pretraining: The Vector Encoder and Vector Former are pretrained on 100k pseudo vector-captioning QA pairs (derived from 100k simulated scenarios) to learn driving-relevant representations. The language model remains frozen during this stage.
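
Continuing the sketches above (llm, vector_encoder, and vector_former refer to the modules defined there), the Stage 1 freezing pattern could look roughly like this; the optimizer and learning rate are assumptions.

```python
import itertools
import torch

# Stage 1: the language model is fully frozen; only the Vector Encoder and
# Vector Former receive gradients from the vector-captioning QA loss.
for p in llm.parameters():
    p.requires_grad = False

stage1_params = itertools.chain(vector_encoder.parameters(), vector_former.parameters())
optimizer = torch.optim.AdamW(stage1_params, lr=1e-4)   # learning rate is an assumption
```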

Stage 2 -- QA Finetuning: The full system (Vector Former + LoRA adapters) is finetuned on 160k driving QA pairs:

Data generation

  • A PPO-trained RL agent drives in the custom 2D simulator
  • The lanGen function translates simulator state vectors into structured natural language descriptions (a toy sketch follows this list)
  • GPT-3.5 generates diverse question-answer pairs about driving decisions, perceptions, and explanations
  • Questions cover: "What should you do?", "Why?", "How many cars ahead?", "What is the speed limit?"
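
The sketch below shows what a lanGen-style translation and the downstream GPT-3.5 prompt might look like at a toy level; the field names, phrasing, and prompt wording are invented for illustration and are not the paper's exact templates.

```python
def lan_gen(ego, objects):
    """Toy lanGen-style translator: turns a structured simulator state into a
    natural-language scene description (field names are illustrative)."""
    cars = [o for o in objects if o["type"] == "car"]
    peds = [o for o in objects if o["type"] == "pedestrian"]
    lines = [
        f"I am driving at {ego['speed_mps']:.1f} m/s in a {ego['speed_limit_mps']:.0f} m/s zone.",
        f"There are {len(cars)} cars and {len(peds)} pedestrians around me.",
    ]
    for o in cars:
        lines.append(
            f"A car is {o['distance_m']:.0f} m {o['direction']} of me, "
            f"moving at {o['speed_mps']:.1f} m/s."
        )
    return " ".join(lines)

# Example usage with a hypothetical scene:
ego = {"speed_mps": 8.2, "speed_limit_mps": 13.0}
objects = [
    {"type": "car", "distance_m": 25.0, "direction": "ahead", "speed_mps": 6.5},
    {"type": "pedestrian", "distance_m": 12.0, "direction": "to the right", "speed_mps": 1.2},
]
description = lan_gen(ego, objects)

# The description is then embedded in a prompt asking GPT-3.5 to produce
# question-answer pairs about perception, actions, and reasoning for this scene.
qa_prompt = (
    "Given this driving scene, write question-answer pairs about what the "
    "driver perceives, what they should do, and why:\n" + description
)
```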

Results

Perception (lower is better):

Metric                            Without Pretraining   With Pretraining
Car count MAE                     0.101                 0.066
Pedestrian count MAE              1.668                 0.313
Token prediction loss (L_token)   0.644                 0.502

Action prediction vs. Perceiver-BC baseline:

Control axis         LLM-Driver   Perceiver-BC
Longitudinal error   0.066        0.180
Lateral error        0.014        0.111

Driving QA scores (0–10 scale):

Grader              Without Pretraining   With Pretraining
GPT-3.5 automated   7.48                  8.39
Human               6.63                  7.71

  • Pretraining substantially improves perception: Car count MAE drops from 0.101 to 0.066 and pedestrian count MAE drops from 1.668 to 0.313 with vector pretraining
  • Strong action prediction: LLM-Driver outperforms Perceiver-BC substantially on longitudinal (0.066 vs 0.180) and lateral (0.014 vs 0.111) control error
  • Traffic light distance regression: Perceiver-BC (D_TL: 0.410) outperforms LLM-Driver (D_TL: 6.624) on this specific regression task, a known weakness of the approach
  • Pretraining improves QA quality: GPT-3.5 grading rises from 7.48 to 8.39 and human grading from 6.63 to 7.71, roughly a one-point gain on the 10-point scale
  • Multi-task capability: A single model handles perception queries, action prediction, and explanation generation

Limitations & Open Questions

  • Simulation only: All experiments are in a custom 2D simulator; real-world transfer is not demonstrated
  • Open-loop evaluation: No closed-loop driving evaluation; the LLM predicts actions but does not drive in a feedback loop
  • Vector input only: Does not process raw camera images, limiting applicability to scenarios where a good perception system is already available
  • Frozen LLM: LLaMA-7B is not fine-tuned end-to-end; full fine-tuning might unlock better integration
  • Latency concerns: LLM inference at 7B parameters may be too slow for real-time driving (not measured in paper)
  • GPT-3.5 data quality: Generated QA pairs may contain errors or inconsistencies that limit training quality
  • Numeric regression weakness: On traffic light distance estimation, Perceiver-BC (D_TL: 0.410) substantially outperforms LLM-Driver (D_TL: 6.624), indicating the LLM struggles with precise numeric regression tasks

Connections