
📄 Read on arXiv

DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States

Overview

DriveMLM proposes using a multimodal LLM as a plug-and-play behavioral planning module within existing autonomous driving stacks (Apollo, Autoware), rather than replacing the entire pipeline end-to-end. The core innovation is aligning high-level MLLM decisions with standardized behavioral planning states that downstream motion planners can execute directly, bridging the gap between language model outputs and vehicle control.

Most driving VLMs aim to be end-to-end replacements for the entire driving stack, but DriveMLM takes a more pragmatic approach. Existing modular AD systems have battle-tested perception and control modules developed over years of engineering. Rather than discarding that investment, DriveMLM inserts an LLM between perception outputs and the motion planner, using it specifically for behavioral planning -- the high-level decision-making about which maneuver to execute. The LLM receives multimodal inputs (camera images, LiDAR point clouds) and produces behavioral planning states plus natural language explanations.

Key Contributions

  • LLM as plug-and-play behavioral planning module: Standardizes behavioral planning states around what an off-the-shelf motion planning module can execute, and trains a multimodal LLM to output those decisions plus explanations, so it interfaces directly with existing AD stacks
  • Compatible with Apollo/Autoware: Designed for integration with industry-standard AD frameworks, not requiring end-to-end replacement
  • Behavioral state abstraction: Defines an intermediate representation bridging LLM output and motion planner input -- speed decisions [KEEP, ACCELERATE, DECELERATE, STOP] and path decisions [FOLLOW, LEFT_CHANGE, RIGHT_CHANGE, LEFT_BORROW, RIGHT_BORROW]
  • Efficient data collection strategy: 280 hours of driving data across 50,000 routes with automatic decision annotation from expert trajectories, avoiding costly manual labeling
  • Strong closed-loop results: Driving score improvements on CARLA demonstrate the viability of the modular LLM integration approach

Architecture / Method

DriveMLM framework overview: multimodal LLM as plug-and-play behavioral planning module

MLLM Planner architecture: multi-modal tokenizer and MLLM decoder components

            DriveMLM: LLM as Plug-and-Play Planner

Multi-Modal Tokenizer
┌──────────────┐   ┌───────────────┐   ┌─────────────────┐
│ Multi-camera │   │ LiDAR point   │   │ System message  │
│ images       │   │ clouds        │   │ (decisions)     │
└──────┬───────┘   └───────┬───────┘   └────────┬────────┘
       │                   │                    │
┌──────▼───────┐   ┌───────▼────────┐           │
│ CLIP ViT +   │   │ SST LiDAR      │           │
│ temporal     │   │ encoder        │           │
│ cross-attn   │   │ (CLIP-aligned) │           │
└──────┬───────┘   └───────┬────────┘           │
       └─────────┬─────────┘                    │
                 ▼                              │
           visual tokens                        │
                 │                              │
┌────────────────▼──────────────────────────────▼────────────┐
│ LLaMA-7B MLLM decoder                                      │
│ input: [visual tokens] + [system message]                  │
└─────────────────────────────┬──────────────────────────────┘
                              │
┌─────────────────────────────▼──────────────────────────────┐
│ Output:                                                    │
│   Speed: KEEP | ACCELERATE | DECELERATE | STOP             │
│   Path:  FOLLOW | LEFT/RIGHT CHANGE | LEFT/RIGHT BORROW    │
│   + natural-language explanation                           │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
               ┌──────────────────────┐
               │ Apollo / Autoware    │──► vehicle control
               │ motion planner       │
               └──────────────────────┘
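Read top to bottom, the diagram is one decision loop per timestep. Below is a minimal schematic sketch of that loop; every name and signature here is a placeholder assumption, not DriveMLM's released API:

```python
from typing import Any, Callable

def planning_step(
    tokenize: Callable[[Any, Any], Any],      # multi-modal tokenizer
    generate: Callable[[str, Any], str],      # MLLM decoder (e.g. LLaMA-7B)
    parse: Callable[[str], tuple[str, str]],  # text -> (speed, path) states
    plan: Callable[[str, str], Any],          # Apollo/Autoware motion planner
    cameras: Any,
    lidar: Any,
    system_msg: str,
) -> tuple[Any, str]:
    """One behavioral-planning tick: the MLLM picks a (speed, path) state
    from the fixed vocabulary, and the stack's existing motion planner
    turns that state into a trajectory. All components are injected."""
    tokens = tokenize(cameras, lidar)    # encode sensors to tokens
    text = generate(system_msg, tokens)  # decision + explanation text
    speed, path = parse(text)            # constrain output to the vocabulary
    return plan(speed, path), text       # trajectory + explanation
```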

DriveMLM's architecture consists of three main components.

Behavioral Planning States Alignment: The framework defines two decision categories: Speed Decisions [KEEP, ACCELERATE, DECELERATE, STOP] and Path Decisions [FOLLOW, LEFT_CHANGE, RIGHT_CHANGE, LEFT_BORROW, RIGHT_BORROW]. These are incorporated into the system message fed to the MLLM planner, ensuring its predictions converge to predefined decisions that the downstream motion planning and control modules can execute directly. At each timestep, the model emits one speed decision and one path decision.
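As a concrete illustration, the closed vocabulary might be encoded as below. The state names come from the paper; the system-message wording and the parse_decision helper are illustrative assumptions:

```python
from enum import Enum

class SpeedDecision(Enum):
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"

class PathDecision(Enum):
    FOLLOW = "FOLLOW"
    LEFT_CHANGE = "LEFT_CHANGE"
    RIGHT_CHANGE = "RIGHT_CHANGE"
    LEFT_BORROW = "LEFT_BORROW"
    RIGHT_BORROW = "RIGHT_BORROW"

# Hypothetical system-message wording; the paper uses a specially
# designed template whose exact text is not reproduced here.
SYSTEM_MSG = (
    "You are the behavioral planner of an autonomous vehicle. "
    f"Choose one speed decision from {[d.value for d in SpeedDecision]} "
    f"and one path decision from {[d.value for d in PathDecision]}, "
    "then explain your choice."
)

def parse_decision(text: str) -> tuple[SpeedDecision, PathDecision]:
    """Map free-form MLLM output back onto the closed vocabulary, so the
    downstream motion planner only ever receives valid states."""
    speed = next(d for d in SpeedDecision if d.value in text)
    path = next(d for d in PathDecision if d.value in text)
    return speed, path
```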

Multi-Modal MLLM Planner: Processes diverse sensor inputs through a multi-modal tokenizer and an MLLM decoder (LLaMA-7B). Temporal multi-view images are processed with a CLIP vision encoder, features are projected to match the text token dimension, and a temporal cross-attention mechanism fuses current and historical image features. LiDAR point clouds are aligned via an image-LiDAR CLIP model: a frozen ViT-L/14 image encoder is paired with a randomly initialized single-stride sparse transformer (SST) LiDAR encoder, trained to maximize cosine similarity between LiDAR and image features. A specially designed system message template guides the model, and training uses the standard next-token-prediction cross-entropy loss.
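A PyTorch-flavored sketch of the two fusion pieces just described -- current-frame tokens attending over history, and a CLIP-style objective aligning the SST LiDAR encoder with the frozen image encoder. Dimensions, head count, and the exact loss form are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCrossAttention(nn.Module):
    """Fuse current-frame image tokens with historical ones: the current
    tokens act as queries over the concatenated past tokens."""
    def __init__(self, dim: int = 4096, heads: int = 8):  # 4096 ~ LLaMA-7B width
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # current: (B, N, D) tokens from this timestep
        # history: (B, T*N, D) tokens from earlier timesteps
        fused, _ = self.attn(query=current, key=history, value=history)
        return current + fused  # residual keeps current-frame information

def lidar_image_alignment_loss(lidar_feat: torch.Tensor,
                               image_feat: torch.Tensor) -> torch.Tensor:
    """CLIP-style alignment: train the SST LiDAR encoder so its features
    match those of the frozen ViT-L/14 image encoder for the same scene.
    Written here as maximizing cosine similarity of paired features."""
    lidar_feat = F.normalize(lidar_feat, dim=-1)  # unit-norm features
    image_feat = F.normalize(image_feat, dim=-1)
    cos = (lidar_feat * image_feat).sum(dim=-1)   # cosine similarity
    return (1.0 - cos).mean()                     # minimize => maximize cos
```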

Efficient Data Collection Strategy: 280 hours of driving data (50,000 routes) covering 30 challenging scenarios across 8 CARLA maps with diverse environmental conditions. Speed and path decision states are automatically inferred from expert driving trajectories using hand-crafted rules, avoiding costly manual annotation. For explanations, GPT-3.5 is used to expand the variety and richness of an initial set of human-annotated explanations.
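The hand-crafted annotation rules are not spelled out in this summary, so the following is a hypothetical sketch of a speed-decision labeler over an expert velocity profile; thresholds, horizon, and dt are illustrative assumptions:

```python
# Hypothetical rule-based labeling of speed decisions from an expert
# trajectory's velocity profile; not the paper's actual rules.
V_STOP = 0.1  # m/s: below this the vehicle is considered stopped
A_KEEP = 0.5  # m/s^2: acceleration band treated as "keep speed"

def label_speed_decision(speeds: list[float], t: int,
                         horizon: int = 10, dt: float = 0.1) -> str:
    """Compare the speed at step t with the speed `horizon` steps later
    in the expert trajectory and bucket the change into a decision."""
    v_now = speeds[t]
    v_future = speeds[min(t + horizon, len(speeds) - 1)]
    accel = (v_future - v_now) / (horizon * dt)
    if v_future < V_STOP:
        return "STOP"
    if accel > A_KEEP:
        return "ACCELERATE"
    if accel < -A_KEEP:
        return "DECELERATE"
    return "KEEP"
```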

Results

DriveMLM zero-shot generalization on nuScenes real-world driving scenes

Method         Driving Score   Route Completion   Decision Accuracy
DriveMLM       76.1            98.1%              75.23%
Apollo         71.4            -                  18.53%
ThinkTwice     70.9            -                  -
Interfuser     68.3            -                  -
LLaVA 1.5      -               -                  22.92%
InstructBLIP   -               -                  17.92%
  • Decision prediction accuracy: 75.23% compared to LLaVA 1.5 (22.92%), InstructBLIP (17.92%), and Apollo (18.53%)
  • Explanation quality: Highest NLP scores including BLEU-4 (40.46), CIDEr (124.91), METEOR (56.54)
  • Closed-loop CARLA Town05 Long: Driving Score 76.1 (highest among all methods, surpassing Apollo at 71.4), Route Completion 98.1%, Miles Per Intervention 0.96
  • Complex scenario handling: System employs intelligent strategies like "borrow lane" maneuvers instead of simply stopping; correctly yields to emergency vehicles demonstrating enhanced common sense reasoning
  • Zero-shot generalization: Evaluations on nuScenes show the model transfers to real-world driving scenes without additional training, reaching a zero-shot decision accuracy of 39.5%
  • Interactive capabilities: Interprets natural language instructions (e.g., "I am in a hurry, please overtake"), accepting or reasonably rejecting user requests based on real-time traffic conditions
  • Practical integration demonstrated with Apollo-style frameworks, confirming plug-and-play compatibility

Limitations & Open Questions

  • Simulator-based validation only -- no real-world deployment evidence, which is the ultimate test for the "LLM in the stack" approach
  • Behavioral state abstraction may lose information compared to end-to-end approaches -- the discrete vocabulary of maneuvers cannot capture the full continuous space of driving behaviors
  • The gap between LLM-generated behavioral states and robust low-level control in safety-critical edge cases remains unaddressed

Connections