VLA and Driving
This queue spans general VLA foundations and driving-specific multimodal action papers. The AutoVLA corpus (18 papers, 2018–2025) provides the most comprehensive coverage of how vision-language models have been applied to autonomous driving.
General VLA / multimodal action foundations
- Gato
- PaLM-E
- RT-1
- RT-2
- RoboCat
- Octo
- OpenVLA
- UniAct
- Dita
- SmolVLA
- pi0 (Physical Intelligence, 2024) -- flow matching VLA on PaliGemma 3B, 7 platforms, 68 tasks. The reference VLA.
- pi0.5 (Physical Intelligence, CoRL 2025) -- hierarchical VLA with open-world generalization, 10-15 min tasks in unseen homes
- pi0.6 (Physical Intelligence, 2025) -- RECAP offline RL for VLA self-improvement, doubled task throughput over imitation
- FAST (Physical Intelligence / UC Berkeley, RSS 2025) -- DCT+BPE action tokenizer, 5x faster VLA training (tokenizer idea sketched after this list)
- OpenVLA-OFT (Stanford, 2025) -- parallel decoding fine-tuning recipe, 76.5% -> 97.1% on LIBERO, 26x speedup
- SpatialVLA (Shanghai AI Lab, 2025) -- Ego3D position encoding for spatial awareness, 1.1M real episodes
- DexVLA (Shanghai Jiao Tong, CoRL 2025) -- 1B diffusion expert for dexterous/bimanual manipulation, 0.92 success rate on shirt folding
- Knowledge Insulation (Physical Intelligence, NeurIPS 2025 Spotlight) -- prevents VLM degradation during VLA training, 7.5x faster convergence
- VoxPoser (Huang et al., CoRL 2023) -- LLM-generated 3D value maps for zero-shot manipulation via code composition + MPC, no robot-specific training
Driving-specific language and action papers
Wave 1: Foundations (2018–2019)
- Conditional Imitation Learning (Codevilla et al., 2018) — route-conditioned E2E driving, branched architecture
- Textual Explanations for Self-Driving (Kim et al., 2018) — BDD-X dataset, attention-aligned explanations
- Talk2Car (Deruyttere et al., 2019) — natural language command grounding on nuScenes
Wave 2: LLM-as-Planner (2023–2024)
- GPT-Driver (Mao et al., 2023) — planning as language modeling via GPT-3.5 (mechanism sketched after this list)
- Agent-Driver (Mao et al., ICLR 2024) — LLM as cognitive agent with tool library, cognitive memory, and chain-of-thought reasoning for driving
- DriveGPT4 (Xu et al., 2024) — multimodal instruction tuning for joint control + explanation
- LMDrive (Shao et al., 2024) — first closed-loop language-conditioned driving
- VLP (Pan et al., 2024) — LM semantic priors in BEV planning
- DriveLM (Sima et al., 2024) — Graph VQA decomposing perception→prediction→planning
- Reason2Drive (Nie et al., 2024) — large-scale video-text reasoning chains
- DriveMLM (Wang et al., 2023) — plug-and-play LLM for behavioral planning
- Drive as You Speak (Cui et al., 2023) — LLM as bidirectional human-vehicle interaction interface, not planner
- Talk2Drive (Cui et al., IEEE ITSC 2024) — LLM-based personalized driving via memory module, real-world deployment, 65.2% takeover reduction
- Driving with LLMs (Wayve, 2023) — fuses an object-level vector modality into an LLM for explainable AD; one of the first concrete LLM-for-driving systems
- Senna (Jiang et al., 2024) — decoupled LVLM reasoning + E2E trajectory prediction
- DriveVLM
- VAD
- VADv2 (Chen et al., 2024) — probabilistic planning via action vocabulary, LLM-inspired next-action prediction, CARLA SOTA
- GenAD (2024) — E2E driving as generative trajectory modeling, 0.91m L2, 0.43% collision
- PARA-Drive (NVIDIA, 2024) — fully parallel E2E architecture, systematic design space exploration
- DriveDreamer (2023) — first real-world-driven world model for driving, diffusion-based video generation
- Is Ego Status All You Need? (NVIDIA/Nanjing, 2023) — exposes weakness of open-loop nuScenes evaluation
Wave 3: Reasoning-to-Action (2025)
- SimLingo (Renz et al., 2025) — vision-only closed-loop VLA with Action Dreaming
- ORION (Fu et al., 2025) — holistic reasoning→planning via QT-Former + planning token
- EMMA (Hwang et al., 2025) — Waymo industry-scale "everything as language" model
- Alpamayo-R1 (Wang et al., 2025) — NVIDIA production VLA, 99ms latency, real road testing
- WoTE (Li et al., 2025) — BEV world model for online trajectory evaluation
- AlphaDrive (Jiang et al., 2025) — GRPO-based RL for driving VLMs (DeepSeek R1-style)
- DriveMoE (Yang et al., 2025) — Mixture-of-Experts for scene/skill specialization
- AutoVLA (UCLA, 2025) — dual-process adaptive reasoning VLA with RL fine-tuning
- DriveTransformer (ICLR 2025) — unified parallel-task transformer, sparse queries, SOTA Bench2Drive
- OpenDriveVLA (2025) — open-source VLA with 3D spatial-aware hierarchical scene queries (0.5B-7B)
- DiMA (2025) — distill MLLM reasoning into efficient vision planner, discard LLM at inference
- MomAD (2025) — momentum-aware planning for temporal consistency in E2E driving
- HERMES (2025) — unified world model for simultaneous 3D scene understanding and generation
- GaussianWorld (2024) — Gaussian world model for streaming 3D occupancy prediction
- DiffusionDrive (HUST/Horizon, 2025) — truncated diffusion for E2E planning, 88.1 PDMS, 2 steps, 45 FPS
- DriveGPT (Waymo, 2025) — first scaling laws for driving, 1.1B autoregressive behavior model
- GoalFlow (Horizon/HKU, 2025) — goal-driven flow matching, 90.3 PDMS, single-step inference (flow matching sketched after this list)
- LAW (CASIA, 2025) — self-supervised latent world model for E2E driving, SOTA nuScenes+NAVSIM+CARLA
- CarPlanner (ZJU, 2025) — first RL planner to beat IL+rules on nuPlan, consistent autoregressive trajectory generation
- SOLVE (HUST, 2025) — Sequential Q-Former + Trajectory CoT, VLM-E2E synergy
Key design axes (from AutoVLA analysis)
| Axis | Options seen in corpus |
|---|---|
| Language role | supervision / runtime control / explanation / all three |
| Action space | waypoints / controls / planner tokens / language tokens |
| Evaluation | open-loop only / closed-loop sim / real-world |
| Architecture | VLM + planner hook / true VLA / decoupled reasoning + E2E |
| Training | IL only / IL + RL / GRPO / multi-stage |
Questions to answer while ingesting
- Is language used for supervision, runtime control, explanation, or all three?
- What is the action space?
- Does the paper improve actual planning, or mainly interpretation and interface quality?
- Is the system a VLM with planner hooks, or a true VLA model?
- Open-loop or closed-loop evaluation?
- Does it handle long-tail / adversarial scenarios?
Warning
This area is recent and terminology is unstable. The wiki should be stricter than the papers are about the difference between vision-language reasoning and action generation.
Ingested papers
Batch 01 (general VLA + early driving)
- A Generalist Agent
- Rt 1 Robotics Transformer For Real World Control At Scale
- Palm E An Embodied Multimodal Language Model
- Rt 2 Vision Language Action Models Transfer Web Knowledge To Robotic Control
- Openvla An Open Source Vision Language Action Model
- Drivelm Driving With Graph Visual Question Answering
- Lmdrive Closed Loop End To End Driving With Large Language Models
- Drivevlm The Convergence Of Autonomous Driving And Large Vision Language Models
Batch 02 (AutoVLA corpus)
- End To End Driving Via Conditional Imitation Learning
- Textual Explanations For Self Driving
- Talk2Car
- Gpt Driver
- Drivegpt4
- Vlp Vision Language Planning
- Reason2Drive
- Simlingo
- Orion
- Emma
- Drivemlm
- Alpamayo R1
- Senna
- Wote Bev World Model
- Alphadrive
- Drivemoe
- Drivor Driving On Registers
Batch 03 (robotics VLA + world models + driving)
- Groot N1 An Open Foundation Model For Generalist Humanoid Robots
- Gemini Robotics Bringing Ai Into The Physical World
- Cosmos World Foundation Model Platform For Physical Ai
- Autovla A Vision Language Action Model For End To End Autonomous Driving
- Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving
Batch 04 (self-supervised driving, temporal E2E, BEV perception, world models, embodied RL)
- S4 Driver Scalable Self Supervised Driving Mllm With Spatio Temporal Visual Representation
- Bridgead Bridging Past And Future End To End Autonomous Driving With Historical Prediction
- Self Improving Embodied Foundation Models
- Gaussianlss Toward Real World Bev Perception With Depth Uncertainty Via Gaussian Splatting
- Drive Occworld Driving In The Occupancy World
Batch 05 (VLA, world models, momentum planning, distillation)
- Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model
- Hermes A Unified Self Driving World Model For Simultaneous 3D Scene Understanding And Generation
- Momad Momentum Aware Planning In End To End Autonomous Driving
- Gaussianworld Gaussian World Model For Streaming 3D Occupancy Prediction
- Dima Distilling Multi Modal Large Language Models For Autonomous Driving
Batch 06 (cross-embodiment robotics VLA + 3D occupancy perception)
- Uniact Universal Actions For Enhanced Embodied Foundation Models
- Dita Scaling Diffusion Transformer For Generalist Vla Policy
- Embodiment Scaling Laws In Robot Locomotion
- Smolvla A Vision Language Action Model For Affordable Robotics
- Gaussianformer 2 Probabilistic Gaussian Superposition For Efficient 3D Occupancy Prediction
- Occmamba Semantic Occupancy Prediction With State Space Models
- Gausstr Foundation Model Aligned Gaussian Transformer For Self Supervised 3D
- Bevdiffuser Plug And Play Diffusion Model For Bev Denoising
Batch 07 (diffusion/flow planning, scaling laws, RL planning, VLM-E2E synergy, robotics VLA/diffusion)
- Diffusiondrive Truncated Diffusion Model For End To End Autonomous Driving
- Drivegpt Scaling Autoregressive Behavior Models For Driving
- Goalflow Goal Driven Flow Matching For Multimodal Trajectory Generation
- Law Enhancing End To End Autonomous Driving With Latent World Model
- Carplanner Consistent Autoregressive Rl Planner For Autonomous Driving
- Solve Synergy Of Language Vision And End To End Networks For Autonomous Driving
- Ecot Embodied Chain Of Thought Reasoning For Vision Language Action Models
- Rdt 1B A Diffusion Foundation Model For Bimanual Manipulation
Batch 08 (Physical Intelligence VLA family + robotics VLA advances)
- Pi0 A Vision Language Action Flow Model For General Robot Control
- Pi05 A Vision Language Action Model With Open World Generalization
- Pi06 A Vla That Learns From Experience
- Fast Efficient Action Tokenization For Vision Language Action Models
- Openvla Oft Optimizing Speed And Success For Vla Fine Tuning
- Spatialvla Exploring Spatial Representations For Vla Models
- Dexvla Vision Language Model With Plug In Diffusion Expert
- Knowledge Insulating Vision Language Action Models
Batch 09 (world models, parallel E2E, generative driving, evaluation, LLM-for-driving)
- Drivedreamer Towards Real World Driven World Models
- Para Drive Parallelized Architecture For Real Time Autonomous Driving
- Genad Generative End To End Autonomous Driving
- Is Ego Status All You Need For Open Loop End To End Autonomous Driving
- Driving With Llms Fusing Object Level Vector Modality For Explainable Autonomous Driving
- Drive As You Speak Enabling Human Like Interaction With Large Language Models In Autonomous Vehicles
Batch 10 (orchestration, cross-embodiment, async planning, Gaussian representations, occupancy world models)
- Autort Embodied Foundation Models For Large Scale Orchestration Of Robotic Agents
- Hpt Scaling Proprioceptive Visual Learning With Heterogeneous Pre Trained Transformers
- Asyncdriver Asynchronous Large Language Model Enhanced Planner For Autonomous Driving
- Gaussianformer Scene As Gaussians For Vision Based 3D Semantic Occupancy Prediction
- Driving Gaussian Composite Gaussian Splatting For Surrounding Dynamic Driving Scenes
- Occworld Learning A 3D Occupancy World Model For Autonomous Driving