VLA and Driving

This queue spans general VLA foundations and driving-specific multimodal action papers. The AutoVLA corpus (18 papers, 2018–2025) provides the most comprehensive coverage of how vision-language models have been applied to autonomous driving.

General VLA / multimodal action foundations

  • Gato
  • PaLM-E
  • RT-1
  • RT-2
  • RoboCat
  • Octo
  • OpenVLA
  • UniAct
  • Dita
  • SmolVLA
  • pi0 (Physical Intelligence, 2024) -- flow matching VLA on PaliGemma 3B, 7 platforms, 68 tasks. The reference VLA.
  • pi0.5 (Physical Intelligence, CoRL 2025) -- hierarchical VLA with open-world generalization, 10-15 min tasks in unseen homes
  • pi0.6 (Physical Intelligence, 2025) -- RECAP offline RL for VLA self-improvement, doubled task throughput over imitation
  • FAST (Physical Intelligence / UC Berkeley, RSS 2025) -- DCT+BPE action tokenizer, 5x faster VLA training (see the tokenizer sketch after this list)
  • OpenVLA-OFT (Stanford, 2025) -- parallel decoding fine-tuning recipe, 76.5% -> 97.1% on LIBERO, 26x speedup
  • SpatialVLA (Shanghai AI Lab, 2025) -- Ego3D position encoding for spatial awareness, 1.1M real episodes
  • DexVLA (Shanghai Jiao Tong, CoRL 2025) -- 1B diffusion expert for dexterous/bimanual manipulation, 0.92 shirt folding
  • Knowledge Insulation (Physical Intelligence, NeurIPS 2025 Spotlight) -- prevents VLM degradation during VLA training, 7.5x faster convergence
  • VoxPoser (Stanford, CoRL 2023) -- LLM-generated 3D value maps for zero-shot manipulation via code composition + MPC, no robot-specific training
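
A minimal sketch of the DCT-based compression behind a FAST-style action tokenizer, referenced in the FAST entry above. This is an illustrative reconstruction, not the released tokenizer: the function name, the scale factor, and the per-chunk scalar rounding are assumptions, and the BPE stage that FAST trains over the resulting integer stream is omitted.

```python
import numpy as np
from scipy.fft import dct

def fast_style_tokens(action_chunk: np.ndarray, scale: float = 10.0) -> list[int]:
    """Compress a (T, D) chunk of continuous actions into a short integer sequence.

    FAST-style idea: per-dimension DCT, then coarse scalar quantization. A BPE
    vocabulary trained over these integer sequences (not shown) merges frequent
    patterns into the final action tokens fed to the VLA backbone.
    """
    coeffs = dct(action_chunk, axis=0, norm="ortho")   # low frequencies carry most of the motion
    quantized = np.round(coeffs * scale).astype(int)   # lossy quantization; many coefficients hit 0
    return quantized.flatten().tolist()                # 1-D integer stream for this chunk

# Example: a 50-step, 7-DoF chunk becomes 350 small integers, mostly zeros,
# which compress well under BPE.
tokens = fast_style_tokens(np.random.randn(50, 7) * 0.1)
```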

Driving-specific language and action papers

Wave 1: Foundations (2018–2019)

  • Conditional Imitation Learning (Codevilla et al., 2018) — command-conditioned E2E driving, branched architecture
  • Textual Explanations for Self-Driving (Kim et al., 2018) — BDD-X dataset, attention-aligned explanations
  • Talk2Car (Deruyttere et al., 2019) — natural language command grounding on nuScenes

Wave 2: LLM-as-Planner (2023–2024)

  • GPT-Driver (Mao et al., 2023) — planning as language modeling via GPT-3.5
  • Agent-Driver (Mao et al., ICLR 2024) — LLM as cognitive agent with tool library, cognitive memory, and chain-of-thought reasoning for driving
  • DriveGPT4 (Xu et al., 2024) — multimodal instruction tuning for joint control + explanation
  • LMDrive (Shao et al., 2024) — first closed-loop language-conditioned driving
  • VLP (Pan et al., 2024) — LM semantic priors in BEV planning
  • DriveLM (Sima et al., 2024) — Graph VQA decomposing perception→prediction→planning
  • Reason2Drive (Nie et al., 2024) — large-scale video-text reasoning chains
  • DriveMLM (Wang et al., 2023) — plug-and-play LLM for behavioral planning
  • Drive as You Speak (Cui et al., 2023) — LLM as bidirectional human-vehicle interaction interface, not planner
  • Talk2Drive (Cui et al., IEEE ITSC 2024) — LLM-based personalized driving via memory module, real-world deployment, 65.2% takeover reduction
  • Driving with LLMs (Wayve, 2023) — first concrete LLM-for-driving system, fuses object-level vector modality into the LLM, explainable AD
  • Senna (Jiang et al., 2024) — decoupled LVLM reasoning + E2E trajectory prediction
  • DriveVLM
  • VAD
  • VADv2 (Chen et al., 2024) — probabilistic planning via action vocabulary, LLM-inspired next-action prediction, CARLA SOTA (see the vocabulary-scoring sketch after this list)
  • GenAD (2024) — E2E driving as generative trajectory modeling, 0.91m L2, 0.43% collision
  • PARA-Drive (NVIDIA, 2024) — fully parallel E2E architecture, systematic design space exploration
  • DriveDreamer (2023) — first real-world-driven world model for driving, diffusion-based video generation
  • Is Ego Status All You Need? (NVIDIA/Nanjing, 2023) — exposes weakness of open-loop nuScenes evaluation
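
To make the "action vocabulary" idea in the VADv2 entry concrete, below is a hedged sketch of a vocabulary-scoring planning head: instead of regressing a single trajectory, the planner scores a fixed set of candidate trajectories and outputs a categorical distribution, mirroring next-token prediction. The class name, dimensions, and dot-product scorer are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrajectoryVocabHead(nn.Module):
    """Score a fixed vocabulary of K candidate trajectories against a scene feature."""

    def __init__(self, feat_dim: int, vocab: torch.Tensor):
        super().__init__()                       # vocab: (K, T, 2) precomputed trajectory anchors
        self.register_buffer("vocab", vocab)
        self.traj_enc = nn.Linear(vocab.shape[1] * 2, feat_dim)
        self.scene_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, scene_feat: torch.Tensor) -> torch.Tensor:
        traj_feat = self.traj_enc(self.vocab.flatten(1))     # (K, feat_dim)
        logits = self.scene_proj(scene_feat) @ traj_feat.T   # (B, K) similarity scores
        return torch.softmax(logits, dim=-1)                 # distribution over candidate plans

# Usage: take the most probable candidate, or sample for diverse closed-loop rollouts.
vocab = torch.randn(4096, 6, 2)        # 4096 anchors, 6 future waypoints each
head = TrajectoryVocabHead(256, vocab)
probs = head(torch.randn(2, 256))      # (2, 4096)
plan = vocab[probs.argmax(dim=-1)]     # (2, 6, 2) selected trajectories
```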

Wave 3: Reasoning-to-Action (2025)

  • SimLingo (Renz et al., 2025) — vision-only closed-loop VLA with Action Dreaming
  • ORION (Fu et al., 2025) — holistic reasoning→planning via QT-Former + planning token
  • EMMA (Hwang et al., 2025) — Waymo industry-scale "everything as language" model
  • Alpamayo-R1 (Wang et al., 2025) — NVIDIA production VLA, 99ms latency, real road testing
  • WoTE (Li et al., 2025) — BEV world model for online trajectory evaluation
  • AlphaDrive (Jiang et al., 2025) — GRPO-based RL for driving VLMs (DeepSeek R1-style; see the advantage sketch after this list)
  • DriveMoE (Yang et al., 2025) — Mixture-of-Experts for scene/skill specialization
  • AutoVLA (UCLA, 2025) — dual-process adaptive reasoning VLA with RL fine-tuning
  • DriveTransformer (ICLR 2025) — unified parallel-task transformer, sparse queries, SOTA Bench2Drive
  • OpenDriveVLA (2025) — open-source VLA with 3D spatial-aware hierarchical scene queries (0.5B-7B)
  • DiMA (2025) — distill MLLM reasoning into efficient vision planner, discard LLM at inference
  • MomAD (2025) — momentum-aware planning for temporal consistency in E2E driving
  • HERMES (2025) — unified world model for simultaneous 3D scene understanding and generation
  • GaussianWorld (2024) — Gaussian world model for streaming 3D occupancy prediction
  • DiffusionDrive (HUST/Horizon, 2025) — truncated diffusion for E2E planning, 88.1 PDMS, 2 steps, 45 FPS
  • DriveGPT (Waymo, 2025) — first scaling laws for driving, 1.1B autoregressive behavior model
  • GoalFlow (Horizon/HKU, 2025) — goal-driven flow matching, 90.3 PDMS, single-step inference
  • LAW (CASIA, 2025) — self-supervised latent world model for E2E driving, SOTA nuScenes+NAVSIM+CARLA
  • CarPlanner (ZJU, 2025) — first RL planner to beat IL+rules on nuPlan, consistency-regularized AR
  • SOLVE (HUST, 2025) — Sequential Q-Former + Trajectory CoT, VLM-E2E synergy
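
Since GRPO is the least self-explanatory term above (AlphaDrive entry), here is a minimal sketch of the group-relative advantage at its core: sample several candidate plans for the same scene, score them with a reward, and normalize within the group instead of training a critic. The function name and the example rewards are placeholders; AlphaDrive's actual reward design and policy update are not reproduced here.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages (DeepSeek-R1-style GRPO).

    For G rollouts answering the same prompt/scene, each sample is judged only
    relative to its siblings, so no value network is needed.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# Example: 8 candidate plans for one driving scene, scored by a rule-based reward
# (placeholder values). Better-than-average plans get positive advantage.
rewards = torch.tensor([0.9, 0.2, 0.7, 0.1, 0.8, 0.3, 0.6, 0.4])
adv = grpo_advantages(rewards)   # weights the policy-gradient update for each rollout's tokens
```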

Key design axes (from AutoVLA analysis)

Axis           | Options seen in corpus
---------------|----------------------------------------------------------
Language role  | supervision / runtime control / explanation / all three
Action space   | waypoints / controls / planner tokens / language tokens
Evaluation     | open-loop only / closed-loop sim / real-world
Architecture   | VLM + planner hook / true VLA / decoupled reasoning + E2E
Training       | IL only / IL + RL / GRPO / multi-stage
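
One way to make these axes actionable while ingesting is a per-paper tagging schema. The sketch below is a suggested structure only: field names, allowed values, and the placeholder entry are assumptions, not ground truth pulled from any paper.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PaperTags:
    """Per-paper tags mirroring the design axes above."""
    language_role: set[str]    # any subset of {"supervision", "runtime_control", "explanation"}
    action_space: Literal["waypoints", "controls", "planner_tokens", "language_tokens"]
    evaluation: Literal["open_loop", "closed_loop_sim", "real_world"]
    architecture: Literal["vlm_plus_planner_hook", "true_vla", "decoupled_reasoning_e2e"]
    training: tuple[str, ...]  # e.g. ("IL",), ("IL", "RL"), ("IL", "GRPO"), ("multi_stage",)

# Placeholder entry showing the intended shape; fill in real values per paper during ingestion.
example_paper = PaperTags(
    language_role={"supervision", "explanation"},
    action_space="waypoints",
    evaluation="closed_loop_sim",
    architecture="true_vla",
    training=("IL", "RL"),
)
```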

Questions to answer while ingesting

  • Is language used for supervision, runtime control, explanation, or all three?
  • What is the action space?
  • Does the paper improve actual planning, or mainly interpretation and interface quality?
  • Is the system a VLM with planner hooks, or a true VLA model?
  • Open-loop or closed-loop evaluation?
  • Does it handle long-tail / adversarial scenarios?

Warning

This area is recent and terminology is unstable. The wiki should be stricter than the papers are about the difference between vision-language reasoning and action generation.

Ingested papers

Batch 01 (general VLA + early driving)

Batch 02 (AutoVLA corpus)

Batch 03 (robotics VLA + world models + driving)

Batch 04 (self-supervised driving, temporal E2E, BEV perception, world models, embodied RL)

Batch 05 (VLA, world models, momentum planning, distillation)

Batch 06 (cross-embodiment robotics VLA + 3D occupancy perception)

Batch 06 (diffusion/flow planning, scaling laws, RL planning, VLM-E2E synergy, robotics VLA/diffusion)

Batch 07 (Physical Intelligence VLA family + robotics VLA advances)

Batch 08 (world models, parallel E2E, generative driving, evaluation, LLM-for-driving)

Batch 09 (orchestration, cross-embodiment, async planning, Gaussian representations, occupancy world models)