VLA and Driving

This queue spans general VLA foundations and driving-specific multimodal action papers. The AutoVLA corpus (18 papers, 2018–2025) provides the most comprehensive coverage of how vision-language models have been applied to autonomous driving.

General VLA / multimodal action foundations

  • Gato
  • PaLM-E
  • RT-1
  • RT-2
  • RoboCat
  • Octo
  • OpenVLA
  • UniAct
  • Dita
  • SmolVLA
  • pi0 (Physical Intelligence, 2024) -- flow matching VLA on PaliGemma 3B, 7 platforms, 68 tasks. The reference VLA.
  • pi0.5 (Physical Intelligence, CoRL 2025) -- hierarchical VLA with open-world generalization, 10-15 min tasks in unseen homes
  • pi0.6 (Physical Intelligence, 2025) -- RECAP offline RL for VLA self-improvement, doubled task throughput over imitation
  • FAST (Physical Intelligence / UC Berkeley, RSS 2025) -- DCT+BPE action tokenizer, 5x faster VLA training (see the tokenizer sketch after this list)
  • OpenVLA-OFT (Stanford, 2025) -- parallel decoding fine-tuning recipe, 76.5% -> 97.1% on LIBERO, 26x speedup
  • SpatialVLA (Shanghai AI Lab, 2025) -- Ego3D position encoding for spatial awareness, 1.1M real episodes
  • DexVLA (Shanghai Jiao Tong, CoRL 2025) -- 1B diffusion expert for dexterous/bimanual manipulation, 0.92 shirt folding
  • Knowledge Insulation (Physical Intelligence, NeurIPS 2025 Spotlight) -- prevents VLM degradation during VLA training, 7.5x faster convergence
  • VoxPoser (Stanford, CoRL 2023) -- LLM-generated 3D value maps for zero-shot manipulation via code composition + MPC, no robot-specific training
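
A minimal sketch of the DCT-based compression behind a FAST-style action tokenizer, referenced in the FAST entry above. This is an illustrative reconstruction, not the released tokenizer: the function name, the scale factor, and the per-chunk scalar rounding are assumptions, and the BPE stage that FAST trains over the resulting integer stream is omitted.

```python
import numpy as np
from scipy.fft import dct

def fast_style_tokens(action_chunk: np.ndarray, scale: float = 10.0) -> list[int]:
    """Compress a (T, D) chunk of continuous actions into a short integer sequence.

    FAST-style idea: per-dimension DCT, then coarse scalar quantization. A BPE
    vocabulary trained over these integer sequences (not shown) merges frequent
    patterns into the final action tokens fed to the VLA backbone.
    """
    coeffs = dct(action_chunk, axis=0, norm="ortho")   # low frequencies carry most of the motion
    quantized = np.round(coeffs * scale).astype(int)   # lossy quantization; many coefficients hit 0
    return quantized.flatten().tolist()                # 1-D integer stream for this chunk

# Example: a 50-step, 7-DoF chunk becomes 350 small integers, mostly zeros,
# which compress well under BPE.
tokens = fast_style_tokens(np.random.randn(50, 7) * 0.1)
```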

Driving-specific language and action papers

Wave 1: Foundations (2018–2019)

  • Conditional Imitation Learning (Codevilla et al., 2018) — command-conditioned E2E driving, branched architecture
  • Textual Explanations for Self-Driving (Kim et al., 2018) — BDD-X dataset, attention-aligned explanations
  • Talk2Car (Deruyttere et al., 2019) — natural language command grounding on nuScenes

Wave 2: LLM-as-Planner (2023–2024)

  • GPT-Driver (Mao et al., 2023) — planning as language modeling via GPT-3.5
  • Agent-Driver (Mao et al., ICLR 2024) — LLM as cognitive agent with tool library, cognitive memory, and chain-of-thought reasoning for driving
  • DriveGPT4 (Xu et al., 2024) — multimodal instruction tuning for joint control + explanation
  • LMDrive (Shao et al., 2024) — first closed-loop language-conditioned driving
  • VLP (Pan et al., 2024) — LM semantic priors in BEV planning
  • DriveLM (Sima et al., 2024) — Graph VQA decomposing perception→prediction→planning
  • Reason2Drive (Nie et al., 2024) — large-scale video-text reasoning chains
  • DriveMLM (Wang et al., 2023) — plug-and-play LLM for behavioral planning
  • Drive as You Speak (Cui et al., 2023) — LLM as bidirectional human-vehicle interaction interface, not planner
  • Talk2Drive (Cui et al., IEEE ITSC 2024) — LLM-based personalized driving via memory module, real-world deployment, 65.2% takeover reduction
  • Driving with LLMs (Wayve, 2023) — first concrete LLM-for-driving system, fuses object-level vector modality into the LLM, explainable AD
  • Senna (Jiang et al., 2024) — decoupled LVLM reasoning + E2E trajectory prediction
  • DriveVLM
  • VAD
  • VADv2 (Chen et al., 2024) — probabilistic planning via action vocabulary, LLM-inspired next-action prediction, CARLA SOTA (see the vocabulary-scoring sketch after this list)
  • GenAD (2024) — E2E driving as generative trajectory modeling, 0.91m L2, 0.43% collision
  • PARA-Drive (NVIDIA, 2024) — fully parallel E2E architecture, systematic design space exploration
  • DriveDreamer (2023) — first real-world-driven world model for driving, diffusion-based video generation
  • Is Ego Status All You Need? (NVIDIA/Nanjing, 2023) — exposes weakness of open-loop nuScenes evaluation
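
To make the "action vocabulary" idea in the VADv2 entry concrete, below is a hedged sketch of a vocabulary-scoring planning head: instead of regressing a single trajectory, the planner scores a fixed set of candidate trajectories and outputs a categorical distribution, mirroring next-token prediction. The class name, dimensions, and dot-product scorer are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrajectoryVocabHead(nn.Module):
    """Score a fixed vocabulary of K candidate trajectories against a scene feature."""

    def __init__(self, feat_dim: int, vocab: torch.Tensor):
        super().__init__()                       # vocab: (K, T, 2) precomputed trajectory anchors
        self.register_buffer("vocab", vocab)
        self.traj_enc = nn.Linear(vocab.shape[1] * 2, feat_dim)
        self.scene_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, scene_feat: torch.Tensor) -> torch.Tensor:
        traj_feat = self.traj_enc(self.vocab.flatten(1))     # (K, feat_dim)
        logits = self.scene_proj(scene_feat) @ traj_feat.T   # (B, K) similarity scores
        return torch.softmax(logits, dim=-1)                 # distribution over candidate plans

# Usage: take the most probable candidate, or sample for diverse closed-loop rollouts.
vocab = torch.randn(4096, 6, 2)        # 4096 anchors, 6 future waypoints each
head = TrajectoryVocabHead(256, vocab)
probs = head(torch.randn(2, 256))      # (2, 4096)
plan = vocab[probs.argmax(dim=-1)]     # (2, 6, 2) selected trajectories
```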

Wave 3: Reasoning-to-Action (2025)

  • SimLingo (Renz et al., 2025) — vision-only closed-loop VLA with Action Dreaming
  • ORION (Fu et al., 2025) — holistic reasoning→planning via QT-Former + planning token
  • EMMA (Hwang et al., 2025) — Waymo industry-scale "everything as language" model
  • Alpamayo-R1 (Wang et al., 2025) — NVIDIA production VLA, 99ms latency, real road testing
  • WoTE (Li et al., 2025) — BEV world model for online trajectory evaluation
  • AlphaDrive (Jiang et al., 2025) — GRPO-based RL for driving VLMs (DeepSeek R1-style; see the advantage sketch after this list)
  • DriveMoE (Yang et al., 2025) — Mixture-of-Experts for scene/skill specialization
  • AutoVLA (UCLA, 2025) — dual-process adaptive reasoning VLA with RL fine-tuning
  • DriveTransformer (ICLR 2025) — unified parallel-task transformer, sparse queries, SOTA Bench2Drive
  • OpenDriveVLA (2025) — open-source VLA with 3D spatial-aware hierarchical scene queries (0.5B-7B)
  • DiMA (2025) — distill MLLM reasoning into efficient vision planner, discard LLM at inference
  • MomAD (2025) — momentum-aware planning for temporal consistency in E2E driving
  • HERMES (2025) — unified world model for simultaneous 3D scene understanding and generation
  • GaussianWorld (2024) — Gaussian world model for streaming 3D occupancy prediction
  • DiffusionDrive (HUST/Horizon, 2025) — truncated diffusion for E2E planning, 88.1 PDMS, 2 steps, 45 FPS
  • DriveGPT (Waymo, 2025) — first scaling laws for driving, 1.1B autoregressive behavior model
  • GoalFlow (Horizon/HKU, 2025) — goal-driven flow matching, 90.3 PDMS, single-step inference
  • LAW (CASIA, 2025) — self-supervised latent world model for E2E driving, SOTA nuScenes+NAVSIM+CARLA
  • CarPlanner (ZJU, 2025) — first RL planner to beat IL+rules on nuPlan, consistency-regularized AR
  • SOLVE (HUST, 2025) — Sequential Q-Former + Trajectory CoT, VLM-E2E synergy
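
Since GRPO is the least self-explanatory term above (AlphaDrive entry), here is a minimal sketch of the group-relative advantage at its core: sample several candidate plans for the same scene, score them with a reward, and normalize within the group instead of training a critic. The function name and the example rewards are placeholders; AlphaDrive's actual reward design and policy update are not reproduced here.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages (DeepSeek-R1-style GRPO).

    For G rollouts answering the same prompt/scene, each sample is judged only
    relative to its siblings, so no value network is needed.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# Example: 8 candidate plans for one driving scene, scored by a rule-based reward
# (placeholder values). Better-than-average plans get positive advantage.
rewards = torch.tensor([0.9, 0.2, 0.7, 0.1, 0.8, 0.3, 0.6, 0.4])
adv = grpo_advantages(rewards)   # weights the policy-gradient update for each rollout's tokens
```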

Key design axes (from AutoVLA analysis)

Axis           | Options seen in corpus
---------------|----------------------------------------------------------
Language role  | supervision / runtime control / explanation / all three
Action space   | waypoints / controls / planner tokens / language tokens
Evaluation     | open-loop only / closed-loop sim / real-world
Architecture   | VLM + planner hook / true VLA / decoupled reasoning + E2E
Training       | IL only / IL + RL / GRPO / multi-stage
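
One way to make these axes actionable while ingesting is a per-paper tagging schema. The sketch below is a suggested structure only: field names, allowed values, and the placeholder entry are assumptions, not ground truth pulled from any paper.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PaperTags:
    """Per-paper tags mirroring the design axes above."""
    language_role: set[str]    # any subset of {"supervision", "runtime_control", "explanation"}
    action_space: Literal["waypoints", "controls", "planner_tokens", "language_tokens"]
    evaluation: Literal["open_loop", "closed_loop_sim", "real_world"]
    architecture: Literal["vlm_plus_planner_hook", "true_vla", "decoupled_reasoning_e2e"]
    training: tuple[str, ...]  # e.g. ("IL",), ("IL", "RL"), ("IL", "GRPO"), ("multi_stage",)

# Placeholder entry showing the intended shape; fill in real values per paper during ingestion.
example_paper = PaperTags(
    language_role={"supervision", "explanation"},
    action_space="waypoints",
    evaluation="closed_loop_sim",
    architecture="true_vla",
    training=("IL", "RL"),
)
```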

Questions to answer while ingesting

  • Is language used for supervision, runtime control, explanation, or all three?
  • What is the action space?
  • Does the paper improve actual planning, or mainly interpretation and interface quality?
  • Is the system a VLM with planner hooks, or a true VLA model?
  • Open-loop or closed-loop evaluation?
  • Does it handle long-tail / adversarial scenarios?

Warning

This area is recent and terminology is unstable. The wiki should be stricter than the papers are about the difference between vision-language reasoning and action generation.

Ingested papers

Batch 01 (general VLA + early driving)

Batch 02 (AutoVLA corpus)

Batch 03 (robotics VLA + world models + driving)

Batch 04 (self-supervised driving, temporal E2E, BEV perception, world models, embodied RL)

Batch 05 (VLA, world models, momentum planning, distillation)

Batch 06 (cross-embodiment robotics VLA + 3D occupancy perception)

Batch 06 (diffusion/flow planning, scaling laws, RL planning, VLM-E2E synergy, robotics VLA/diffusion)

Batch 07 (Physical Intelligence VLA family + robotics VLA advances)

Batch 08 (world models, parallel E2E, generative driving, evaluation, LLM-for-driving)

Batch 09 (orchestration, cross-embodiment, async planning, Gaussian representations, occupancy world models)