Vision Language Action

This page tracks the bridge from multimodal understanding to action generation, informed by the AutoVLA corpus of 18 papers spanning 2018–2025.

Working definition

A VLA system consumes visual context and language-conditioned intent, then emits actions or action-relevant latent state. In robotics, actions may be motor commands or low-level policies. In driving, actions may be trajectories, waypoints, controls, or planner tokens.
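
As a rough interface sketch of that definition (class and field names here are illustrative, not taken from any of the cited systems):

```python
from dataclasses import dataclass
from typing import Sequence

import numpy as np


@dataclass
class VLAObservation:
    """Visual context plus language-conditioned intent."""
    camera_images: Sequence[np.ndarray]  # one HxWx3 array per camera
    instruction: str                     # e.g. "turn left at the next intersection"


@dataclass
class VLAAction:
    """One driving-flavored action space: future waypoints plus low-level controls."""
    waypoints: np.ndarray  # (T, 2) future ego-frame positions in meters
    steer: float           # normalized steering in [-1, 1]
    throttle: float        # normalized throttle in [0, 1]


class VLAPolicy:
    """Anything mapping (vision, language) -> action fits the working definition above."""

    def act(self, obs: VLAObservation) -> VLAAction:
        raise NotImplementedError
```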

Important distinctions

  • VLM vs VLA: understanding-only systems are not action models. A VLM that describes a driving scene is not the same as a VLA that outputs a trajectory.
  • Language as supervision vs language as runtime interface: CIL conditions on a small set of discrete navigation commands at runtime; BDD-X uses explanatory language only as training-time supervision; LMDrive follows free-form instructions at runtime; DriveLM uses language for structured reasoning.
  • Action tokens vs continuous controls: EMMA tokenizes everything including trajectories; Senna decouples language reasoning from continuous E2E planning (see the sketch after this list).
  • Offline imitation vs interactive control: Open-loop evaluation (GPT-Driver, DriveGPT4) vs closed-loop (LMDrive, SimLingo, ORION).
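
To make the action-space distinction concrete, here is a minimal sketch (toy trajectory, illustrative bin settings; not any particular paper's scheme) of tokenizing a trajectory versus regressing it as continuous waypoints:

```python
import numpy as np

# Toy 4-waypoint plan in the ego frame (meters): the same plan, two action spaces.
trajectory = np.array([[0.0, 1.5], [0.1, 3.2], [0.3, 5.0], [0.6, 6.9]])

# (a) Action tokens: quantize each coordinate into one of 256 bins over a fixed range,
#     so the plan becomes a short sequence of discrete symbols a language model can emit.
lo, hi, n_bins = -10.0, 10.0, 256
tokens = np.clip(((trajectory - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1).flatten()

# (b) Continuous waypoints/controls: a regression head outputs the real-valued array
#     directly, and training minimizes e.g. an L2 loss against the expert trajectory.
def l2_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

# Decoding the tokens back to coordinates shows the quantization error the tokenized route accepts.
decoded = (tokens.reshape(-1, 2) + 0.5) / n_bins * (hi - lo) + lo
print(tokens[:4], l2_loss(decoded, trajectory))
```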

Three waves of driving VLA (from AutoVLA analysis)

Wave 1: Foundations (2018–2019)

  • Conditional Imitation Learning established intent-conditioned driving with a 4-word vocabulary
  • BDD-X introduced language-action alignment through attention-based explanations
  • Talk2Car grounded free-form language to objects in driving scenes

Wave 2: LLM-as-Planner (2023–2024)

  • Explosion of LLM/VLM applications to driving
  • Key tension: language for planning (GPT-Driver) vs language for explanation (DriveGPT4) vs language for structured reasoning (DriveLM)
  • Critical finding: open-loop evaluation is insufficient — LMDrive demonstrated closed-loop is essential

Wave 3: Reasoning-to-Action (2025)

  • Focus shifts to bridging the reasoning-action gap
  • RL enters the picture: AlphaDrive applies GRPO-based RL (DeepSeek R1-style) to driving VLMs; a minimal sketch of the group-relative advantage follows after this list
  • World models complement VLAs: WoTE uses BEV world models for trajectory safety verification
  • MoE architectures: DriveMoE addresses mode averaging through expert specialization
  • Production deployment: Alpamayo-R1 achieves 99ms latency with real road testing
  • Adaptive reasoning: AutoVLA introduces dual-process thinking (fast/slow) for driving VLAs with RL fine-tuning
  • 3D-grounded VLA: OpenDriveVLA integrates hierarchical 3D queries into LLM, achieves SOTA at 0.5B scale
  • Distillation as deployment strategy: DiMA jointly trains MLLM + vision planner, discards MLLM at inference (80% collision reduction, zero overhead)
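
As a minimal sketch of the group-relative idea behind GRPO (illustrative rewards and grouping; not AlphaDrive's or DeepSeek's actual training code, which also applies a PPO-style clipped objective and a KL penalty):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """GRPO-style group-relative advantages: normalize each sampled plan's reward
    against the other samples for the same scene, with no learned value function."""
    mean, std = group_rewards.mean(), group_rewards.std() + 1e-8
    return (group_rewards - mean) / std

# Illustrative only: four candidate plans sampled for one driving scene, scored by a
# hypothetical reward (e.g. rule compliance + progress - collision penalty).
rewards = np.array([0.9, 0.4, 0.7, 0.1])
advantages = grpo_advantages(rewards)
# Plans above the group mean get positive advantage and are reinforced; the rest are suppressed.
print(advantages)
```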

Key design axes

Axis           | Options                                                 | Key papers
Language role  | Supervision / runtime control / explanation             | BDD-X / LMDrive / DriveGPT4
Action space   | Controls / waypoints / planner tokens / language tokens | CIL / VAD / ORION / EMMA
Architecture   | VLM + planner / true VLA / decoupled                    | DriveGPT4 / SimLingo / Senna
Evaluation     | Open-loop / closed-loop sim / real-world                | GPT-Driver / LMDrive / Alpamayo-R1
Training       | IL / IL+RL / GRPO / multi-stage                         | CIL / Alpamayo-R1 / AlphaDrive / ORION

Emerging consensus (as of 2025)

  1. Closed-loop evaluation is non-negotiable for driving VLAs — open-loop metrics don't predict driving competence
  2. Language is most valuable as intermediate reasoning, not as the action output itself (Senna's human-readable bridge, ORION's planning token)
  3. RL is the next frontier — SFT ceiling appears real; AlphaDrive and Alpamayo-R1 both use RL to push beyond imitation
  4. World models and VLAs are complementary, not competing — WoTE shows physics-based verification can catch VLA failures (see the sketch after this list)
  5. MoE architectures address the fundamental mode-averaging problem in diverse driving scenarios
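
A schematic of the complementary role in point 4 (hypothetical function and grid layout, not WoTE's method): candidate trajectories are checked against a world model's predicted BEV occupancy, and colliding ones are vetoed.

```python
import numpy as np

def verify_trajectories(candidates: np.ndarray, predicted_occupancy: np.ndarray,
                        resolution: float = 0.5) -> np.ndarray:
    """Keep only candidate trajectories whose waypoints avoid predicted-occupied BEV cells.

    candidates: (N, T, 2) waypoints in the ego frame (meters)
    predicted_occupancy: (H, W) grid from a world model, 1 = predicted occupied
    """
    h, w = predicted_occupancy.shape
    keep = []
    for traj in candidates:
        # Map meters to grid indices with the ego vehicle at the grid center.
        cols = np.clip((traj[:, 0] / resolution + w // 2).astype(int), 0, w - 1)
        rows = np.clip((traj[:, 1] / resolution + h // 2).astype(int), 0, h - 1)
        keep.append(not predicted_occupancy[rows, cols].any())
    return np.array(keep)

# Illustrative usage: the world model predicts one occupied cell 4 m ahead of the ego vehicle.
occ = np.zeros((64, 64))
occ[40, 32] = 1
straight = np.stack([np.zeros(8), np.linspace(1.0, 8.0, 8)], axis=1)  # drives through that cell
swerve = straight + np.array([2.0, 0.0])                              # offset 2 m to the side
print(verify_trajectories(np.stack([straight, swerve]), occ))         # -> [False  True]
```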

Robotics VLA frontier (2025)

The VLA paradigm continues to push boundaries in robotics:

  • Video Prediction Policy ("Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations", ICML 2025 Spotlight) reinterprets video diffusion models as predictive visual encoders rather than generators, achieving an 18.6% improvement on CALVIN by extracting future-encoding representations from a single VDM forward pass.
  • Helix ("Helix: A VLA for Generalist Humanoid Control", Figure AI 2025) is the first VLA to control an entire humanoid upper body (35 DoF) at 200 Hz via a dual-system architecture: a 7B VLM (System 2, 7-9 Hz) feeds latent representations to an 80M-parameter visuomotor policy (System 1, 200 Hz). It demonstrates dual-robot coordination on shared long-horizon tasks; a toy sketch of the dual-rate pattern follows below.
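
A toy sketch of that dual-rate pattern (placeholder functions, not Figure's implementation): a slow reasoning module refreshes a latent goal a few times per second, while a fast policy consumes whatever latent is newest on every high-rate tick.

```python
import time

def slow_reasoner(image, instruction):
    """Stand-in for a System-2 VLM (~7-9 Hz): returns a latent goal representation."""
    return {"latent_goal": instruction}   # placeholder latent

def fast_policy(image, latent):
    """Stand-in for a System-1 visuomotor policy (200 Hz): maps latent + vision to joint targets."""
    return {"joint_targets": [0.0] * 35}  # placeholder 35-DoF command

def dual_rate_loop(duration_s: float = 0.1, fast_hz: float = 200.0, slow_hz: float = 8.0):
    """Run the fast policy every tick while the slow module refreshes the latent at its own rate."""
    latent, last_slow = slow_reasoner(None, "pick up the cup"), time.monotonic()
    for _ in range(int(duration_s * fast_hz)):
        now = time.monotonic()
        if now - last_slow >= 1.0 / slow_hz:      # System 2 refreshes a few times per second
            latent, last_slow = slow_reasoner(None, "pick up the cup"), now
        action = fast_policy(None, latent)        # System 1 acts on every 200 Hz tick
        time.sleep(1.0 / fast_hz)
    return action
```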

Driving-specific open questions

  • Does language add supervision, controllability, interpretability, or only presentation value?
  • Can VLA-style pretraining reduce the amount of task-specific driving data needed?
  • What is the right action abstraction for driving: controls, trajectories, anchors, or planner state?
  • Will the decoupled approach (Senna: separate reasoning + planning) or unified approach (EMMA: everything as tokens) win?
  • Can GRPO-style RL scale to production driving the way it scaled for LLM reasoning?