Papers | ML Systems Wiki

WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model

2025 arXiv 81

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01941)** End-to-end driving models typically output a single trajectory and trust it entirely, with no mechanism to evaluate whether the predicted path is safe before execu…

paper autonomous-driving vla world-model +3

SpatialVLA: Exploring Spatial Representations for VLA Models

2025 arXiv 292

[Read on arXiv](https://arxiv.org/abs/2501.15830) SpatialVLA addresses a fundamental limitation of existing VLA models: they operate on 2D visual inputs despite robot manipulation requiring understanding of 3D spatial r…

paper robotics vla spatial-reasoning +1

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

2025 CVPR 2025

[Read on arXiv](https://arxiv.org/abs/2505.16805) SOLVE proposes a synergistic framework that combines a Vision-Language Model (VLM) reasoning branch (SOLVE-VLM) with an end-to-end (E2E) driving network (SOLVE-E2E), con…

paper autonomous-driving vla chain-of-thought +1

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

2025 arXiv 224

**[Read on arXiv](https://arxiv.org/abs/2506.01844)** SmolVLA is a 450M-parameter open-source VLA model from Hugging Face that demonstrates competitive performance with models 10x larger while being trainable on a singl…

paper robotics vla efficient +1

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

2025 CVPR 2025 89

📄 **[Read on arXiv](https://arxiv.org/abs/2503.09594)** Many driving VLM efforts improve language understanding (VQA, scene descriptions) but sacrifice actual driving performance. A model can correctly answer questions…

paper autonomous-driving vla vlm +3

Self-Improving Embodied Foundation Models

2025 NeurIPS 2025 18

📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…

paper robotics foundation-model self-improvement +3

pi0.5: A Vision-Language-Action Model with Open-World Generalization

2025 arXiv 681

[Read on arXiv](https://arxiv.org/abs/2504.16054) pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…

paper robotics vla foundation-model +2

pi*0.6: A VLA That Learns From Experience

2025 arXiv 93

[Read on arXiv](https://arxiv.org/abs/2511.14759) pi*0.6 extends the pi0/pi0.5/pi0.6 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While prior models learn prim…

paper robotics vla reinforcement-learning +1

Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation

2025 arxiv 100

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19755)** ORION bridges the reasoning-action gap in driving VLAs through a three-component architecture consisting of QT-Former (visual encoding), an LLM reasoning core, and…

paper autonomous-driving vla vlm +3

OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning

2025 arXiv 364

[Read on arXiv](https://arxiv.org/abs/2502.19645) OpenVLA-OFT presents a systematic empirical study of fine-tuning strategies for Vision-Language-Action models, identifying a recipe that boosts the original OpenVLA from…

paper robotics vla fine-tuning +1

Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model

2025 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2503.23463)** OpenDriveVLA introduces a Vision-Language Action model specifically designed for end-to-end autonomous driving. Unlike previous approaches that use VLMs as supplem…

autonomous-driving vla end-to-end language-model

Knowledge Insulating Vision-Language-Action Models

2025 arXiv preprint

[Read on arXiv](https://arxiv.org/abs/2505.23705) This paper from Physical Intelligence identifies and addresses a critical problem in VLA training: gradient interference causes the pre-trained VLM backbone to degrade w…

paper robotics vla knowledge-preservation +1

Helix: A Vision-Language-Action Model for Generalist Humanoid Control

2025 Figure AI Technical Report

:page_facing_up: **[Read at Figure AI](https://www.figure.ai/news/helix)** Helix (Figure AI, Technical Report February 2025) is the first vision-language-action model to achieve high-rate continuous control of an entire…

paper robotics vla humanoid +1

Groot N1 An Open Foundation Model For Generalist Humanoid Robots

2025 arXiv 602

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14734)** GR00T N1 addresses the challenge of creating general-purpose humanoid robots through an innovative "data pyramid" approach. Rather than relying solely on expensive…

robotics foundation-model vla humanoid

FAST: Efficient Action Tokenization for Vision-Language-Action Models

2025 RSS 2025 353

[Read on arXiv](https://arxiv.org/abs/2501.09747) FAST (Frequency-space Action Sequence Tokenization) introduces a novel action tokenizer for VLA models that leverages signal processing to dramatically compress robot ac…

paper robotics vla tokenization +1

EMMA: End-to-End Multimodal Model for Autonomous Driving

2025 TMLR 150

📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…

paper autonomous-driving vla vlm +3

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

2025 arXiv 55

📄 **[Read on arXiv](https://arxiv.org/abs/2505.16278)** DriveMoE introduces a dual-level Mixture-of-Experts (MoE) architecture to driving Vision-Language-Action models. The key innovation is applying expert specializati…

paper autonomous-driving vla mixture-of-experts +3

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

2025 ICCV 54

**[Read on arXiv](https://arxiv.org/abs/2503.19757)** Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffu…

paper robotics vla diffusion-transformer +1

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

2025 arXiv 140

[Read on arXiv](https://arxiv.org/abs/2502.05855) DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

paper robotics vla diffusion +2

Autovala Vision Language Action Model For End To End Autonomous Driving

2025 arXiv 110

📄 **[Read on arXiv](https://arxiv.org/abs/2506.13757)** AutoVLA presents a unified approach to autonomous driving that integrates vision, language understanding, and action generation within a single autoregressive mode…

autonomous-driving vla reinforcement-learning end-to-end +1

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

2025 arXiv 75

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang, arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2503.07608)** AlphaDrive is the first application of GRPO (Group Relative Policy Optimization) reinforc…

paper autonomous-driving vla vlm +3

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

2025 arXiv 42

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Marco Pavone + 37 co-authors (NVIDIA), arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2511.00088)** Alpamayo-R1 is NVIDIA's production-grade Vision-Language-Action (…

paper autonomous-driving vla vlm +3

VLP: Vision Language Planning for Autonomous Driving

2024 CVPR 155

📄 **[Read on arXiv](https://arxiv.org/abs/2401.05577)** VLP (Vision Language Planning) by Pan et al. (CVPR 2024) represents a fundamentally different approach to using language in autonomous driving compared to instruct…

paper autonomous-driving vla vlm +2

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

2024 arXiv 102

📄 **[Read on arXiv](https://arxiv.org/abs/2410.22313)** Two dominant paradigms exist in autonomous driving: large vision-language models (LVLMs) with strong reasoning but poor trajectory precision, and end-to-end (E2E)…

paper autonomous-driving vla vlm +3

RT-H: Action Hierarchies Using Language

2024 RSS 2024

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

robotics vla transformer imitation-learning +2

RoboVLMs: What Matters in Building Vision-Language-Action Models

2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

robotics vla transformer multimodal +2

Robotic Control via Embodied Chain-of-Thought Reasoning

2024 arXiv

[Read on arXiv](https://arxiv.org/abs/2407.08693) ECoT (UC Berkeley / Stanford / University of Warsaw, 2024) introduces Embodied Chain-of-Thought reasoning for Vision-Language-Action (VLA) models, demonstrating that gen…

paper robotics vla chain-of-thought +1

RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators

2024 ICLR 2024 100

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

robotics vla imitation-learning multimodal +2

pi0: A Vision-Language-Action Flow Model for General Robot Control

2024 RSS 2025 1381

[Read on arXiv](https://arxiv.org/abs/2410.24164) pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…

paper robotics vla foundation-model +1

OpenVLA: An Open-Source Vision-Language-Action Model

2024 CoRL 1883

📄 **[Read on arXiv](https://arxiv.org/abs/2406.09246)** OpenVLA is a 7-billion parameter open-source vision-language-action model that demonstrates generalist robotic manipulation by fine-tuning a pretrained vision-lang…

paper robotics vla open-source

Lmdrive Closed Loop End To End Driving With Large Language Models

2024 CVPR 294

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07488)** LMDrive is the first system to demonstrate and benchmark LLM-based driving in closed-loop simulation, introducing the LangAuto benchmark with ~64K instruction-foll…

paper autonomous-driving llm e2e +2

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

2024 CoRL 2024

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

robotics vla multimodal imitation-learning +3

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

robotics vla transformer foundation-model +4

DriveLM: Driving with Graph Visual Question Answering

2024 ECCV 448

📄 **[Read on arXiv](https://arxiv.org/abs/2312.14150)** DriveLM formalizes driving reasoning as Graph Visual Question Answering (GVQA), where QA pairs are connected via logical dependencies forming a reasoning graph tha…

paper autonomous-driving vlm reasoning +2

DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model

2024 IEEE RA-L 576

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao, IEEE Robotics and Automation Letters, 2024. 📄 **[Read on arXiv](https://arxiv.org/abs/2310.01412)** DriveGPT4 applie…

paper autonomous-driving vla vlm +2

Covla Comprehensive Vision Language Action Dataset For Autonomous Driving

2024 WACV 2025 Oral 30

📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…

autonomous-driving vla multimodal dataset +3

3D-VLA: A 3D Vision-Language-Action Generative World Model

2024 ICML 2024 140

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

robotics vla multimodal 3d-perception +3

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

2023 arXiv 2686

📄 **[Read on arXiv](https://arxiv.org/abs/2307.15818)** RT-2 is the defining paper for the modern Vision-Language-Action (VLA) paradigm. It demonstrates that large vision-language models (VLMs) pretrained on internet-sc…

paper robotics vla embodied

Reason2Drive Towards Interpretable And Chain Based Reasoning For Autonomous Driving

2023 ECCV 107

📄 **[Read on arXiv](https://arxiv.org/abs/2312.03661)** Reason2Drive provides the largest reasoning chain dataset for driving (>600K video-text pairs from nuScenes, Waymo, and ONCE) and introduces an aggregated evaluati…

paper autonomous-driving vla reasoning +2

GPT-Driver: Learning to Drive with GPT

2023 NeurIPS FMDM Workshop 396

📄 **[Read on arXiv](https://arxiv.org/abs/2310.01415)** GPT-Driver reformulates autonomous driving motion planning as a language modeling problem. Scene context (object positions, velocities, lane geometry) and ego vehi…

paper autonomous-driving vla llm +2

DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States

2023 arXiv 241

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09245)** DriveMLM proposes using a multimodal LLM as a plug-and-play behavioral planning module within existing autonomous driving stacks (Apollo, Autoware), rather than re…

paper autonomous-driving vla llm +2

RT-1: Robotics Transformer for Real-World Control at Scale

2022 arXiv 2019

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

paper robotics vla transformer

A Generalist Agent

2022 TMLR 1018

📄 **[Read on arXiv](https://arxiv.org/abs/2205.06175)** Reed et al., Transactions on Machine Learning Research (TMLR), 2022. - [Paper](https://arxiv.org/abs/2205.06175) Gato, developed by DeepMind, is a single transform…

paper robotics vla generalist-agent

Talk2Car: Taking Control of Your Self-Driving Car

2019 EMNLP-IJCNLP 182

📄 **[Read on arXiv](https://arxiv.org/abs/1909.10838)** For autonomous vehicles to be truly useful as personal transportation, passengers should be able to issue natural-language commands like "park behind that blue car…

paper autonomous-driving vla grounding +2

Textual Explanations for Self-Driving Vehicles

2018 ECCV 427

📄 **[Read on arXiv](https://arxiv.org/abs/1807.11546)** End-to-end driving models produce control signals without any rationale, making them opaque and untrustworthy for safety-critical deployment. This paper by Kim et…

paper autonomous-driving vla explainability +2

End-to-end Driving via Conditional Imitation Learning

2018 ICRA 1227

📄 **[Read on arXiv](https://arxiv.org/abs/1710.02410)** This paper introduces conditional imitation learning for end-to-end autonomous driving, where a neural network policy is conditioned on a discrete high-level comma…

paper autonomous-driving imitation-learning e2e +1