Papers

39 paper summaries tagged robotics

UniAct: Universal Actions for Enhanced Embodied Foundation Models
2025 CVPR 60

📄 **[Read on arXiv](https://arxiv.org/abs/2501.10105)** UniAct addresses a critical challenge in embodied AI: robot action data suffers from severe heterogeneity across platforms, control interfaces, and physical embodime…

paper robotics foundation-model cross-embodiment +1
Towards Embodiment Scaling Laws in Robot Locomotion
2025 CoRL 10

📄 **[Read on arXiv](https://arxiv.org/abs/2505.05753)** This paper investigates whether increasing robot diversity during training improves generalization to unseen robots, analogous to how data scaling improves language…

paper robotics scaling-laws locomotion +1
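
The summary cuts off before the method, but the framing invites a simple worked example: fit a power law to generalization error as a function of the number of training embodiments, the way language-model work fits loss against data size. A minimal sketch follows; the data points, functional form, and fitted exponent are invented for illustration and are not the paper's results.

```python
# Hypothetical embodiment scaling-law fit: assume zero-shot error on unseen
# robots falls off as err(N) ~ a * N^(-alpha) + c in the number of training
# embodiments N, and fit the parameters with scipy.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

# Made-up data: (number of training embodiments, zero-shot error on held-out robots)
n_embodiments = np.array([8, 32, 128, 512, 1000])
zero_shot_err = np.array([0.62, 0.41, 0.29, 0.21, 0.18])

(a, alpha, c), _ = curve_fit(power_law, n_embodiments, zero_shot_err, p0=(1.0, 0.5, 0.1))
print(f"fitted exponent alpha = {alpha:.2f}")
print(f"extrapolated error at N=4096: {power_law(4096, a, alpha, c):.3f}")
```
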
SpatialVLA: Exploring Spatial Representations for VLA Models
2025 arXiv 292

📄 **[Read on arXiv](https://arxiv.org/abs/2501.15830)** SpatialVLA addresses a fundamental limitation of existing VLA models: they operate on 2D visual inputs despite robot manipulation requiring understanding of 3D spatial r…

paper robotics vla spatial-reasoning +1
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
2025 arXiv 224

📄 **[Read on arXiv](https://arxiv.org/abs/2506.01844)** SmolVLA is a 450M-parameter open-source VLA model from Hugging Face that demonstrates competitive performance with models 10x larger while being trainable on a singl…

paper robotics vla efficient +1
Self-Improving Embodied Foundation Models
2025 NeurIPS 2025 18

📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…

paper robotics foundation-model self-improvement +3
RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
2025 ICLR

📄 **[Read on arXiv](https://arxiv.org/abs/2410.07864)** RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation…

paper robotics diffusion bimanual +1
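
For readers new to diffusion policies like RDT-1B, here is a minimal sketch of how such a model produces an action chunk at inference time: start from Gaussian noise and iteratively denoise with a learned network. The DDPM-style schedule, the `denoiser` interface, and the chunk shape are illustrative assumptions, not RDT-1B's exact design.

```python
# Minimal DDPM-style sampling loop for a diffusion policy (sketch).
import torch

def sample_action_chunk(denoiser, obs_emb, horizon=16, action_dim=14, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)           # noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, horizon, action_dim)             # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(a, t, obs_emb)                   # predict the injected noise
        a = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)  # stochastic step
    return a  # (1, horizon, action_dim) action chunk, e.g. 14 DoF for two arms
```
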
pi0.5: A Vision-Language-Action Model with Open-World Generalization
2025 arXiv 681

📄 **[Read on arXiv](https://arxiv.org/abs/2504.16054)** pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…

paper robotics vla foundation-model +2
pi*0.6: A VLA That Learns From Experience
2025 arXiv 93

📄 **[Read on arXiv](https://arxiv.org/abs/2511.14759)** pi*0.6 extends the pi0/pi0.5/pi0.6 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While prior models learn prim…

paper robotics vla reinforcement-learning +1
OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
2025 arXiv 364

📄 **[Read on arXiv](https://arxiv.org/abs/2502.19645)** OpenVLA-OFT presents a systematic empirical study of fine-tuning strategies for Vision-Language-Action models, identifying a recipe that boosts the original OpenVLA from…

paper robotics vla fine-tuning +1
Knowledge Insulating Vision-Language-Action Models
2025 arXiv preprint

📄 **[Read on arXiv](https://arxiv.org/abs/2505.23705)** This paper from Physical Intelligence identifies and addresses a critical problem in VLA training: gradient interference causes the pre-trained VLM backbone to degrade w…

paper robotics vla knowledge-preservation +1
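
A minimal sketch of the insulation idea the summary points at: let the action head read the VLM's features while blocking its gradients from flowing back into the backbone, so action learning cannot erode pre-trained knowledge. The module split and names below are illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

class InsulatedVLA(nn.Module):
    """Sketch: action gradients are stopped before they reach the VLM backbone."""
    def __init__(self, vlm_backbone: nn.Module, action_expert: nn.Module):
        super().__init__()
        self.vlm = vlm_backbone          # pre-trained VLM, updated only by VLM-style losses
        self.action_expert = action_expert

    def forward(self, images, text_tokens):
        feats = self.vlm(images, text_tokens)
        # .detach() blocks the action loss from perturbing the backbone weights
        return self.action_expert(feats.detach())
```
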
Helix: A Vision-Language-Action Model for Generalist Humanoid Control
2025 Figure AI Technical Report

📄 **[Read at Figure AI](https://www.figure.ai/news/helix)** Helix (Figure AI, Technical Report February 2025) is the first vision-language-action model to achieve high-rate continuous control of an entire…

paper robotics vla humanoid +1
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
2025 arXiv 602

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14734)** GR00T N1 addresses the challenge of creating general-purpose humanoid robots through an innovative "data pyramid" approach. Rather than relying solely on expensive…

robotics foundation-model vla humanoid
Gemini Robotics: Bringing AI into the Physical World
2025 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2503.20020)** Gemini Robotics introduces a family of AI models built on Gemini 2.0 designed to extend advanced multimodal capabilities into physical robotics. The work addresses…

robotics foundation-model multimodal reasoning
FAST: Efficient Action Tokenization for Vision-Language-Action Models
2025 RSS 2025 353

📄 **[Read on arXiv](https://arxiv.org/abs/2501.09747)** FAST (Frequency-space Action Sequence Tokenization) introduces a novel action tokenizer for VLA models that leverages signal processing to dramatically compress robot ac…

paper robotics vla tokenization +1
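
A rough sketch of the frequency-space compression FAST is named for: apply a discrete cosine transform along the time axis of an action chunk, then quantize, which zeroes most high-frequency coefficients. FAST additionally BPE-encodes the flattened coefficients into discrete tokens, which is omitted here; the scale factor and chunk shape are assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_actions(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """chunk: (horizon, action_dim) array of normalized actions."""
    coeffs = dct(chunk, axis=0, norm="ortho")          # time -> frequency
    return np.round(coeffs * scale).astype(np.int32)   # quantize; most entries become 0

def decompress_actions(q: np.ndarray, scale: float = 10.0) -> np.ndarray:
    return idct(q.astype(np.float64) / scale, axis=0, norm="ortho")

chunk = np.cumsum(np.random.randn(32, 7) * 0.02, axis=0)  # smooth fake trajectory
q = compress_actions(chunk)
print("nonzero coefficients:", np.count_nonzero(q), "of", q.size)
print("max reconstruction error:", np.abs(decompress_actions(q) - chunk).max())
```
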
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
2025 ICCV 54

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19757)** Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffu…

paper robotics vla diffusion-transformer +1
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
2025 arXiv 140

📄 **[Read on arXiv](https://arxiv.org/abs/2502.05855)** DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

paper robotics vla diffusion +2
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
2024 ICML 2025 (Spotlight) 139

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14803)** Video Prediction Policy (VPP) by Hu, Guo et al. (ICML 2025 Spotlight) proposes that video diffusion models (VDMs) are not just generators of future…

paper robotics video-prediction foundation-model +1
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2024 ICLR 2024 150

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

robotics transformer imitation-learning multimodal +3
UniSim: Learning Interactive Real-World Simulators
2024 ICLR 2024 (Oral) 200

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…

world-model diffusion simulation robotics +4
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
2024 CoRL 2024 Oral 100

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…

robotics transformer cross-embodiment imitation-learning +2
RT-H: Action Hierarchies Using Language
2024 RSS 2024

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

robotics vla transformer imitation-learning +2
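
A minimal sketch of the two-stage inference this hierarchy implies: the model is first queried for a language motion (e.g. "move arm forward"), then for the low-level action conditioned on that motion, which also lets a human correct the motion in between. `model.generate` is an assumed interface, not RT-H's actual API.

```python
def rt_h_step(model, image, task: str):
    # Stage 1: task + image -> intermediate language motion
    motion = model.generate(image, prompt=f"Task: {task}. What motion should the robot do?")
    # (A human operator could inspect or correct `motion` here.)
    # Stage 2: task + motion + image -> low-level action
    action = model.generate(image, prompt=f"Task: {task}. Motion: {motion}. Action?")
    return motion, action
```
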
RoboVLMs: What Matters in Building Vision-Language-Action Models
2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

robotics vla transformer multimodal +2
Robotic Control via Embodied Chain-of-Thought Reasoning
2024 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2407.08693)** ECoT (UC Berkeley / Stanford / University of Warsaw, 2024) introduces Embodied Chain-of-Thought reasoning for Vision-Language-Action (VLA) models, demonstrating that gen…

paper robotics vla chain-of-thought +1
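
An illustrative example of the kind of embodied reasoning chain ECoT trains a VLA to emit before its action tokens; the field order follows the paper's task/plan/subtask/move/gripper structure, but the concrete values are invented.

```python
ecot_output = {
    "task": "put the carrot in the pot",
    "plan": "pick up the carrot, then move it over the pot and release",
    "subtask": "pick up the carrot",
    "move": "move the gripper down and to the left, toward the carrot",
    "gripper_position": [142, 87],            # predicted 2D gripper location in the image
    "visible_objects": ["carrot", "pot", "table"],
    "action": [0.02, -0.04, -0.10, 0.0, 0.0, 0.0, 1.0],  # 7-DoF delta pose + gripper
}
```
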
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
2024 ICLR 2024 100

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

robotics vla imitation-learning multimodal +2
pi0: A Vision-Language-Action Flow Model for General Robot Control
2024 RSS 2025 1381

📄 **[Read on arXiv](https://arxiv.org/abs/2410.24164)** pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…

paper robotics vla foundation-model +1
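
A minimal flow-matching sketch in the spirit of pi0's action generation: train a network to predict the velocity that transports Gaussian noise to the action chunk along a straight-line path, then integrate that field at inference. The shapes, step count, and `velocity_net` interface are assumptions.

```python
import torch

def flow_matching_loss(velocity_net, actions, obs_emb):
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)       # random interpolation time in [0, 1]
    x_t = (1 - t) * noise + t * actions          # straight-line interpolant
    target_v = actions - noise                   # its constant velocity
    return ((velocity_net(x_t, t, obs_emb) - target_v) ** 2).mean()

@torch.no_grad()
def sample_actions(velocity_net, obs_emb, horizon=50, action_dim=7, steps=10):
    x = torch.randn(1, horizon, action_dim)      # start from noise
    for i in range(steps):                       # Euler integration toward t = 1
        t = torch.full((1, 1, 1), i / steps)
        x = x + velocity_net(x, t, obs_emb) / steps
    return x                                     # predicted action chunk
```
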
OpenVLA: An Open-Source Vision-Language-Action Model
2024 CoRL 1883

📄 **[Read on arXiv](https://arxiv.org/abs/2406.09246)** OpenVLA is a 7-billion parameter open-source vision-language-action model that demonstrates generalist robotic manipulation by fine-tuning a pretrained vision-lang…

paper robotics vla open-source
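
Because the model and weights are open, the released checkpoint can be queried directly. The sketch below follows the usage documented on the Hugging Face model card, reproduced from memory: verify the exact prompt format and `unnorm_key` against the card before relying on it.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.png")  # current camera frame
prompt = "In: What action should the robot take to pick up the mug?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-DoF end-effector action, un-normalized with dataset statistics
```
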
Octo: An Open-Source Generalist Robot Policy
2024 RSS 400

📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…

robotics transformer foundation-model open-source +3
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
2024 CoRL 2024

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

robotics vla multimodal imitation-learning +3
HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
2024 NeurIPS 134

📄 **[Read on arXiv](https://arxiv.org/abs/2409.20537)** HPT tackles the fundamental challenge of building generalist robot representations that work across heterogeneous embodiments with different sensor configurations,…

robotics foundation-model cross-embodiment proprioception
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

robotics vla transformer foundation-model +4
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
2024 arXiv 110

📄 **[Read on arXiv](https://arxiv.org/abs/2401.12963)** AutoRT addresses the critical data scarcity problem in robotics by using foundation models not as end-effectors but as intelligent orchestrators of large-scale rob…

robotics foundation-model orchestration data-collection
3D-VLA: A 3D Vision-Language-Action Generative World Model
2024 ICML 2024 140

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

robotics vla multimodal 3d-perception +3
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
2023 CoRL 2023 450

📄 **[Read on arXiv](https://arxiv.org/abs/2307.05973)** VoxPoser addresses a fundamental bottleneck in robot manipulation: translating open-ended natural language instructions into precise physical actions without requi…

robotics manipulation language-modeling multimodal +2
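
A toy sketch of the composable value-map idea: language is grounded as affordance and avoidance maps over a voxelized workspace, and their composition yields waypoints for an off-the-shelf motion planner (in VoxPoser, an LLM writes the composing code). The grid size, object positions, and weights below are invented.

```python
import numpy as np

def gaussian_bump(center, sigma=3.0, shape=(50, 50, 50)):
    idx = np.indices(shape)
    d2 = sum((idx[i] - center[i]) ** 2 for i in range(3))
    return np.exp(-d2 / (2 * sigma ** 2))

# "put the cup on the shelf, away from the vase" might compose to:
affordance = gaussian_bump(center=(40, 25, 30))   # target: shelf location
avoidance = gaussian_bump(center=(35, 25, 30))    # obstacle: the vase
value_map = affordance - 2.0 * avoidance          # penalize proximity to the vase

target_voxel = np.unravel_index(np.argmax(value_map), value_map.shape)
print("planner waypoint voxel:", target_voxel)    # handed to a motion planner
```
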
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
2023 arXiv 2686

📄 **[Read on arXiv](https://arxiv.org/abs/2307.15818)** RT-2 is the defining paper for the modern Vision-Language-Action (VLA) paradigm. It demonstrates that large vision-language models (VLMs) pretrained on internet-sc…

paper robotics vla embodied
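
A sketch of the action-as-text scheme RT-2 made standard: each continuous action dimension is discretized into 256 bins and the bin ids are emitted as tokens in the VLM's output string. The 256-bin count comes from the paper; the exact string format below is an assumption.

```python
import numpy as np

def action_to_tokens(action: np.ndarray, low=-1.0, high=1.0, bins=256) -> str:
    """action: 8 values (terminate, dx, dy, dz, droll, dpitch, dyaw, gripper)."""
    bin_ids = np.clip(np.round((action - low) / (high - low) * (bins - 1)), 0, bins - 1)
    return " ".join(str(int(b)) for b in bin_ids)   # e.g. "1 128 91 241 5 101 127 255"

def tokens_to_action(token_str: str, low=-1.0, high=1.0, bins=256) -> np.ndarray:
    bin_ids = np.array([int(t) for t in token_str.split()])
    return low + bin_ids / (bins - 1) * (high - low)
```
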
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
2023 TMLR 2023

📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…

robotics transformer imitation-learning multimodal +2
PaLM-E: An Embodied Multimodal Language Model
2023 ICML 2491

📄 **[Read on arXiv](https://arxiv.org/abs/2303.03378)** PaLM-E is a 562-billion parameter embodied multimodal language model created by Google that injects continuous sensor observations (images, point clouds, robot sta…

paper robotics vlm embodied
RT-1: Robotics Transformer for Real-World Control at Scale
2022 arXiv 2019

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

paper robotics vla transformer
A Generalist Agent
2022 TMLR 1018

📄 **[Read on arXiv](https://arxiv.org/abs/2205.06175)** Gato (Reed et al., TMLR 2022), developed by DeepMind, is a single transform…

paper robotics vla generalist-agent
On the Opportunities and Risks of Foundation Models
2021 arXiv (Stanford HAI) 6057

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

foundation-model nlp computer-vision robotics +3