Tags

3D-VLA: A 3D Vision-Language-Action Generative World Model

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

A Generalist Agent

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2205.06175)** Reed et al., Transactions on Machine Learning Research (TMLR), 2022. - [Paper](https://arxiv.org/abs/2205.06175) Gato, developed by DeepMind, is a single transform…

Autort Embodied Foundation Models For Large Scale Orchestration Of Robotic Agents

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2401.12963)** AutoRT addresses the critical data scarcity problem in robotics by using foundation models not as end-effectors but as intelligent orchestrators of large-scale rob…

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

source-summary

[Read on arXiv](https://arxiv.org/abs/2502.05855) DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

source-summary

**[Read on arXiv](https://arxiv.org/abs/2503.19757)** Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffu…

FAST: Efficient Action Tokenization for Vision-Language-Action Models

source-summary

[Read on arXiv](https://arxiv.org/abs/2501.09747) FAST (Frequency-space Action Sequence Tokenization) introduces a novel action tokenizer for VLA models that leverages signal processing to dramatically compress robot ac…

Gemini Robotics Bringing Ai Into The Physical World

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.20020)** Gemini Robotics introduces a family of AI models built on Gemini 2.0 designed to extend advanced multimodal capabilities into physical robotics. The work addresses…

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

Groot N1 An Open Foundation Model For Generalist Humanoid Robots

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14734)** GR00T N1 addresses the challenge of creating general-purpose humanoid robots through an innovative "data pyramid" approach. Rather than relying solely on expensive…

Helix: A Vision-Language-Action Model for Generalist Humanoid Control

source-summary

:page_facing_up: **[Read at Figure AI](https://www.figure.ai/news/helix)** Helix (Figure AI, Technical Report February 2025) is the first vision-language-action model to achieve high-rate continuous control of an entire…

Hpt Scaling Proprioceptive Visual Learning With Heterogeneous Pre Trained Transformers

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2409.20537)** HPT tackles the fundamental challenge of building generalist robot representations that work across heterogeneous embodiments with different sensor configurations,…

Knowledge Insulating Vision-Language-Action Models

source-summary

[Read on arXiv](https://arxiv.org/abs/2505.23705) This paper from Physical Intelligence identifies and addresses a critical problem in VLA training: gradient interference causes the pre-trained VLM backbone to degrade w…

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

Octo An Open Source Generalist Robot Policy

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…

On The Opportunities And Risks Of Foundation Models

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

Open Questions: Vision-Language-Action Models

query

Stream-specific open questions for the VLA pillar. See wiki/queries/open-questions for the full tree across all streams. 1. **Dual-system generality:** The dual-system pattern (slow VLM at 7-10 Hz + fast motor policy at…

OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning

source-summary

[Read on arXiv](https://arxiv.org/abs/2502.19645) OpenVLA-OFT presents a systematic empirical study of fine-tuning strategies for Vision-Language-Action models, identifying a recipe that boosts the original OpenVLA from…

OpenVLA: An Open-Source Vision-Language-Action Model

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2406.09246)** OpenVLA is a 7-billion parameter open-source vision-language-action model that demonstrates generalist robotic manipulation by fine-tuning a pretrained vision-lang…

PaLM-E: An Embodied Multimodal Language Model

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2303.03378)** PaLM-E is a 562-billion parameter embodied multimodal language model created by Google that injects continuous sensor observations (images, point clouds, robot sta…

pi*0.6: A VLA That Learns From Experience

source-summary

[Read on arXiv](https://arxiv.org/abs/2511.14759) pi*0.6 extends the pi0/pi0.5/pi0.6 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While prior models learn prim…

pi0.5: A Vision-Language-Action Model with Open-World Generalization

source-summary

[Read on arXiv](https://arxiv.org/abs/2504.16054) pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…

pi0: A Vision-Language-Action Flow Model for General Robot Control

source-summary

[Read on arXiv](https://arxiv.org/abs/2410.24164) pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…

RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation

source-summary

[Read on arXiv](https://arxiv.org/abs/2410.07864) RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation --…

Robocat A Self Improving Generalist Agent For Robotic Manipulation

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…

RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

Robotic Control via Embodied Chain-of-Thought Reasoning

source-summary

[Read on arXiv](https://arxiv.org/abs/2407.08693) ECoT (UC Berkeley / Stanford / University of Warsaw, 2024) introduces Embodied Chain-of-Thought reasoning for Vision-Language-Action (VLA) models, demonstrating that gen…

Robotics

concept

Robotics is relevant to this wiki primarily as the origin of vision-language-action (VLA) models that now influence autonomous driving. The robotics community pioneered the idea that large pretrained models can serve as…

RoboVLMs: What Matters in Building Vision-Language-Action Models

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

RT-1: Robotics Transformer for Real-World Control at Scale

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2307.15818)** RT-2 is the defining paper for the modern Vision-Language-Action (VLA) paradigm. It demonstrates that large vision-language models (VLMs) pretrained on internet-sc…

RT-H: Action Hierarchies Using Language

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

Scaling Cross Embodied Learning One Policy For Manipulation Navigation Locomotion And Aviation

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…

Self-Improving Embodied Foundation Models

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

source-summary

**[Read on arXiv](https://arxiv.org/abs/2506.01844)** SmolVLA is a 450M-parameter open-source VLA model from Hugging Face that demonstrates competitive performance with models 10x larger while being trainable on a singl…

SpatialVLA: Exploring Spatial Representations for VLA Models

source-summary

[Read on arXiv](https://arxiv.org/abs/2501.15830) SpatialVLA addresses a fundamental limitation of existing VLA models: they operate on 2D visual inputs despite robot manipulation requiring understanding of 3D spatial r…

Towards Embodiment Scaling Laws in Robot Locomotion

source-summary

**[Read on arXiv](https://arxiv.org/abs/2505.05753)** This paper investigates whether increasing robot diversity during training improves generalization to unseen robots, analogous to how data scaling improves language…

UniAct: Universal Actions for Enhanced Embodied Foundation Models

source-summary

**[Read on arXiv](https://arxiv.org/abs/2501.10105)** UniAct addresses a critical challenge in embodied AI: robot action data suffers from severe heterogeneity across platforms, control interfaces, and physical embodime…

Unisim Learning Interactive Real World Simulators

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

source-summary

:page_facing_up: **[Read on arXiv](https://arxiv.org/abs/2412.14803)** Video Prediction Policy (VPP) by Hu, Guo et al. (ICML 2025 Spotlight) proposes that video diffusion models (VDMs) are not just generators of future…

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2307.05973)** VoxPoser addresses a fundamental bottleneck in robot manipulation: translating open-ended natural language instructions into precise physical actions without requi…

Pages tagged robotics