Robotics
Robotics is relevant to this wiki primarily as the origin of vision-language-action (VLA) models that now influence autonomous driving. The robotics community pioneered the idea that large pretrained models can serve as general-purpose controllers for embodied agents, and the driving community has rapidly adopted and adapted these ideas.
The VLA revolution in robotics
The modern VLA trajectory begins with generalist agents and scales through increasingly capable architectures.
Gato (A Generalist Agent, 2022) demonstrated that a single transformer, trained on a mixture of text, images, and control tokens, could play Atari, caption images, and control a robot arm. The key insight was not state-of-the-art performance on any single task but that a unified token-based architecture could handle heterogeneous modalities including actions. Gato established the "everything as tokens" paradigm that EMMA later brought to driving.
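The "everything as tokens" idea can be made concrete with a toy shared vocabulary: each modality keeps its own local ids, which are offset into disjoint ranges of one integer vocabulary so a single transformer can model the mixed sequence autoregressively. The vocabulary sizes and layout below are hypothetical, not Gato's actual scheme.

```python
# Illustrative sketch of a shared token vocabulary across modalities.
# Ranges are hypothetical, chosen only to show the offset bookkeeping.

TEXT_VOCAB = 32_000    # subword ids:  [0, 32000)
IMAGE_VOCAB = 1_024    # patch codes:  [32000, 33024)
ACTION_BINS = 256      # action bins:  [33024, 33280)

def to_shared_vocab(text_ids, patch_codes, action_bins):
    """Offset each modality's local ids into the shared vocabulary."""
    return (list(text_ids)
            + [TEXT_VOCAB + c for c in patch_codes]
            + [TEXT_VOCAB + IMAGE_VOCAB + b for b in action_bins])

def modality_of(token):
    """Recover which modality a shared-vocab token belongs to."""
    if token < TEXT_VOCAB:
        return "text"
    if token < TEXT_VOCAB + IMAGE_VOCAB:
        return "image"
    return "action"

# Two text tokens, two image-patch codes, two binned action dimensions
# flatten into one stream the transformer treats uniformly.
seq = to_shared_vocab([17, 512], [3, 999], [128, 7])
```

Because actions live in the same vocabulary as text and images, "predict the next token" doubles as "predict the next action", which is the property EMMA later exploits for driving.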
RT-1 (Rt 1 Robotics Transformer For Real World Control At Scale, 2022) moved from proof-of-concept to real-world scale. Trained on 130k demonstrations across 700+ tasks, RT-1 showed that transformers could absorb large-scale robotic data and generalize across tasks when conditioned on language instructions. The architecture used a FiLM-conditioned EfficientNet backbone with a transformer trunk, outputting discretized actions.
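The FiLM conditioning mentioned above can be sketched in a few lines: the language instruction's embedding predicts a per-channel scale (gamma) and shift (beta) that modulate the visual features. The dimensions and random projections here are illustrative stand-ins for RT-1's learned networks.

```python
import numpy as np

# Minimal FiLM (feature-wise linear modulation) sketch: an instruction
# embedding produces gamma/beta applied per channel to image features.
# EMBED_DIM, CHANNELS, and the random weights are hypothetical.

rng = np.random.default_rng(0)

EMBED_DIM = 8    # hypothetical language-embedding size
CHANNELS = 4     # hypothetical feature-map channels

W_gamma = rng.normal(size=(EMBED_DIM, CHANNELS))
W_beta = rng.normal(size=(EMBED_DIM, CHANNELS))

def film(features, instruction_embedding):
    """Modulate features of shape (H, W, C) with per-channel gamma/beta."""
    gamma = instruction_embedding @ W_gamma   # (CHANNELS,)
    beta = instruction_embedding @ W_beta     # (CHANNELS,)
    return features * gamma + beta            # broadcasts over H and W

feats = np.ones((2, 2, CHANNELS))
emb = rng.normal(size=EMBED_DIM)
out = film(feats, emb)
```

The same visual backbone thus computes different features for "pick up the apple" versus "open the drawer" without any architectural change, which is what lets one network cover 700+ tasks.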
PaLM-E (Palm E An Embodied Multimodal Language Model, 2023) asked whether a large language model could serve as the reasoning backbone for an embodied agent. By injecting visual tokens into PaLM's 562B-parameter language model, PaLM-E demonstrated that LLM-scale pretraining transfers to robotic planning and that larger models exhibit positive transfer from language/vision tasks to embodied control.
RT-2 (Rt 2 Vision Language Action Models Transfer Web Knowledge To Robotic Control, 2023) completed the loop by fine-tuning a VLM (PaLI-X / PaLM-E) directly on robotic data, representing actions as text tokens. RT-2 showed dramatic generalization: the model could follow instructions involving concepts never seen in robot data, leveraging web-scale visual and linguistic knowledge. This established the VLA blueprint: pretrain at web scale, fine-tune on embodied data, emit actions as tokens.
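Representing actions as text tokens amounts to a simple round trip: discretize each continuous action dimension into bins, emit the bin indices as a string the VLM can generate, then parse the string back into bin-center continuous commands at execution time. The bin count and action range below are illustrative, not RT-2's exact configuration.

```python
# Sketch of the actions-as-text round trip: continuous action -> bin
# indices rendered as a string -> parsed back to continuous values.
# ACTION_BINS and the [-1, 1] range are illustrative choices.

ACTION_BINS = 256

def encode_action(action, low=-1.0, high=1.0):
    """Discretize each dimension and render as a space-separated string."""
    bins = []
    for a in action:
        a = min(max(a, low), high)
        bins.append(min(int((a - low) / (high - low) * ACTION_BINS),
                        ACTION_BINS - 1))
    return " ".join(str(b) for b in bins)

def decode_action(text, low=-1.0, high=1.0):
    """Map emitted bin tokens back to bin-center continuous values."""
    step = (high - low) / ACTION_BINS
    return [low + (int(tok) + 0.5) * step for tok in text.split()]

text = encode_action([0.0, -1.0, 0.99])   # e.g. "128 0 254"
cmd = decode_action(text)
```

The quantization error is at most half a bin width (here 1/256 of the range), which is why fairly coarse bins suffice for manipulation commands.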
RoboCat (Robocat A Self Improving Generalist Agent For Robotic Manipulation, 2023) took a complementary path to RT-2 by focusing on multi-embodiment generalization and self-improvement rather than web-scale pretraining. Building on the Gato architecture, RoboCat trained a single policy across 253 manipulation tasks on three real robot embodiments (Sawyer, Panda, KUKA), demonstrating that heterogeneous multi-robot data produces positive cross-task transfer. Its key innovation was an autonomous self-improvement loop: the trained model generates its own practice data, which is filtered for success and folded back into training, yielding measurable gains each iteration. RoboCat adapted to the unseen KUKA 14-DoF bimanual arm with ~80% success using only 100-1000 demonstrations, establishing that generalist robot policies can rapidly acquire new embodiments.
Octo (Octo An Open Source Generalist Robot Policy, RSS 2024) was the first fully open-source generalist robot policy, trained on 800K trajectories from the Open X-Embodiment dataset. Octo's modular transformer architecture with diffusion-based action decoding handled heterogeneous observations (multi-view images, language, proprioception) and could be fine-tuned to novel robots with ~100 demonstrations in under 5 hours on a consumer GPU. At 93M parameters, Octo-Base matched the 55B-parameter RT-2-X while being fully open, establishing the open-source baseline that OpenVLA later built upon.
OpenVLA (Openvla An Open Source Vision Language Action Model, 2024) scaled the open-source VLA paradigm pioneered by Octo to 7B parameters using a VLM backbone (Prismatic VLM with Llama 2). OpenVLA showed that the RT-2 paradigm works at smaller scale with open weights, enabling the broader research community to build on VLA foundations.
Transfer from robotics to driving
Several ideas from the robotics VLA lineage now appear directly in driving systems:
- Language-conditioned control: RT-2's language-to-action paradigm maps to driving systems like Lmdrive Closed Loop End To End Driving With Large Language Models and Simlingo Vision Only Closed Loop Autonomous Driving With Language Action Alignment.
- Action tokenization: Gato's and RT-2's approach of discretizing actions as tokens is adopted by Emma End To End Multimodal Model For Autonomous Driving.
- VLM as reasoning backbone: PaLM-E's architecture pattern appears in Senna Bridging Large Vision Language Models And End To End Autonomous Driving and Drivemlm Aligning Multi Modal Llms With Behavioral Planning States.
- Large-scale behavior cloning: RT-1's data scaling approach influences the push for larger driving datasets.
Where transfer breaks
- Safety criticality: Driving has far less tolerance for exploratory failure than tabletop manipulation. A dropped object is recoverable; a collision at highway speed is not.
- Multi-agent dynamics: Robotic manipulation is typically single-agent. Driving involves continuous interaction with dozens of agents whose behavior is partially adversarial and only partially observable.
- Temporal scale: Robotic manipulation episodes are seconds to minutes. Driving requires sustained competence over hours with rare but critical events.
- Evaluation standards: Robotics evaluation often measures task success rate in controlled settings. Driving demands statistical safety arguments over millions of miles.
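The "millions of miles" point can be quantified with a standard back-of-envelope argument: to bound a per-mile failure rate below p with confidence c after observing zero failures, you need roughly n ≥ ln(1 − c) / ln(1 − p) failure-free miles (the "rule of three" gives n ≈ 3/p at 95% confidence). The numbers below are illustrative, not a claim about any specific system.

```python
import math

def miles_needed(p_failure_per_mile, confidence=0.95):
    """Failure-free miles needed to bound the per-mile failure rate
    below p_failure_per_mile at the given confidence (zero failures
    observed)."""
    return math.log(1 - confidence) / math.log(1 - p_failure_per_mile)

# Bounding a 1-in-a-million-miles failure rate at 95% confidence
# requires roughly three million failure-free miles.
n = miles_needed(1e-6)
```

No comparable statistical burden exists for a tabletop task evaluated over a few hundred trials, which is the core of the evaluation gap.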
The next frontier: humanoid robots and industry-scale VLA (2025)
GR00T N1 (Groot N1 An Open Foundation Model For Generalist Humanoid Robots, 2025) extends the VLA paradigm to full humanoid robots through a dual-system architecture: a vision-language module (System 2) at 10Hz for reasoning and a diffusion transformer (System 1) at 120Hz for motor control. Its "data pyramid" integrates web video, synthetic data, and real demonstrations, achieving 76.8% success on GR-1 humanoid tasks. GR00T N1 is released as an open foundation model.
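The dual-system pattern boils down to two control loops at different rates: the slow vision-language module refreshes a plan while the fast motor policy acts on every tick against the most recent plan. The simulation below runs one second of ticks with placeholder planner and policy bodies; only the rate bookkeeping reflects the described architecture.

```python
# Sketch of a two-rate dual-system loop: System 2 refreshes a plan at
# SLOW_HZ while System 1 acts at FAST_HZ using the latest plan. The
# "plan" strings are placeholders for the VLA's latent goal output.

SLOW_HZ = 10      # System 2 (vision-language) refresh rate
FAST_HZ = 120     # System 1 (motor policy) control rate

def run_one_second():
    plan = None
    plan_version = 0
    log = []  # (tick, plan) pairs actually sent to the motors
    for tick in range(FAST_HZ):
        # System 2 fires on every 12th fast tick (120 / 10 = 12).
        if tick % (FAST_HZ // SLOW_HZ) == 0:
            plan_version += 1
            plan = f"plan-{plan_version}"
        # System 1 always acts, conditioned on the most recent plan.
        log.append((tick, plan))
    return log, plan_version

log, versions = run_one_second()
```

The motor loop never blocks on the slow module; it simply reuses a slightly stale plan, trading plan freshness for reactive control.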
Gemini Robotics (Gemini Robotics Bringing Ai Into The Physical World, 2025) brings Google's Gemini 2.0 into physical robotics with a two-tier model family. Gemini Robotics-ER handles spatial reasoning while the full Gemini Robotics model operates as a VLA at 50Hz via a cloud-local hybrid architecture. It demonstrates over 80% success on diverse manipulation tasks and 79% on long-horizon tasks including origami folding, with cross-embodiment transfer to novel platforms.
Cosmos (Cosmos World Foundation Model Platform For Physical Ai, 2025) provides the world model infrastructure that complements direct VLA approaches. By generating high-fidelity synthetic training data through diffusion and autoregressive world models, Cosmos addresses the data scarcity problem that limits embodied AI training.
Self-improvement beyond imitation
Self-Improving Embodied Foundation Models (Self Improving Embodied Foundation Models, 2025) from Google DeepMind completes the LLM training pipeline analogy for robotics by adding an autonomous RL stage after supervised fine-tuning. The model learns a "steps-to-go" prediction that serves as a self-generated reward function, enabling robots to practice autonomously without manual reward engineering. Policies trained with just 10% imitation data plus 1% autonomous practice outperform 80% imitation-only baselines. Most notably, robots achieve true behavioral generalization, learning to manipulate novel objects (bananas) never seen in training by discovering effective strategies through self-play. This establishes that the pretrain-SFT-RL pipeline from language models transfers to embodied AI.
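The steps-to-go reward can be sketched directly: a model trained to predict how many steps remain until task completion becomes a reward signal, because the predicted count dropping between consecutive states indicates progress. The lookup table below is a stand-in for the learned predictor.

```python
# Sketch of steps-to-go as a self-generated reward: reward equals the
# predicted decrease in remaining steps between consecutive states.
# The state names and table are hypothetical stand-ins for a learned
# steps-to-go model.

steps_to_go = {"far": 30, "near": 10, "grasped": 2, "done": 0}

def reward(prev_state, next_state):
    """Reward = predicted drop in steps remaining (progress made)."""
    return steps_to_go[prev_state] - steps_to_go[next_state]

trajectory = ["far", "near", "grasped", "done"]
rewards = [reward(a, b) for a, b in zip(trajectory, trajectory[1:])]
```

Regressing away from the goal yields negative reward under the same rule, so no task-specific reward engineering is needed, which is what enables the autonomous practice loop.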
Cross-embodiment and efficient VLA (2025)
UniAct (Uniact Universal Actions For Enhanced Embodied Foundation Models, CVPR 2025) proposes a Universal Action Space via VQ codebooks (256 codes, 128-dim) that captures generic atomic behaviors shared across 28 robot embodiments. The 0.5B model outperforms 14x larger models (OpenVLA-7B) by solving the action heterogeneity problem at the representation level rather than scaling model size. 40% of codebook entries produce consistent, interpretable behaviors across platforms.
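The universal-action idea rests on vector quantization: a latent action embedding is snapped to its nearest codebook entry, and the discrete code id is what transfers across embodiments (each robot decodes the code with its own head). The codebook shape below mirrors the stated 256 codes × 128 dimensions, but the entries are random placeholders, not UniAct's learned codes.

```python
import numpy as np

# Sketch of VQ action quantization: nearest-neighbor lookup into a
# shared codebook. Codebook entries here are random; in UniAct they
# are learned to capture generic atomic behaviors.

rng = np.random.default_rng(0)
NUM_CODES, CODE_DIM = 256, 128
codebook = rng.normal(size=(NUM_CODES, CODE_DIM))

def quantize(latent):
    """Return (code id, code vector) of the nearest codebook entry."""
    dists = np.linalg.norm(codebook - latent, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# A latent close to code 42 quantizes back to code 42.
latent = codebook[42] + 0.01 * rng.normal(size=CODE_DIM)
idx, code = quantize(latent)
```

Because every embodiment's policy emits ids from the same 256-entry space, the action heterogeneity problem is solved at the representation level rather than by scaling the model.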
Dita (Dita Scaling Diffusion Transformer For Generalist Vla Policy, ICCV 2025) scales diffusion transformers for cross-embodiment VLA with in-context conditioning: language, visual, and action tokens are processed in a unified causal Transformer sequence. At 334M parameters, Dita achieves 83.7% on SimplerEnv (vs OpenVLA's 16.3%) and adapts to real-world Franka tasks with just 10-shot fine-tuning.
SmolVLA (Smolvla A Vision Language Action Model For Affordable Robotics, 2025) from Hugging Face demonstrates that a 450M-parameter VLA with Flow Matching, layer skipping, and asynchronous inference can match 3.3B models (78.3% vs 61.7% on real-world tasks) while training on a single GPU. This makes practical VLA research accessible without large compute budgets.
Embodiment Scaling Laws (Embodiment Scaling Laws In Robot Locomotion, CoRL 2025) provides the first empirical evidence that training on diverse morphologies follows power-law scaling for generalization to unseen robots. Using ~1,000 procedurally generated embodiments (GENBOT-1K), the work achieves zero-shot sim-to-real transfer to Unitree Go2 and H1 hardware.
Data collection and orchestration at scale
AutoRT (Autort Embodied Foundation Models For Large Scale Orchestration Of Robotic Agents, 2024) from Google DeepMind addresses the data scarcity bottleneck from a different angle: instead of using foundation models as controllers, it uses VLMs and LLMs as intelligent orchestrators of large-scale robot data collection. Over 7 months, AutoRT deployed 53 robots across 4 buildings, collecting 77,000 episodes with 6,650+ unique tasks. A "Robot Constitution" (inspired by constitutional AI) ensures safe task generation, improving safety from 26% to 87% under adversarial testing. One human can supervise 3-5 robots simultaneously.
HPT (Hpt Scaling Proprioceptive Visual Learning With Heterogeneous Pre Trained Transformers, NeurIPS 2024) provides the architecture for consuming such heterogeneous data. Its modular stem-trunk-head design processes diverse proprioceptive and visual inputs through embodiment-specific stems into a shared transformer trunk (up to 1B+ parameters). HPT demonstrates clear scaling laws for robotics: performance improves predictably with data size, diversity, model size, and compute. This is among the first evidence that the language model scaling paradigm transfers to robotic control, with 10-30% gains in simulation and 20%+ gains on real robot tasks.
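The stem-trunk-head split can be illustrated with plain linear maps: each embodiment's stem projects its native observation dimension into a shared token width, one shared trunk processes those tokens, and a per-embodiment head emits actions in that robot's native dimension. The embodiment names, dimensions, and linear maps below are illustrative stand-ins for HPT's learned networks.

```python
import numpy as np

# Sketch of the stem-trunk-head pattern: embodiment-specific stems and
# heads around a single shared trunk. All weights here are random
# placeholders; dimensions are hypothetical.

rng = np.random.default_rng(0)
TOKEN_DIM = 16  # shared trunk width (illustrative)

def make_linear(d_in, d_out):
    W = rng.normal(size=(d_in, d_out))
    return lambda x: x @ W

# Two embodiments with different native observation/action dimensions.
stems = {"arm": make_linear(7, TOKEN_DIM),
         "quadruped": make_linear(12, TOKEN_DIM)}
heads = {"arm": make_linear(TOKEN_DIM, 7),
         "quadruped": make_linear(TOKEN_DIM, 12)}
trunk = make_linear(TOKEN_DIM, TOKEN_DIM)   # shared across embodiments

def policy(embodiment, observation):
    return heads[embodiment](trunk(stems[embodiment](observation)))

arm_action = policy("arm", rng.normal(size=7))
quad_action = policy("quadruped", rng.normal(size=12))
```

Only the trunk sees data from every embodiment, which is where the cross-embodiment scaling gains accrue; adding a new robot means training a new stem and head, not a new trunk.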
Present state and open problems
- Scale gap: Robotics VLA datasets (Open X-Embodiment: ~1M episodes) are orders of magnitude smaller than language pretraining corpora. Whether VLA scaling laws mirror language scaling laws is unknown, though embodiment scaling laws are now emerging.
- Sim-to-real: Both robotics and driving face sim-to-real transfer challenges, but the domains have developed largely separate simulation ecosystems. Embodiment scaling work shows promising zero-shot transfer.
- Action space design: The optimal action representation (continuous vs. discretized tokens vs. VQ codebooks vs. diffusion) remains contested. UniAct's universal actions and Dita's continuous diffusion offer competing paradigms.
- Real-time inference: Large VLA models (7B+ parameters) struggle with real-time control. SmolVLA demonstrates that compact models with async inference can match larger ones, suggesting that architecture and inference design can matter as much as raw scale.
- Cross-embodiment transfer: Whether a single VLA can control both a robot arm and a vehicle remains speculative but increasingly plausible, with UniAct covering 28 embodiments and scaling laws showing positive transfer across morphology classes.
Key papers
| Paper | Contribution |
|---|---|
| A Generalist Agent | Gato: single transformer for heterogeneous tasks including control |
| Rt 1 Robotics Transformer For Real World Control At Scale | Large-scale real-world robotic control via transformers |
| Palm E An Embodied Multimodal Language Model | LLM-scale model as embodied reasoning backbone |
| Rt 2 Vision Language Action Models Transfer Web Knowledge To Robotic Control | VLM fine-tuned for robotic action, web knowledge transfer |
| Robocat A Self Improving Generalist Agent For Robotic Manipulation | Multi-embodiment generalist with self-improvement loop, 253 tasks |
| Octo An Open Source Generalist Robot Policy | Open-source 93M generalist policy with diffusion action head, 800K trajectories (RSS 2024) |
| Openvla An Open Source Vision Language Action Model | Open-source 7B VLA model |
| Emma End To End Multimodal Model For Autonomous Driving | Driving system adopting robotics-style action tokenization |
| Lmdrive Closed Loop End To End Driving With Large Language Models | Language-conditioned closed-loop driving |
| Senna Bridging Large Vision Language Models And End To End Autonomous Driving | VLM reasoning backbone for driving |
| Groot N1 An Open Foundation Model For Generalist Humanoid Robots | Open dual-system VLA for humanoid robots |
| Gemini Robotics Bringing Ai Into The Physical World | Industry-scale VLA from Gemini 2.0 |
| Cosmos World Foundation Model Platform For Physical Ai | World foundation model platform for physical AI |
| Self Improving Embodied Foundation Models | Self-improving EFMs via steps-to-go RL, behavioral generalization |
| Uniact Universal Actions For Enhanced Embodied Foundation Models | Universal action space via VQ codebooks for cross-embodiment VLA |
| Dita Scaling Diffusion Transformer For Generalist Vla Policy | DiT-based VLA with in-context conditioning, 10-shot adaptation |
| Smolvla A Vision Language Action Model For Affordable Robotics | 450M VLA competitive with 10x larger models, async inference |
| Embodiment Scaling Laws In Robot Locomotion | First embodiment scaling laws across ~1000 robot morphologies |
| Autort Embodied Foundation Models For Large Scale Orchestration Of Robotic Agents | Foundation model orchestration for large-scale robot data collection |
| Hpt Scaling Proprioceptive Visual Learning With Heterogeneous Pre Trained Transformers | Cross-embodiment scaling laws via stem-trunk-head architecture |
| Video Prediction Policy A Generalist Robot Policy With Predictive Visual Representations | Video diffusion as predictive encoder for robot policies (ICML 2025 Spotlight) |
| Helix A Vla For Generalist Humanoid Control | Dual-system VLA for 35-DoF humanoid control at 200Hz (Figure AI) |
| Voxposer Composable 3D Value Maps For Robotic Manipulation With Language Models | Zero-shot manipulation via LLM-composed 3D value maps + MPC (CoRL 2023) |