Self-Improving Embodied Foundation Models
Overview
This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (understanding what tasks to perform), they typically struggle with behavioral generalization (learning entirely new ways to act). Drawing inspiration from the three-stage LLM training pipeline -- pretraining, supervised fine-tuning, and reinforcement learning -- the authors observe that current EFMs have focused primarily on the first two stages, missing the crucial RL component that has proven transformative for language models.
The paper proposes a two-stage post-training framework. Stage 1 performs supervised fine-tuning on robot demonstration data using standard behavioral cloning plus a novel "steps-to-go" prediction task that trains the model to estimate how many timesteps remain until goal completion. Stage 2 introduces Self-Improvement, where robots practice tasks autonomously using a data-driven reward function derived from the learned steps-to-go predictions -- eliminating the need for manual reward engineering.
The results are remarkable: policies trained with just 10% of the imitation dataset plus 1% additional robot interaction through Self-Improvement outperform behavioral cloning policies trained on 20-80% of the full dataset. Most significantly, robots achieve true behavioral generalization, learning to manipulate novel objects (bananas) never seen in training data by discovering effective strategies autonomously. This establishes a clear path toward adaptive robots that continuously improve with minimal human intervention.
Key Contributions
- Two-stage post-training framework: Extends EFM training beyond behavioral cloning with a self-improvement stage using autonomous RL, mirroring the proven LLM training pipeline
- Steps-to-go reward function: Derives dense, shaped rewards from the model's own learned distance-to-goal predictions, eliminating manual reward engineering: r(o_t, a_t, o_{t+1}, g) = d(o_t, g) - d(o_{t+1}, g) (see the telescoping note after this list)
- Dramatic sample efficiency: 10% imitation data + 1% autonomous practice outperforms 20-80% imitation data alone, demonstrating that small amounts of RL practice yield outsized improvements
- Behavioral generalization: Robots learn entirely new manipulation strategies for novel objects (bananas) not present in training data, discovering effective approaches such as pushing from the tips to prevent rotation
- Foundation model pretraining is critical: Ablation shows models initialized from scratch or with unimodal pretraining fail at self-improvement; multimodal foundation model knowledge is essential
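Because each reward is the one-step decrease of a single learned quantity d(o, g), the undiscounted return telescopes:

Σ_{t=0}^{T-1} r(o_t, a_t, o_{t+1}, g) = d(o_0, g) - d(o_T, g)

so maximizing return amounts to minimizing the predicted steps-to-go of the final state -- the same structure as potential-based reward shaping, with d acting as the potential.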
Architecture / Method
┌─────────────────────────────────────────────────────────────┐
│ Self-Improving Embodied Foundation Model │
│ (PaLI-3B backbone) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Stage 1: Supervised Fine-Tuning │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ┌─────────────┐ ┌──────────────────────────────┐ │ │
│ │ │ Robot Demos │──►│ PaLI-3B Foundation Model │ │ │
│ │ │ (obs, goal) │ │ │ │ │
│ │ └─────────────┘ │ Loss 1: Behavioral Cloning │ │ │
│ │ │ Loss 2: Steps-to-Go d(o,g) │ │ │
│ │ └──────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Stage 2: Self-Improvement (Autonomous RL) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌────────┐ act ┌────────────┐ observe │ │
│ │ │ Policy │──────►│ Environment │──────┐ │ │
│ │ │(PaLI-3B│◄──────│ │ │ │ │
│ │ └────────┘reward └────────────┘ │ │ │
│ │ ▲ │ │ │
│ │ │ ▼ │ │
│ │ ┌────┴──────────────────────────────────────┐ │ │
│ │ │ Reward: r = d(o_t, g) - d(o_{t+1}, g) │ │ │
│ │ │ (decrease in predicted steps-to-go) │ │ │
│ │ │ No manual reward engineering needed │ │ │
│ │ └───────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Result: 10% BC data + 1% RL practice > 80% BC data alone │
└─────────────────────────────────────────────────────────────┘

The framework builds on PaLI-3B as the foundation model backbone.
Stage 1 -- Supervised Fine-Tuning: The model is fine-tuned on robot demonstrations with two objectives: (1) standard behavioral cloning loss predicting actions from observations and goals, and (2) a steps-to-go prediction loss training the model to estimate remaining timesteps to goal completion. The steps-to-go objective creates an internal distance metric that the model can later use for reward generation.
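A minimal sketch of the Stage 1 update, with two small output heads standing in for PaLI-3B's text-token decoding of actions and steps-to-go; the module, dimensions, and bin counts below are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

class SFTModel(torch.nn.Module):
    """Stand-in for the PaLI-3B backbone with two output heads."""
    def __init__(self, obs_dim=64, feat_dim=512, action_bins=256, max_steps=100):
        super().__init__()
        self.encoder = torch.nn.Linear(obs_dim, feat_dim)           # placeholder backbone
        self.action_head = torch.nn.Linear(feat_dim, action_bins)   # discretized actions
        self.steps_head = torch.nn.Linear(feat_dim, max_steps)      # d(o, g) as a classifier

    def forward(self, obs_goal):
        h = torch.relu(self.encoder(obs_goal))
        return self.action_head(h), self.steps_head(h)

def sft_loss(model, obs_goal, expert_actions, steps_to_go_labels):
    action_logits, steps_logits = model(obs_goal)
    bc = F.cross_entropy(action_logits, expert_actions)       # Loss 1: behavioral cloning
    stg = F.cross_entropy(steps_logits, steps_to_go_labels)   # Loss 2: steps-to-go
    return bc + stg

model = SFTModel()
obs_goal = torch.randn(8, 64)                  # fused (observation, goal) features
expert_actions = torch.randint(0, 256, (8,))   # demonstrator's discretized actions
steps_labels = torch.randint(0, 100, (8,))     # timesteps remaining in each demo
sft_loss(model, obs_goal, expert_actions, steps_labels).backward()
```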
Stage 2 -- Self-Improvement: The robot practices tasks autonomously. At each timestep, the reward is computed as the decrease in predicted steps-to-go: r = d(o_t, g) - d(o_{t+1}, g). This reward is dense (available at every timestep), shaped (provides gradient signal throughout the trajectory), and requires no manual engineering. Success detection also derives from the steps-to-go predictions, enabling fully autonomous practice sessions.
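A sketch of how the learned predictions can supply both rewards and episode termination, reusing the SFTModel stand-in above; the expected-value readout and the success threshold are assumptions for illustration.

```python
import torch

@torch.no_grad()
def predicted_steps_to_go(model, obs_goal):
    # Read d(o, g) out as the expectation of the steps-to-go distribution.
    _, steps_logits = model(obs_goal)
    probs = steps_logits.softmax(-1)
    bins = torch.arange(probs.shape[-1], dtype=probs.dtype)
    return (probs * bins).sum(-1)

def shaped_reward(model, obs_goal_t, obs_goal_t1):
    # r = d(o_t, g) - d(o_{t+1}, g): positive when the step moved closer to the goal.
    return predicted_steps_to_go(model, obs_goal_t) - predicted_steps_to_go(model, obs_goal_t1)

def is_success(model, obs_goal, threshold=2.0):
    # Hypothetical success test: predicted steps-to-go near zero ends the episode,
    # letting the robot reset and practice the next goal without human supervision.
    return predicted_steps_to_go(model, obs_goal) < threshold
```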
The RL training uses standard policy gradient methods on top of the foundation model, with the key insight that the pretrained representations provide a stable foundation for RL that prevents catastrophic forgetting of semantic knowledge while enabling behavioral adaptation.
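The paper specifies only that standard policy gradient methods are used; as one concrete possibility, a REINFORCE-style update over a practice trajectory might look like the following (the discount, return normalization, and optimizer choice are assumptions, not the paper's algorithm).

```python
import torch

def reinforce_step(model, optimizer, obs_goals, actions, rewards, gamma=0.99):
    """One policy-gradient update on a collected practice trajectory.

    obs_goals: (T, 64) float  fused observation/goal features
    actions:   (T,)    long   actions sampled during practice
    rewards:   (T,)    float  shaped rewards r_t = d(o_t, g) - d(o_{t+1}, g)
    """
    # Discounted returns, accumulated backwards over the trajectory.
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    action_logits, _ = model(obs_goals)
    log_probs = action_logits.log_softmax(-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_probs * returns).mean()  # raise log-prob of high-return actions

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A conservative learning rate (e.g. torch.optim.Adam(model.parameters(), lr=1e-5)) keeps updates close to the SFT policy, in the spirit of preserving the pretrained semantic knowledge while adapting behavior.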

Results

| Domain | Metric | Before Self-Improvement | After Self-Improvement |
|---|---|---|---|
| LanguageTable (sim) | Data efficiency | 10% BC data alone | 10% BC + 1% RL outperforms 80% BC |
| LanguageTable (real) | Success rate | 62-63% | 87-88% |
| BananaTable (novel objects) | Success rate | ~63% | ~85% |
| Aloha (bimanual) | Insertion success | Baseline BC | Improved with RL |
In simulation, 10% imitation data plus 1% autonomous interaction outperforms the 20% and even 80% imitation-only baselines, a roughly 2-8x reduction in required imitation data compared with simply scaling demonstrations. Real-world LanguageTable experiments show success rates improving from ~62% to ~88%. The BananaTable experiment demonstrates behavioral generalization: robots trained only on geometric blocks learn to manipulate bananas, discovering effective strategies never shown in demonstrations.

Limitations & Open Questions
- Foundation model dependency: The approach critically requires multimodal foundation model pretraining; it does not work from scratch or with unimodal pretraining, limiting applicability to groups with access to large pretrained models
- Task scope: Validated primarily on tabletop manipulation and insertion tasks; scaling to more complex, longer-horizon tasks (like driving) remains open
- Safety during practice: Autonomous self-improvement requires the robot to explore, which may be unsafe in uncontrolled environments or safety-critical domains like driving
Connections
- Extends the VLA paradigm of RT-2 (Vision-Language-Action Models Transfer Web Knowledge to Robotic Control) with a self-improvement stage, completing the LLM training pipeline analogy
- Builds on the PaLM-E (An Embodied Multimodal Language Model) foundation-model approach to robotics
- The steps-to-go reward relates to the goal-conditioned RL and hindsight experience replay literature
- Relevant to the RL-for-driving trend seen in AlphaDrive (Unleashing the Power of VLMs in Autonomous Driving), which applies GRPO to driving VLMs
- Democratization potential connects to open-source VLA efforts such as OpenVLA (An Open-Source Vision-Language-Action Model)