
pi*0.6: A VLA That Learns From Experience

Read on arXiv

Overview

pi*0.6 extends the pi0/pi0.5 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While the prior models learn primarily from human demonstrations (imitation learning), this work introduces RECAP (RL with Experience and Corrections via Advantage-conditioned Policies), an iterated offline RL method that lets the VLA improve from its own successes and failures during real-world operation, as well as from human intervention corrections. The underlying pi0.6 model upgrades the family with a larger VLM backbone (Gemma 3 4B) and a larger action expert (860M parameters); pi*0.6 is pi0.6 trained with RECAP, adding the ability to condition on a binarized advantage indicator. RECAP addresses a critical capability gap: imitation learning is bounded by demonstration quality and cannot discover better strategies through trial and error.

RECAP substantially improved task throughput on difficult real-world tasks: espresso preparation roughly doubled (~15→30 successes/hr), box assembly approximately doubled, and diverse laundry folding improved ~60% (~5→8 successes/hr). Failure rates were roughly halved across tasks. Because RECAP is offline RL, it remains compatible with large pre-trained VLA models for which on-policy RL would be prohibitively expensive. pi*0.6 represents a significant step toward continuously self-improving robots that get better through deployment rather than requiring ever more human demonstrations.

Key Contributions

  • RECAP, iterated offline RL for VLAs: An offline reinforcement learning framework designed specifically for large pre-trained VLA models; it conditions the policy on binary advantage text tokens ("Advantage: positive/negative") to learn from mixed-quality experience data (see the prompt sketch after this list)
  • Learning from deployment: The model improves from autonomous experience (successes and failures), human corrections (interventions during deployment), and reward signals -- three heterogeneous data sources integrated into a single learning framework
  • Large throughput gains: Espresso preparation and box assembly approximately doubled in successes/hr, and diverse laundry folding improved ~60%, demonstrating practical value beyond benchmark improvements
  • ~2x failure reduction: Halved failure rates on evaluated tasks, showing that RL from experience directly addresses common failure modes that imitation learning cannot fix
  • Flow matching + RL integration: Demonstrates that flow matching action generation (as in pi0) is compatible with offline RL, maintaining the continuous action advantages while enabling policy improvement
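
The conditioning mechanism from the first contribution can be made concrete with a short sketch. This is a minimal illustration, not the authors' code: the exact prompt template, and the choice to prepend the indicator as plain text, are assumptions; the paper specifies only that a binary "Advantage: positive/negative" text token conditions the policy.

    def build_prompt(task_instruction: str, advantage_is_positive: bool) -> str:
        """Prefix the task instruction with the binary advantage indicator."""
        tag = "positive" if advantage_is_positive else "negative"
        return f"Advantage: {tag}\n{task_instruction}"

    # During training the tag comes from the value function's estimate for the
    # logged action; at deployment the policy is conditioned on "positive" to
    # elicit the behavior it has learned to associate with good outcomes.
    train_prompt = build_prompt("make an espresso", advantage_is_positive=False)
    deploy_prompt = build_prompt("make an espresso", advantage_is_positive=True)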

Architecture / Method

              RECAP: Iterative Self-Improvement Loop
              ───────────────────────────────────────

  ┌──────────────────────┐
  │ VLA Policy (pi*0.6)  │◄──────────────────────────┐
  │ (flow matching)      │                           │
  └──────────┬───────────┘                           │
             │ deploy                                │
             ▼                                       │
  ┌──────────────────────┐                           │
  │ Real-World Rollouts  │                           │
  │ ┌────────────────┐   │                           │
  │ │ Successes      │   │                           │
  │ │ Failures       │   │                           │
  │ │ Human Corrects │   │                           │
  │ └────────┬───────┘   │                           │
  └──────────┼───────────┘                           │
             │ + reward labels                       │
             ▼                                       │
  ┌──────────────────────┐                           │
  │ Value Function       │                           │
  │ Training             │                           │
  │ V(s) ──► A(s,a)      │   advantage estimates     │
  └──────────┬───────────┘                           │
             │                                       │
             ▼                                       │
  ┌──────────────────────┐      updated policy       │
  │ Advantage-Cond.      ├───────────────────────────┘
  │ Policy Fine-tuning   │
  │ (offline RL)         │
  └──────────────────────┘

The RECAP training loop for pi*0.6

pi0.6 uses a Gemma 3 4B language backbone with an 860M-parameter action expert and flow matching for continuous action generation. A separate 670M-parameter distributional value function shares the VLA architecture and predicts steps-to-completion, from which advantage signals are derived.
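
As a rough summary of the published component sizes, here is a configuration sketch; the field names are illustrative, not taken from any released code.

    from dataclasses import dataclass

    # Component sizes as reported for pi0.6/pi*0.6; field names are hypothetical.
    @dataclass(frozen=True)
    class Pi06Config:
        vlm_backbone: str = "gemma-3-4b"           # vision-language backbone
        action_expert_params: int = 860_000_000    # flow-matching action expert
        value_fn_params: int = 670_000_000         # distributional value function
        action_head: str = "flow_matching"         # continuous action generation
        value_target: str = "steps_to_completion"  # quantity the value fn predicts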

RECAP operates through iterative cycles:

  1. Deployment and data collection: The VLA policy is deployed on real robots, collecting trajectories that include autonomous successes, autonomous failures, and human-corrected interventions. Each trajectory is labeled with reward/outcome information.

  2. Value function training: A distributional value function (670M parameters, sharing the VLA architecture) is trained on multi-task demonstration data to predict steps-to-completion. It is then used to derive binary advantage indicators for each state-action pair in the deployment data.

  3. Advantage-conditioned policy training: The VLA policy is fine-tuned with binary advantage conditioning: the model receives a text token ("Advantage: positive" or "Advantage: negative") derived from the value function estimates, biasing generation toward actions that led to better outcomes. This is an offline RL approach -- no additional environment interaction is needed during training.

  4. Iteration: The improved policy is redeployed, collecting new (ideally better) experience, and the cycle repeats; a schematic sketch of one iteration follows this list.
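
One full iteration can be sketched as below. Every callable here is a placeholder for a component the paper describes only at a high level, and thresholding the advantage at zero is one plausible binarization rule, not necessarily the authors' exact choice.

    from typing import Callable, List, Tuple

    Transition = Tuple[object, object]  # (observation, logged action)

    def recap_iteration(
        policy,
        dataset: List[Transition],
        deploy: Callable,        # runs the policy on robots, returns labeled rollouts
        fit_value_fn: Callable,  # trains the distributional value function
        advantage: Callable,     # derives A(s, a) from value estimates
        finetune: Callable,      # advantage-conditioned offline fine-tuning
    ):
        # 1. Deploy and collect: successes, failures, and human corrections,
        #    each trajectory labeled with reward/outcome information.
        dataset.extend(deploy(policy))

        # 2. Fit the value function to predict steps-to-completion, then use
        #    it to score every logged state-action pair.
        value_fn = fit_value_fn(dataset)

        # 3. Binarize advantages and fine-tune the policy conditioned on the
        #    resulting "Advantage: positive/negative" label (offline; no new
        #    environment interaction is needed).
        labeled = [(obs, act, advantage(value_fn, obs, act) > 0.0)
                   for obs, act in dataset]
        return finetune(policy, labeled)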

The method is designed as offline RL because online methods such as PPO or SAC would require running the full VLA model in a simulator or real environment during training, which is prohibitively expensive for billion-parameter models. RECAP instead extracts learning signal from previously collected deployment data.
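
One way to see why flow matching composes cleanly with this scheme: the fine-tuning objective stays the standard conditional flow-matching regression on logged action chunks, with the advantage token simply folded into the conditioning. The PyTorch sketch below assumes a linear noise-to-data path and an action_expert(noisy, t, cond) interface; both are illustrative choices, not the paper's published implementation.

    import torch

    def flow_matching_loss(action_expert, prompt_embedding, actions):
        """Conditional flow-matching loss on a logged action chunk.

        `prompt_embedding` is assumed to already encode the task text plus
        the binary advantage token, so offline RL reduces to regression.
        """
        noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
        t = torch.rand(actions.shape[0], 1, 1)   # one flow time per sample
        noisy = (1.0 - t) * noise + t * actions  # linear path: noise -> data
        target_velocity = actions - noise        # time derivative of the path
        pred = action_expert(noisy, t.reshape(-1), prompt_embedding)
        return torch.nn.functional.mse_loss(pred, target_velocity)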


Results

Task                        Metric           pi0 baseline   pi*0.6 (RECAP)   Improvement
Laundry folding (diverse)   Successes/hr     ~5             ~8               ~60% increase
Espresso preparation        Successes/hr     ~15            ~30              ~2x (100% increase)
Box assembly                Completions/hr   baseline       ~2x baseline     ~2x increase
All tasks                   Failure rate     baseline       ~0.5x baseline   ~2x reduction

  • Throughput gains: Doubled on espresso preparation (~15→30/hr) and approximately doubled on box assembly; diverse laundry folding improved ~60% (~5→8/hr)
  • Halved failures: ~2x reduction in failure rates, showing RECAP directly addresses failure modes that imitation learning misses
  • Autonomous endurance: The espresso robot operated for 13 hours without intervention, and laundry folding ran autonomously for 2+ hours in novel environments; the improved reliability enables longer uninterrupted deployments
  • Heterogeneous data integration: Learns simultaneously from three data types (autonomous experience, human corrections, and reward signals)
  • Iterative improvement: Performance improves across RECAP iterations, confirming that the self-improvement loop works in practice

Limitations

  • Requires a reward/outcome signal for each trajectory, which may be difficult to define for ambiguous or open-ended tasks
  • The offline RL approach is sample-efficient but still requires substantial deployment data collection, which is slow and expensive in the real world
  • Evaluation is limited to three tasks on Physical Intelligence's proprietary hardware; generalization of RECAP to other VLA architectures or embodiments is not demonstrated
  • The value function estimation may be noisy with limited deployment data, potentially leading to suboptimal advantage conditioning
  • No comparison with online RL methods or other offline RL algorithms (CQL, IQL); RECAP's relative advantage among RL approaches is unclear

Connections