
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

📄 Read on arXiv

Overview

GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic control. The central insight is that web-scale video data (38 million clips) encodes rich physical priors about how the world works -- object interactions, spatial reasoning, dynamics -- and that these priors transfer directly to robot manipulation when the model is fine-tuned on robot-specific data. This addresses a fundamental bottleneck in robotics: the scarcity of robot demonstration data relative to the vast visual and physical knowledge available in internet video.

The architecture is a GPT-style autoregressive transformer that processes a unified token sequence of text, video frames, and robot actions. Training proceeds in two stages: (1) large-scale video-language pretraining on 38M video clips to learn a generative world model capable of predicting future video frames conditioned on text, and (2) robot-specific fine-tuning that integrates a conditional Variational Autoencoder (cVAE) for action generation. The video generation capability serves dual purposes: it provides interpretable action visualization (the model generates prediction videos aligned with real executions) and it encodes the physical priors that make downstream manipulation robust.

GR-2 achieves a 97.7% success rate across more than 100 tabletop manipulation tasks (with 400 demonstrations per task), 73.9% success with only 50 demonstrations per task, and 79.0% on industrial bin-picking (vs. GR-1's 33.3%). On the CALVIN benchmark, it reaches 98.6% single-task success with an average task completion length of 4.64. Notably, success rates scale monotonically with model size across four evaluated sizes (30M–719M trainable parameters), providing evidence for scaling laws in robotic VLA models.

Key Contributions

  • Web-scale video pretraining for robotics: Demonstrates that pretraining on 38M internet video clips provides transferable physical priors that dramatically improve robot manipulation, bridging the data gap between web-scale vision and scarce robot demonstrations
  • Unified video-language-action architecture: A single GPT-style transformer processes text, video, and action tokens in a shared sequence, enabling joint video prediction and action generation without modality-specific architectural changes
  • Conditional VAE for action generation: Integrates a cVAE module during robot fine-tuning to handle the multimodal nature of action distributions, improving over deterministic action prediction
  • Scaling evidence for robotic VLA: Shows clear performance scaling across four model sizes (30M–719M trainable parameters), providing some of the earliest scaling-law evidence for embodied foundation models
  • Interpretable video predictions: The model generates high-quality prediction videos that align with real-world executions, providing a built-in interpretability mechanism for robot behavior

Architecture / Method

GR-2: Video-Language-Action Model

Stage 1: Video-Language Pretraining (38M clips)

  ┌─────────────┐   ┌──────────────┐   ┌─────────────────┐
  │ Text        │   │ Video frames │   │ GPT-style       │
  │ description ├──►│ (encoded)    ├──►│ autoregressive  │
  │ tokens      │   │              │   │ transformer     │
  └─────────────┘   └──────────────┘   └────────┬────────┘
                                                │
                                     Objective: predict
                                     future video frames

Stage 2: Robot Fine-tuning

  Unified token sequence:  [text] [video_t] [video_t+1] ... [action]

  ┌───────────────────────────────┐
  │ Pretrained transformer        │
  │ (world-model backbone)        │
  └───────────────┬───────────────┘
                  │
  ┌───────────────▼───────────────┐
  │ Conditional VAE (cVAE)        │
  │  ┌─────────┐    ┌──────────┐  │
  │  │ Encoder ├───►│ Latent z │  │
  │  │ (train) │    └────┬─────┘  │
  │  └─────────┘         │        │
  │                ┌─────▼─────┐  │
  │                │  Decoder  ├──┼──► Actions
  │                └───────────┘  │
  └───────────────────────────────┘

Scaling: 30M ──► 95M ──► 312M ──► 719M (trainable params)

GR-2 architecture overview

GR-2 uses a GPT-style autoregressive transformer as its backbone. The model operates over a unified token sequence that interleaves text tokens, visual tokens (from video frames encoded via a vision encoder), and robot action tokens.
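For intuition, here is a minimal sketch of how such an interleaved sequence could be assembled in PyTorch. The helper names (`text_embed`, `modality_embed`) and the learned per-modality embedding are illustrative assumptions, not details taken from the paper:

```python
import torch

def build_token_sequence(text_ids, frame_embeds, action_embeds,
                         text_embed, modality_embed):
    """Assemble one interleaved [text | frame_t | frame_t+1 | ... | action] sequence.

    text_ids:      (T_text,)              integer ids for the language instruction
    frame_embeds:  (T_frames, N_patch, D) visual tokens from a vision encoder
    action_embeds: (T_act, D)             embedded robot action tokens
    """
    text_tokens = text_embed(text_ids)            # (T_text, D)
    visual_tokens = frame_embeds.flatten(0, 1)    # (T_frames * N_patch, D)

    # Concatenate all modalities into the single sequence the transformer attends over.
    seq = torch.cat([text_tokens, visual_tokens, action_embeds], dim=0)

    # Add a learned per-modality embedding so the model can tell modalities apart
    # (an assumed detail; the paper describes a unified sequence but not this part).
    mods = torch.cat([
        torch.zeros(text_tokens.shape[0], dtype=torch.long, device=seq.device),
        torch.ones(visual_tokens.shape[0], dtype=torch.long, device=seq.device),
        torch.full((action_embeds.shape[0],), 2, dtype=torch.long, device=seq.device),
    ])
    return seq + modality_embed(mods)             # (T_total, D)
```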

Stage 1: Video-Language Pretraining. The model is pretrained on 38 million video clips to predict future video frames conditioned on text descriptions. This stage learns a generative world model that captures physical dynamics, object permanence, spatial relationships, and cause-effect patterns from web-scale data. The video generation objective forces the model to learn structured internal representations of how the physical world evolves.
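The pretraining objective can be sketched as next-frame prediction over the video tokens. The snippet below assumes frames are encoded as discrete visual tokens and that the model returns per-position logits over that vocabulary; the paper's exact frame representation and loss may differ:

```python
import torch.nn.functional as F

def pretrain_step(model, optimizer, text_tokens, frame_tokens):
    """One pretraining step: predict frame t+1 tokens from text + frames up to t.

    text_tokens:  (B, T_text)
    frame_tokens: (B, T_frames, N_patch)  discrete visual tokens per frame (assumed)
    """
    B, T_frames, N = frame_tokens.shape
    context = frame_tokens[:, :-1].reshape(B, -1)   # frames 0 .. T-2
    target = frame_tokens[:, 1:].reshape(B, -1)     # frames 1 .. T-1 (prediction targets)

    logits = model(text_tokens, context)            # (B, (T_frames-1)*N_patch, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```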

Stage 2: Robot Fine-tuning with cVAE. The pretrained model is fine-tuned on robot demonstration data. A conditional Variational Autoencoder (cVAE) is integrated to model the multimodal distribution of possible actions given an observation. The cVAE encodes demonstration actions into a latent space during training and samples from this space during inference, allowing the model to capture the inherent ambiguity in manipulation tasks (e.g., multiple valid grasp poses for the same object). The robot's proprioceptive state and action tokens are injected into the transformer's token sequence alongside the visual and language tokens.
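A minimal cVAE action head is sketched below for intuition. The 7-DoF action dimension, hidden sizes, standard-normal prior, and KL weight are illustrative assumptions rather than the paper's configuration; the condition `h` stands for the transformer's hidden state at the action position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionCVAE(nn.Module):
    """Sketch of a conditional-VAE action head (illustrative sizes, not the paper's)."""

    def __init__(self, action_dim=7, cond_dim=512, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                 # q(z | action, condition)
            nn.Linear(action_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),        # outputs (mu, logvar)
        )
        self.decoder = nn.Sequential(                 # p(action | z, condition)
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, action, h):
        """Training: encode the demonstrated action, decode it back, return the loss."""
        mu, logvar = self.encoder(torch.cat([action, h], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(torch.cat([z, h], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, F.mse_loss(recon, action) + 1e-3 * kl        # KL weight is assumed

    @torch.no_grad()
    def act(self, h):
        """Inference: sample z from the prior and decode an action."""
        z = torch.randn(h.shape[0], self.latent_dim, device=h.device)
        return self.decoder(torch.cat([z, h], dim=-1))
```

During fine-tuning the encoder only shapes the latent space from demonstrated actions; at test time `act` samples the latent from the prior, which is what lets the head represent several valid actions (e.g., grasp poses) for the same observation.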

GR-2 qualitative results

The video prediction capability is preserved during fine-tuning: GR-2 can generate prediction videos showing the expected outcome of planned actions, providing a natural interpretability mechanism. These predicted videos align well with actual robot executions.
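A single control step could then look roughly like the sketch below. `generate_frames` and `hidden_state` are hypothetical interfaces standing in for the transformer's frame sampling and its readout at the action position; the actual API will differ:

```python
import torch

@torch.no_grad()
def control_step(model, action_head, text_tokens, frame_history, horizon=4):
    """Generate a short prediction video plus one action for the current observation."""
    # 1. Autoregressively sample future frame tokens conditioned on text + observed frames.
    predicted_frames = model.generate_frames(text_tokens, frame_history,
                                             num_frames=horizon)

    # 2. Decode an action from the hidden state at the action position,
    #    sampling the cVAE latent from its prior (see ActionCVAE.act above).
    h = model.hidden_state(text_tokens, frame_history)
    action = action_head.act(h)

    # The predicted frames can be decoded back to pixels and compared with the
    # robot's actual execution -- the interpretability mechanism described above.
    return predicted_frames, action
```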

Results

Multi-task success rates

Setting                                 Metric             GR-2     GR-1     Notes
Multi-task tabletop (400 demos/task)    Success rate       97.7%    --       100+ tasks
Multi-task tabletop (50 demos/task)     Success rate       73.9%    --       Low-data regime
Industrial bin-picking                  Success rate       79.0%    33.3%    2.4x improvement
CALVIN (single-task)                    Success rate       98.6%    --       Standard benchmark
CALVIN (chained)                        Avg. task length   4.64     --       Long-horizon

Scaling behavior:

Model size (trainable)     Success rate (relative trend)
30M  (GR-2-S)              lowest
95M  (GR-2-B)
312M (GR-2-L)
719M (GR-2-XL)             highest

The paper evaluates four model sizes (30M, 95M, 312M, 719M trainable parameters) and shows monotonically increasing success rate with model size (reported as a figure, not exact numbers per size). This suggests continued scaling may yield further gains, aligning with scaling trends observed in HPT and the broader hypothesis that the language model scaling paradigm transfers to robotics.

Scaling and bin-picking results

Limitations & Open Questions

  • Compute cost of video pretraining: Pretraining on 38M video clips requires substantial compute, and the paper does not provide detailed ablations on how much video data is truly necessary for the transfer benefit
  • Action space generality: The cVAE action generation is demonstrated primarily on tabletop manipulation and bin-picking; generalization to higher-DoF platforms (humanoids, mobile manipulation) remains unvalidated
  • Real-time inference: The paper does not report inference latency; large GPT-style models with video generation may struggle with the real-time requirements of dynamic manipulation
  • Video prediction fidelity: While prediction videos align with executions qualitatively, it is unclear how prediction errors compound over longer horizons
  • Comparison breadth: Direct comparisons are primarily against GR-1; more extensive benchmarking against RT-2, OpenVLA, or pi0 would better contextualize the results

Connections

Related papers in the wiki: