Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Overview
Video Prediction Policy (VPP) by Hu, Guo et al. (ICML 2025 Spotlight) proposes that video diffusion models (VDMs) are not just generators of future frames but also powerful visual encoders whose internal representations capture both the current scene and its future dynamics. Rather than using a VDM to generate future frames and then planning over them, VPP extracts these predictive visual representations in a single forward pass through a fine-tuned VDM and feeds them into a Diffusion Policy head (a Diffusion Transformer) that predicts robot actions.
This reinterpretation of video diffusion models as predictive encoders rather than generators yields a simple but effective recipe: fine-tune a video foundation model on robot data, extract its internal representations, and learn a lightweight action head on top. VPP achieves an 18.6% relative improvement on the CALVIN ABC-D generalization benchmark (4.33 vs. the prior SOTA of 3.65 from RoboUniView) and a 31.6% success-rate increase on complex real-world dexterous manipulation tasks.
Key Contributions
- Video diffusion models as predictive visual encoders: Demonstrates that the internal representations of VDMs inherently encode future dynamics, making them powerful visual backbones for robot policies without requiring explicit future frame generation
- Two-stage training pipeline: Stage 1 fine-tunes a video foundation model on robot datasets + internet human manipulation data; Stage 2 trains a Diffusion Policy action head (Diffusion Transformer) conditioned on the VDM's internal representations
- Generalist robot policy: A single model handles multiple robot embodiments (Franka Panda arm, XHand dexterous hand) and diverse manipulation tasks with language conditioning
- Strong empirical results: 18.6% relative improvement on CALVIN ABC-D over the prior SOTA (RoboUniView, 3.65 → 4.33), +10.8 percentage points on MetaWorld over GR-1 (0.574 → 0.682), and a 31.6% success-rate increase on real-world dexterous manipulation
Architecture
┌──────────────────────────────────────────────────────────┐
│ Video Prediction Policy (VPP) │
│ │
│ Stage 1: Fine-tune Video Diffusion Model │
│ ────────────────────────────────────── │
│ Robot demos + Internet manipulation videos │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Video Diffusion Model (SVD) │ │
│ │ (pre-trained, then fine-tuned)│ │
│ └────────────────────────────────┘ │
│ │
│ Stage 2: Extract Representations + Action Head │
│ ────────────────────────────────────────── │
│ Current Observation Language Instruction │
│ │ │ │
│ ▼ │ │
│ ┌────────────────────────┐ │ │
│ │ Fine-tuned VDM │ │ │
│ │ (single forward pass) │◄─┘ │
│ │ │ │
│ │ Internal features ─────┼──► Predictive representations │
│ │ (multi-layer, multi-t) │ (encode future dynamics) │
│ └────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Diffusion Policy Head │ Diffusion Transformer │
│ │ (trained on demos) │ │
│ └───────────┬────────────┘ │
│ ▼ │
│ Robot Actions (arm joints + gripper) │
└──────────────────────────────────────────────────────────┘
Method
Stage 1 -- Video Foundation Model Fine-tuning: A pre-trained video diffusion model (e.g., based on Stable Video Diffusion) is fine-tuned on a mixture of robot demonstration videos and internet human manipulation videos. The fine-tuning adapts the model's learned visual dynamics to the robot manipulation domain. The key insight is that this fine-tuning process causes the model's internal representations to encode task-relevant dynamics (what will happen next) rather than just static scene features.
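To make Stage 1 concrete, here is a minimal PyTorch sketch of one fine-tuning step under standard latent-diffusion assumptions. The modules (`unet`, `vae`, `text_encoder`, noise `scheduler`) are diffusers-style placeholders, and the first-frame conditioning scheme is an assumption in the spirit of Stable Video Diffusion, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def stage1_step(unet, vae, text_encoder, scheduler, batch):
    """One video-diffusion fine-tuning step on the mixed manipulation corpus.

    batch["frames"]: (B, T, C, H, W) clip; batch["text"]: tokenized instruction.
    All modules are placeholders for a diffusers-style latent video model.
    """
    B, T = batch["frames"].shape[:2]
    with torch.no_grad():
        # Encode every frame into the VAE latent space
        lat = vae.encode(batch["frames"].flatten(0, 1)).latent_dist.sample()
        lat = lat.reshape(B, T, *lat.shape[1:])
        cond = text_encoder(batch["text"])  # language conditioning

    noise = torch.randn_like(lat)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (B,), device=lat.device)
    noisy = scheduler.add_noise(lat, noise, t)
    noisy[:, 0] = lat[:, 0]  # keep the observed current frame clean as conditioning

    # Standard epsilon-prediction loss, applied only to the future frames
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    return F.mse_loss(pred[:, 1:], noise[:, 1:])
```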
Stage 2 -- Diffusion Policy Action Head: Given a current observation, a single forward pass through the fine-tuned VDM produces internal feature representations at multiple up-sampling layers. These representations -- which encode predicted future visual states -- are aggregated and compressed by a Video Former module (learnable query tokens processed through spatial-temporal attention), then fed into a Diffusion Policy head (Diffusion Transformer) that denoises actions conditioned on the compressed representations. The action head is trained on paired (observation, action) data from robot demonstrations.
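A hedged sketch of the Stage 2 machinery follows: forward hooks collect up-block activations in one VDM pass, a Video Former compresses them through learnable queries, and a DiT-style head is trained with the usual Diffusion Policy denoising objective. The names (`VideoFormer`, `extract_predictive_features`, `action_head_loss`) are illustrative, not the paper's code, and the Video Former's spatial-temporal attention is collapsed here into generic cross-attention for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoFormer(nn.Module):
    """Compress multi-layer VDM features into a fixed token set via learnable queries."""
    def __init__(self, feat_dims, dim=512, n_queries=16, n_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.projs = nn.ModuleList(nn.Linear(d, dim) for d in feat_dims)
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, feats):  # feats: list of (B, N_i, d_i) token tensors
        kv = torch.cat([proj(f) for proj, f in zip(self.projs, feats)], dim=1)
        q = self.queries.expand(feats[0].size(0), -1, -1)
        for blk in self.blocks:  # learnable queries cross-attend to all VDM tokens
            q = blk(q, kv)
        return q  # (B, n_queries, dim) compressed predictive representation

def extract_predictive_features(unet, noisy_latents, t, cond):
    """One denoising forward pass; hooks grab every up-sampling block's output.

    Each up block is assumed to return a plain tensor here; real diffusers
    blocks may return tuples, in which case the hook would index into them.
    """
    feats = []
    hooks = [blk.register_forward_hook(lambda mod, args, out: feats.append(out))
             for blk in unet.up_blocks]
    with torch.no_grad():
        unet(noisy_latents, t, encoder_hidden_states=cond)
    for h in hooks:
        h.remove()
    # Flatten each (B, C_i, ...) activation into a (B, N_i, C_i) token sequence
    return [f.flatten(2).transpose(1, 2) for f in feats]

def action_head_loss(head, scheduler, tokens, actions):
    """Diffusion Policy objective: the DiT head learns to denoise action chunks."""
    noise = torch.randn_like(actions)  # actions: (B, chunk_len, act_dim)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (actions.size(0),), device=actions.device)
    noisy = scheduler.add_noise(actions, noise, t)
    pred = head(noisy, t, cond=tokens)  # conditioned on Video Former tokens
    return F.mse_loss(pred, noise)
```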
The critical design choice is using the VDM's internal representations rather than generated frames. Generated frames introduce compounding errors and are expensive to produce; internal representations capture the same predictive information in a compact, differentiable form suitable for policy learning.
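As a usage example of the sketch above, the inference path makes the cost argument explicit: the heavy VDM runs exactly once per control step, and only the small action head iterates. The scheduler is assumed to be diffusers-style (e.g., DDIM with `set_timesteps` and `step`); everything else reuses the hypothetical modules defined above.

```python
import torch

@torch.no_grad()
def vpp_act(unet, video_former, head, scheduler, obs_latents, text_cond,
            chunk_len=10, act_dim=7, denoise_steps=10):
    """Predict one action chunk: one heavy VDM pass, a few tiny head passes."""
    B = obs_latents.size(0)
    # The large video model runs exactly once (vs. ~25+ passes for generation)
    t_feat = torch.zeros(B, dtype=torch.long, device=obs_latents.device)  # low noise level
    feats = extract_predictive_features(unet, obs_latents, t_feat, text_cond)
    tokens = video_former(feats)

    # Only the lightweight DiT action head iterates through denoising steps
    actions = torch.randn(B, chunk_len, act_dim, device=obs_latents.device)
    scheduler.set_timesteps(denoise_steps)
    for t in scheduler.timesteps:
        eps = head(actions, t.expand(B), cond=tokens)
        actions = scheduler.step(eps, t, actions).prev_sample
    return actions  # (B, chunk_len, act_dim) chunk executed by the controller
```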

Data: Training uses a combination of robot-specific demonstration data and internet-scale human manipulation videos. The internet data provides broad visual understanding of manipulation dynamics, while robot data provides embodiment-specific action labels.
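The mixing itself can be as simple as weighted sampling over the two corpora. The sketch below uses dummy tensors and an assumed 50/50 per-corpus ratio purely for illustration; the paper's actual datasets and mixing weights differ, and internet clips, lacking action labels, contribute only to the Stage 1 video objective.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Dummy stand-ins for the two corpora (sizes and shapes illustrative only)
robot_ds = TensorDataset(torch.randn(1_000, 8, 3, 64, 64))      # action-labeled robot demos
internet_ds = TensorDataset(torch.randn(20_000, 8, 3, 64, 64))  # unlabeled internet clips

mixed = ConcatDataset([robot_ds, internet_ds])
# Give each corpus equal sampling mass so the small robot set is not drowned out
weights = ([0.5 / len(robot_ds)] * len(robot_ds)
           + [0.5 / len(internet_ds)] * len(internet_ds))
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=16, sampler=sampler)
```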

Results
CALVIN ABC-D Benchmark (Language-conditioned)
| Method | Avg. Consecutive Tasks Completed (of 5) |
|---|---|
| SuSIE | 2.69 |
| GR-1 | 3.06 |
| VidMan | 3.42 |
| RoboUniView | 3.65 |
| VPP | 4.33 (+18.6% over prior SOTA RoboUniView; +41.5% over GR-1) |
MetaWorld Multi-task
| Method | Success Rate |
|---|---|
| GR-1 | 0.574 |
| VPP | 0.682 (+10.8 percentage points over GR-1) |
Real-world Dexterous Manipulation (XHand)
VPP achieves a 31.6% success-rate increase over baselines on complex real-world manipulation tasks with an XHand dexterous hand, demonstrating that video diffusion representations transfer effectively to high-DoF control.

Limitations
- Requires fine-tuning a large video diffusion model, which is computationally expensive (Stage 1)
- Action frequency is limited by the forward pass speed of the VDM backbone
- Relies on quality of internet manipulation data for generalization; domain gap between internet videos and robot workspace may limit transfer
- Currently predicts actions via a Diffusion Policy head rather than explicit planning, which may limit long-horizon task performance
Connections
- Extends the VLA paradigm of RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (web knowledge transfer to robot control), but uses a video diffusion backbone instead of a VLM
- Builds on the generalist agent concept from A Generalist Agent (Gato) but with a stronger visual backbone
- Related to OpenVLA: An Open-Source Vision-Language-Action Model as an alternative approach to generalist robot policies
- The video prediction paradigm connects to world models like Cosmos World Foundation Model Platform for Physical AI, which also use generative models for physical AI
- The two-stage training connects to Helix: A VLA for Generalist Humanoid Control, which also separates high-level understanding from low-level control
- Self-improvement connections to Self-Improving Embodied Foundation Models, which addresses robot policy improvement