RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation

Overview

RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation -- using two robot arms simultaneously -- is significantly more challenging than single-arm manipulation due to the high-dimensional action space (14+ DoF), the need for coordination between arms, and the scarcity of bimanual training data. RDT-1B addresses these challenges by adapting the Diffusion Transformer (DiT) architecture from image generation to robot action generation, with specific innovations for bimanual coordination.

The model is pre-trained on a large multi-robot dataset and fine-tuned for bimanual tasks, demonstrating that the diffusion transformer scaling paradigm transfers from image generation to robotic manipulation. RDT-1B achieves state-of-the-art results on bimanual benchmarks, including tasks requiring tight two-arm coordination like folding cloth, pouring between containers, and bimanual pick-and-place.

Key Contributions

Largest diffusion transformer for manipulation: 1.2B parameters, the first diffusion-based foundation model explicitly designed for bimanual manipulation, demonstrating that scaling improves coordination performance
Physically interpretable unified action space: A 128-dimensional unified action space maps heterogeneous robot actions while preserving physical meaning, enabling cross-robot transfer learning
Architectural adaptations for robotics: QKNorm, RMSNorm, non-linear MLP decoder, and Alternating Condition Injection (ACI) for balanced vision-language conditioning and stable scaling
Pre-training + fine-tuning paradigm: Pre-trained on 46 datasets comprising 1M+ trajectories and 21TB of data, then fine-tuned on 6,000+ bimanual task trajectories across 300+ tasks

Architecture

┌─────────────┐  ┌───────────────┐  ┌──────────────────┐
│ Multi-View  │  │   Language    │  │  Proprioceptive  │
│ Camera Imgs │  │  Instruction  │  │  State (L+R arm) │
│ (3 views)   │  │  (variable)   │  │  MLP + Fourier   │
└──────┬──────┘  └───────┬───────┘  └────────┬─────────┘
       │                 │                   │
       ▼                 ▼                   │
┌──────────────┐ ┌───────────────┐           │
│ Frozen Vision│ │ Frozen T5-XXL │           │
│ Encoder      │ │  Text Encoder │           │
│  (SigLIP)    │ │  + attn mask  │           │
└──────┬───────┘ └───────┬───────┘           │
       │                 │                   │
       │  Alternating Condition Injection    │
       └────────┐ ┌──────┘                   │
                ▼ ▼                          │
┌───────────────────────────────────────┐    │
│     Diffusion Transformer (1.2B)     │    │
│  ┌─────────────────────────────────┐  │    │
│  │ Noised Action Tokens            │◄─┼────┘
│  │ 128-dim unified action space    │  │
│  ├─────────────────────────────────┤  │
│  │ Self-Attention (QKNorm)         │  │
│  │ Cross-Attention (ACI)           │  │
│  │ RMSNorm + MLP decoder           │  │
│  └─────────────────────────────────┘  │
│          × N DiT blocks              │
└───────────────────┬───────────────────┘
                    │
                    ▼  DPM-Solver++ (5 steps)
          ┌─────────────────┐
          │  Action Chunk   │
          │  @ 6 Hz freq    │
          │  Left + Right   │
          └─────────────────┘

Method

RDT-1B adapts the Diffusion Transformer (DiT) architecture for robot action generation with robotics-specific modifications:

Input Representation: The model takes as input: - Three camera views (encoded via a frozen SigLIP vision encoder with multi-dimensional positional embeddings) - Language instruction (encoded via a frozen T5-XXL encoder with attention masks for variable-length instructions) - Proprioceptive state of both arms (encoded via MLP with Fourier features for high-frequency dynamics) - Noised action sequence (the diffusion input)
Diffusion Transformer (DiT) Backbone: The core architecture is a 1.2B parameter transformer processing noised action tokens over a 128-dimensional unified action space (which maps heterogeneous robot actions while preserving physical meaning): - Action tokenization: Actions are represented in the 128-dim unified action space, enabling cross-robot pre-training - Conditioning via ACI: Vision and language features are balanced through Alternating Condition Injection (ACI), which alternates the primary conditioning modality across layers - Normalization: QKNorm and RMSNorm replace standard LayerNorm for training stability at scale - Non-linear MLP decoder: A non-linear MLP output head approximates the nonlinearity in robot action distributions, replacing the standard linear projection
Training: Pre-trained on 46 datasets comprising over 1 million trajectories and 21TB of data for one month on 48 H100 GPUs. Fine-tuned on 6,000+ bimanual task trajectories (300+ tasks, 100+ objects, 15+ scenes) for three days on the same hardware. Data augmentation includes color jittering, image corruption, Gaussian noise, and instruction augmentation.
Inference: DPM-Solver++ reduces diffusion steps from 100 to 5, achieving 6 Hz action chunk frequency (with 381 Hz average action predictions from those chunks). Action chunking provides temporal consistency and enables coordinated bimanual motions.

Scaling: Ablations compare a 166M parameter variant to the full 1.2B model. The larger model substantially outperforms the smaller one, with scale particularly benefiting coordination-heavy tasks.

Results

RDT-1B achieves a 56% average improvement in success rates over state-of-the-art baselines (ACT, OpenVLA, and Octo) across diverse bimanual tasks on real ALOHA dual-arm robots. Key findings:

Bimanual SOTA: Highest success rates across all evaluated bimanual task categories vs. ACT, OpenVLA, and Octo
Zero-shot generalization: Maintains high success rates on unseen objects and scenes (e.g., novel cups, unfamiliar rooms) and follows novel language instructions (e.g., "pour water one-third full")
Few-shot learning: Learns complex new skills from 1-5 demonstrations, substantially outperforming baselines on tasks like "Handover" (5-shot) and "Fold Shorts" (1-shot)
Diffusion is critical: Ablating diffusion in favor of regression drops instruction-following success from 100% to 12.5%
Scale matters: The 1.2B model substantially outperforms the 166M variant, with scale particularly benefiting coordination-heavy tasks
Pre-training is crucial: Training from scratch without multi-robot pre-training severely degrades generalization to unseen scenarios

Limitations & Open Questions

1.2B parameters requires substantial GPU memory for inference; real-time deployment on resource-constrained robots is challenging
Bimanual training data is much scarcer than single-arm data; the model's performance ceiling may be limited by data availability rather than model capacity
DPM-Solver++ reduces denoising to 5 steps enabling 6 Hz chunk frequency, but this still limits responsiveness for highly dynamic bimanual tasks
The model does not explicitly reason about contact forces or tactile feedback, which are important for tight-tolerance bimanual assembly tasks
Evaluation is on real ALOHA dual-arm robots but still in controlled lab tasks; robustness to broader real-world variability remains untested

Connections

Robotics -- bimanual manipulation and foundation models
Foundation Models -- scaling diffusion transformers for robotics
Openvla An Open Source Vision Language Action Model -- complementary VLA approach (autoregressive vs. diffusion)
Denoising Diffusion Probabilistic Models -- foundational DDPM framework
Rt 1 Robotics Transformer For Real World Control At Scale -- RT-1 transformer for manipulation
An Image Is Worth 16X16 Words Transformers For Image Recognition At Scale -- ViT/DiT architecture lineage