Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Overview

Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Prior diffusion-based VLA methods condition denoising on fused embeddings via shallow MLP networks, limiting their capacity for fine-grained alignment between actions and observations. Dita instead uses "in-context conditioning" -- language tokens, visual features, and timestep embeddings are concatenated before the noisy action sequence within a causal Transformer, enabling the full attention mechanism to align denoised actions with raw visual tokens from historical observations.
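To make the conditioning scheme concrete, here is a minimal PyTorch-style sketch (module names, layer counts, and dimensions are illustrative assumptions, not the released Dita code) that concatenates language, visual, and timestep tokens ahead of the noisy action chunk and runs them through a single causal Transformer:

```python
import torch
import torch.nn as nn

class DitaStyleDenoiser(nn.Module):
    """Illustrative sketch of in-context diffusion conditioning.

    Language, visual, and timestep tokens are concatenated *before* the noisy
    action tokens, so the causal Transformer lets action tokens attend back to
    all conditioning tokens without a separate fusion MLP.
    """

    def __init__(self, d_model=768, action_dim=7, horizon=16, num_layers=12):
        super().__init__()
        self.action_in = nn.Linear(action_dim, d_model)   # embed noisy actions
        self.timestep_emb = nn.Embedding(1000, d_model)   # diffusion timestep t
        self.backbone = nn.TransformerEncoder(            # stand-in for the LLaMA-style backbone
            nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True),
            num_layers=num_layers,
        )
        self.action_out = nn.Linear(d_model, action_dim)  # per-token noise prediction
        self.horizon = horizon

    def forward(self, lang_tokens, vis_tokens, noisy_actions, t):
        # lang_tokens:   (B, L, D) frozen CLIP text features
        # vis_tokens:    (B, V, D) DINOv2 + Q-Former features from historical observations
        # noisy_actions: (B, H, 7) noised continuous action chunk
        # t:             (B,)      diffusion timestep indices
        act = self.action_in(noisy_actions)
        ts = self.timestep_emb(t).unsqueeze(1)            # (B, 1, D)
        seq = torch.cat([lang_tokens, vis_tokens, ts, act], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        hidden = self.backbone(seq, mask=mask)            # causal attention across all modalities
        return self.action_out(hidden[:, -self.horizon:]) # noise prediction for action tokens
```

Because the conditioning tokens share the same attention context as the action tokens, cross-modal alignment happens inside the backbone itself rather than in a shallow conditioning network.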

This architectural choice makes Dita inherently scalable: by scaling the diffusion action denoiser alongside the Transformer, the model effectively integrates cross-embodiment datasets spanning diverse camera perspectives, observation scenes, tasks, and action spaces. At only 334M parameters (221M trainable), Dita achieves state-of-the-art or competitive performance across extensive simulation benchmarks and demonstrates robust real-world adaptation to environmental variances and complex long-horizon tasks through just 10-shot finetuning with only third-person camera inputs.

Key Contributions

  • In-context diffusion conditioning: Full Transformer processes language, visual, and action tokens in a unified sequence, replacing shallow conditioning networks with rich cross-modal attention
  • Scalable cross-embodiment architecture: DiT-based design naturally scales to diverse robot configurations, camera perspectives, and action spaces present in large datasets
  • 10-shot real-world adaptation: Achieves robust real-world manipulation from just 10 demonstrations per task via fine-tuning, using only third-person camera inputs
  • Lightweight open-source baseline: 334M parameters providing an accessible foundation for generalist robot policy research
  • Continuous action generation: Directly denoises continuous action sequences, avoiding discretization artifacts of token-based VLA approaches

Architecture / Method

┌──────────────────────────────────────────────────────────────┐
│                    Dita (334M params)                         │
│                                                              │
│  ┌────────────┐  ┌────────────┐  ┌──────────────────────┐   │
│  │ CLIP Text  │  │  DINOv2    │  │ Diffusion Timestep   │   │
│  │ Encoder    │  │ + Q-Former │  │ Embedding            │   │
│  │ (frozen)   │  │ (trained)  │  │                      │   │
│  └─────┬──────┘  └─────┬──────┘  └──────────┬───────────┘   │
│        │               │                     │               │
│        ▼               ▼                     ▼               │
│  ┌─────────────────────────────────────────────────────┐     │
│  │          In-Context Token Sequence                  │     │
│  │  [lang tokens | image features | timestep | noisy  │     │
│  │                                            actions] │     │
│  └──────────────────────┬──────────────────────────────┘     │
│                         │                                    │
│                         ▼                                    │
│  ┌─────────────────────────────────────────────────────┐     │
│  │         Causal Transformer (LLaMA-style)            │     │
│  │  ┌─────────────────────────────────────────────┐    │     │
│  │  │  Full self-attention across all tokens       │    │     │
│  │  │  (language ◄─► vision ◄─► actions)           │    │     │
│  │  │  Enables fine-grained cross-modal alignment  │    │     │
│  │  └─────────────────────────────────────────────┘    │     │
│  └──────────────────────┬──────────────────────────────┘     │
│                         │                                    │
│                         ▼                                    │
│  ┌─────────────────────────────────────────────────────┐     │
│  │  Denoised Action Sequence (7D continuous)           │     │
│  │  [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]       │     │
│  └─────────────────────────────────────────────────────┘     │
│                                                              │
│  DDIM Sampling: 20 denoising steps at inference (T_eval=20)  │
└──────────────────────────────────────────────────────────────┘

Dita framework overview

Dita architecture diagram

Input Processing:

  • Language: task instructions encoded with a frozen CLIP text encoder
  • Vision: DINOv2 feature extraction (jointly optimized during training), with a Q-Former for computational efficiency and feature selection
  • Actions: 7D continuous vectors (3D translation, 3D rotation, 1D gripper position)
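The Q-Former step can be pictured as a small set of learned query tokens that cross-attend into the dense DINOv2 patch grid, compressing each frame into a short visual token sequence. A minimal sketch (query count and dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Toy Q-Former: learned queries cross-attend into dense DINOv2 patch
    features, compressing them into a short visual token sequence.
    Query count and dimensions are illustrative, not the paper's exact values."""

    def __init__(self, d_model=768, num_queries=32, nhead=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patch_feats):
        # patch_feats: (B, P, D) dense DINOv2 patch tokens, e.g. P = 256 per frame
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.norm(compressed)                       # (B, num_queries, D)
```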

Core Design:

  • Causal Transformer (LLaMA-style architecture, trained from scratch)
  • In-context conditioning: language tokens, image features, and diffusion timestep embeddings are concatenated and placed before the noisy action sequence in the input
  • Standard denoising diffusion objective with MSE loss
  • DDIM sampling with T_eval = 20 denoising steps at inference (versus the 1000-step DDPM training schedule)
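As a rough illustration of that objective, the sketch below noises a clean action chunk at a random timestep and regresses the injected noise with MSE, assuming the standard epsilon-prediction parameterization and a precomputed DDPM noise schedule (helper names are hypothetical):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, lang, vis, actions, alphas_cumprod):
    """One denoising-diffusion training step on a clean action chunk.

    model:          a denoiser like the sketch above, predicting noise from
                    (lang, vis, noisy_actions, t)
    actions:        (B, H, 7) ground-truth continuous action chunk
    alphas_cumprod: (T,) cumulative products of the DDPM noise schedule
    """
    B = actions.size(0)
    T = alphas_cumprod.size(0)                             # e.g. 1000 DDPM steps
    t = torch.randint(0, T, (B,), device=actions.device)   # random timestep per sample
    noise = torch.randn_like(actions)

    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noisy_actions = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise

    pred_noise = model(lang, vis, noisy_actions, t)
    return F.mse_loss(pred_noise, noise)                   # standard MSE objective
```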

Training: 100,000 steps with a batch size of 8,192 across 32 A100 GPUs. Because denoised actions attend directly to raw visual tokens from historical observations, the model captures fine-grained action deltas and environmental nuances.
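At inference, the 1000-step training schedule is subsampled to 20 DDIM steps. A simplified, deterministic (eta = 0) sampling loop in the same spirit (the subsampling rule and helper names are assumptions, not the released implementation):

```python
import torch

@torch.no_grad()
def ddim_sample_actions(model, lang, vis, alphas_cumprod, horizon=16,
                        action_dim=7, num_steps=20):
    """Deterministic DDIM sampling (eta = 0) over a 1000-step training schedule,
    subsampled to `num_steps` evaluation steps (T_eval = 20 in the paper)."""
    device = alphas_cumprod.device
    B = lang.size(0)
    T = alphas_cumprod.size(0)

    # Evenly spaced timesteps from T-1 down to 0.
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()

    x = torch.randn(B, horizon, action_dim, device=device)     # start from pure noise
    for i, t in enumerate(timesteps):
        a_bar_t = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps \
            else torch.tensor(1.0, device=device)

        eps = model(lang, vis, x, t.repeat(B))                  # predict noise
        x0 = (x - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()  # implied clean actions
        x = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps

    return x                                                    # denoised 7D action chunk
```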

Results

Real robot experiment results

Benchmark               Dita    OpenVLA   Improvement
SimplerEnv (coke can)   83.7%   16.3%     +67.4%
LIBERO (overall)        82.4%   76.5%     +5.9%
LIBERO-LONG             63.8%   ~54%      +10%
ManiSkill2              65.8%   -         -
CALVIN (avg. length)    3.61    -         -
  • Real-world (Franka Panda, 10-shot): 63.8% average success rate on complex two-step manipulation tasks
  • Robust adaptation across pick-and-place, pouring, stacking, and rotation tasks
  • Superior performance on long-horizon multi-step sequences
  • Faster convergence and lower final training loss compared to diffusion action head baselines

Limitations

  • 334M parameter model is small by VLA standards; scaling behavior at 1B+ parameters is not explored
  • 10-shot adaptation, while impressive, is evaluated on relatively simple two-step tasks; complex multi-stage tasks may need more data
  • Relies on frozen CLIP for language and jointly-trained DINOv2 for vision; end-to-end training of all components could improve alignment
  • Real-world evaluation limited to Franka Panda with third-person camera; generalization to diverse embodiments and viewpoints in the real world is not validated

Connections