Overview
HPT (Heterogeneous Pre-trained Transformers) tackles the fundamental challenge of building generalist robot representations that work across heterogeneous embodiments with different sensor configurations, action spaces, and morphologies. The key insight is a modular stem-trunk-head architecture: embodiment-specific stems tokenize diverse proprioceptive and visual inputs into a shared format, a large shared transformer trunk learns cross-embodiment representations, and task-specific heads decode actions. Trained on 52 datasets spanning simulation and real robots, HPT demonstrates clear scaling laws for robotics -- performance improves predictably with data quantity, data diversity, model size (up to 1B+ parameters), and compute. This is one of the first works to establish that the scaling paradigm from language models transfers to robotic control.
Key Contributions
- Stem-trunk-head architecture: Modular design enabling a single shared transformer to process inputs from heterogeneous embodiments with different sensors and action spaces
- Cross-modal fusion via cross-attention: Proprioceptive tokens serve as queries attending to visual tokens as keys/values, enabling effective sensor fusion without modality-specific architectures (see the sketch after this list)
- Scaling laws for robotics: Demonstrates predictable performance improvements across four axes -- data quantity, data diversity, model size, and compute -- analogous to LLM scaling laws
- Joint modality training superiority: Combined proprioceptive + visual training outperforms either modality alone or sequential modality addition
- Multi-source data benefit: Incorporating simulation and human video data alongside real robot data improves performance over single-source training
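To make the cross-modal fusion bullet concrete, here is a minimal sketch of proprio-as-query cross-attention, assuming PyTorch and illustrative dimensions; the module name, token counts, and the use of `nn.MultiheadAttention` are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ProprioVisualFusion(nn.Module):
    """Illustrative cross-attention block: proprioceptive tokens query visual tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prop_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # prop_tokens: (B, P, dim) proprioceptive queries
        # vis_tokens:  (B, V, dim) visual keys/values
        fused, _ = self.attn(query=prop_tokens, key=vis_tokens, value=vis_tokens)
        return self.norm(prop_tokens + fused)  # residual + norm, standard transformer-style

# 16 proprio tokens attend over 196 visual patch tokens; output keeps the query shape.
fusion = ProprioVisualFusion()
out = fusion(torch.randn(2, 16, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```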
Architecture / Method
```
┌──────────────────────────────────────────────────────────────┐
│                       HPT ARCHITECTURE                        │
│                                                                │
│   Embodiment A          Embodiment B          Embodiment C     │
│   ┌──────────┐         ┌──────────┐         ┌──────────┐      │
│   │ Cam + Jts│         │ 2xCam+Jts│         │ Cam + Jts│      │
│   └────┬─────┘         └────┬─────┘         └────┬─────┘      │
│        │                    │                    │             │
│        ▼                    ▼                    ▼             │
│   ┌──────────┐         ┌──────────┐         ┌──────────┐      │
│   │  Stem A  │         │  Stem B  │         │  Stem C  │      │
│   │┌────────┐│         │┌────────┐│         │┌────────┐│      │
│   ││Vis Tok.││         ││Vis Tok.││         ││Vis Tok.││      │
│   │└────────┘│         │└────────┘│         │└────────┘│      │
│   │┌────────┐│         │┌────────┐│         │┌────────┐│      │
│   ││Prop Tok││         ││Prop Tok││         ││Prop Tok││      │
│   │└────────┘│         │└────────┘│         │└────────┘│      │
│   └────┬─────┘         └────┬─────┘         └────┬─────┘      │
│        │  Shared token format                    │             │
│        └────────────────────┼────────────────────┘             │
│                             ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                 Shared Transformer Trunk                 │  │
│  │  ┌──────────────────────────────────────────────────┐    │  │
│  │  │  Cross-Attention: Prop Q ──► Visual K,V          │    │  │
│  │  │  Self-Attention across all tokens                │    │  │
│  │  │  (pre-trained on 52 datasets, up to 1B+ params)  │    │  │
│  │  └──────────────────────────────────────────────────┘    │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │                                  │
│        ┌────────────────────┼────────────────────┐             │
│        ▼                    ▼                    ▼             │
│   ┌──────────┐         ┌──────────┐         ┌──────────┐      │
│   │  Head A  │         │  Head B  │         │  Head C  │      │
│   │ (actions)│         │ (actions)│         │ (actions)│      │
│   └──────────┘         └──────────┘         └──────────┘      │
└──────────────────────────────────────────────────────────────┘
```
HPT uses a three-component architecture:
Stems (embodiment-specific):
- Proprioceptive Tokenizer: processes joint positions and velocities into fixed-length token sequences
- Visual Tokenizer: handles multi-viewpoint camera inputs via vision encoders
- Each stem maps heterogeneous inputs to a consistent token format for the shared trunk
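As a rough illustration of the stem idea (not the paper's exact design), the sketch below maps an embodiment's proprioceptive vector and pre-extracted visual features into a shared `(B, tokens, dim)` format; the projection layers, token budget, and feature dimensions are assumptions:

```python
import torch
import torch.nn as nn

class EmbodimentStem(nn.Module):
    """Illustrative stem: one embodiment's raw inputs -> trunk-compatible tokens."""

    def __init__(self, proprio_dim: int, vis_feat_dim: int, dim: int = 256, num_prop_tokens: int = 16):
        super().__init__()
        # Proprioceptive tokenizer: joint positions/velocities -> fixed number of tokens.
        self.prop_proj = nn.Linear(proprio_dim, dim * num_prop_tokens)
        # Visual tokenizer: per-patch features from any camera setup -> shared width.
        self.vis_proj = nn.Linear(vis_feat_dim, dim)
        self.num_prop_tokens, self.dim = num_prop_tokens, dim

    def forward(self, proprio: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # proprio:   (B, proprio_dim), e.g. joint positions + velocities
        # vis_feats: (B, n_views * n_patches, vis_feat_dim), from an upstream image encoder
        prop_tokens = self.prop_proj(proprio).view(-1, self.num_prop_tokens, self.dim)
        vis_tokens = self.vis_proj(vis_feats)
        return torch.cat([prop_tokens, vis_tokens], dim=1)  # shared (B, T, dim) token format

# Different robots, different input sizes, same output format for the trunk.
stem_a = EmbodimentStem(proprio_dim=14, vis_feat_dim=512)   # e.g. 7-DoF arm, one camera
stem_b = EmbodimentStem(proprio_dim=26, vis_feat_dim=768)   # e.g. bimanual setup, two cameras
```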
Trunk (shared transformer):
- Large transformer network shared across all embodiments and tasks
- Cross-attention mechanism fuses proprioceptive queries with visual keys/values
- Pre-trained on 52 datasets spanning simulation (Fleet-Tools, Metaworld, Robomimic) and real robots
- Scales up to 1B+ parameters
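Continuing the stem sketch above, here is a hedged picture of how the shared trunk sits between per-embodiment stems and the per-embodiment heads described next; the `nn.TransformerEncoder` trunk, mean-pool readout, and linear action heads are stand-ins for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class HPTStyleModel(nn.Module):
    """Illustrative stem-trunk-head composition around one shared transformer trunk."""

    def __init__(self, stems: dict, action_dims: dict, dim: int = 256, depth: int = 4):
        super().__init__()
        self.stems = nn.ModuleDict(stems)                             # embodiment-specific tokenizers
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)   # shared across embodiments
        self.heads = nn.ModuleDict({                                  # embodiment-specific action decoders
            name: nn.Linear(dim, a_dim) for name, a_dim in action_dims.items()
        })

    def forward(self, embodiment: str, proprio: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        tokens = self.stems[embodiment](proprio, vis_feats)   # heterogeneous inputs -> shared format
        latent = self.trunk(tokens)                           # shared cross-embodiment representation
        return self.heads[embodiment](latent.mean(dim=1))     # pool, then decode actions

model = HPTStyleModel(
    stems={"arm_a": EmbodimentStem(14, 512), "arm_b": EmbodimentStem(26, 768)},
    action_dims={"arm_a": 7, "arm_b": 13},
)
actions = model("arm_a", torch.randn(2, 14), torch.randn(2, 196, 512))  # shape (2, 7)
```

The mean-pool plus linear readout here loosely corresponds to the simplest "pooling" head variant; a diffusion or transformer head would replace that last step.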
Heads (task-specific):
- Three variants tested: pooling layers, diffusion models, and transformer layers
- Decode shared representations into embodiment-specific actions
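A sketch of what one heterogeneous pre-training step could look like under the same assumptions, reusing the `HPTStyleModel` sketch above: each embodiment's batch routes through its own stem and head, while the shared trunk accumulates gradients from all of them. The MSE behavior-cloning loss, batch shapes, and optimizer settings are illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative mini-batches (proprio, visual features, target actions) from two embodiments.
batches = {
    "arm_a": (torch.randn(8, 14), torch.randn(8, 196, 512), torch.randn(8, 7)),
    "arm_b": (torch.randn(8, 26), torch.randn(8, 196, 768), torch.randn(8, 13)),
}

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # model from the sketch above

optimizer.zero_grad()
total_loss = torch.zeros(())
for name, (proprio, vis_feats, target_actions) in batches.items():
    pred = model(name, proprio, vis_feats)                   # per-embodiment stem/head routing
    total_loss = total_loss + F.mse_loss(pred, target_actions)
total_loss.backward()   # the shared trunk receives gradients from every embodiment
optimizer.step()
```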
Results
| Setting | HPT vs. Scratch | Details |
|---|---|---|
| Simulation (Fleet-Tools, Metaworld, Robomimic) | +10-30% success rate | Consistent gains across all benchmarks |
| Real robot (4 manipulation tasks) | over +20% success rate | Sweep, water-filling, food-scooping, switch insertion |
| Data scaling | Monotonic improvement | More data always helps; diversity matters more than quantity |
| Model scaling | Monotonic improvement | Larger models consistently outperform smaller ones (up to 1B+) |
| Compute scaling | Monotonic improvement | More compute leads to better final performance |
Key ablation: joint proprioceptive + visual training outperforms either modality alone, and removing either modality degrades performance. Simulation and human video data provide complementary benefits to real robot data.
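For reference on the scaling-law analogy (see Connections), the LLM scaling literature fits power laws of the form below, where $L$ is a loss or error proxy, $N$ is parameter count, $D$ is dataset size, and $N_c$, $D_c$, $\alpha_N$, $\alpha_D$ are fitted constants. This is the general form HPT's trends are compared against; HPT's specific fitted constants are not reproduced here:

```latex
% Power-law scaling form from LLM scaling laws (Kaplan et al.); constants are fit per domain.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```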
Limitations
- Scaling laws are demonstrated but not yet at the scale of language model experiments (~1B vs. 100B+ parameters)
- Action spaces remain separate per embodiment head -- no universal action representation
- Real-world evaluation limited to four manipulation tasks; generalization to locomotion, navigation, or more complex manipulation untested
- Pre-training data is predominantly manipulation; whether the scaling trends hold for other robot task families is unknown
Connections
- Extends the cross-embodiment vision of OpenVLA: An Open-Source Vision-Language-Action Model and RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control to explicitly heterogeneous sensor/action spaces
- Scaling laws connect to Scaling Laws For Neural Language Models -- HPT is evidence that the language model scaling paradigm transfers to robotics
- The cross-attention fusion mechanism relates to GR00T N1: An Open Foundation Model for Generalist Humanoid Robots, which also uses cross-attention between reasoning and action modules
- Relevant to Robotics open problem of cross-embodiment transfer
- Informs Foundation Models on whether pretrain-then-adapt works for embodied AI