Helix: A Vision-Language-Action Model for Generalist Humanoid Control
:page_facing_up: Read at Figure AI
Overview
Helix (Figure AI technical report, February 2025) is the first vision-language-action (VLA) model to achieve high-rate continuous control of an entire humanoid upper body, spanning 35 degrees of freedom across the wrists, torso, head, and individual fingers. The core innovation is a dual-system "System 1 / System 2" architecture that separates high-level semantic reasoning (a slow, expressive VLM) from low-level motor control (a fast, lightweight visuomotor policy).
System 2 is a 7-billion-parameter Vision-Language Model (VLM) that runs at 7-9 Hz, handling scene understanding, language comprehension, and task-level planning. System 1 is an 80-million-parameter visuomotor policy that translates System 2's latent semantic representations into precise continuous actions at 200 Hz. This separation allows Helix to combine the broad generalization and language understanding of VLMs with the fast reactive control needed for dexterous manipulation.
Helix is also the first VLA to operate two humanoid robots simultaneously on a shared long-horizon manipulation task involving novel objects, and it runs entirely on onboard, low-power embedded GPUs, making it deployment-ready.
Key Contributions
- Dual-system VLA architecture (System 1 + System 2): Separates slow semantic reasoning (7B VLM at 7-9 Hz) from fast motor control (80M policy at 200 Hz), achieving both broad generalization and precise dexterous control
- First whole-body humanoid VLA: Controls 35 DoF including individual fingers, wrists, torso, and head -- far beyond prior VLAs that controlled only 6-7 DoF robot arms
- Dual-robot coordination: First VLA to simultaneously control two humanoid robots solving a shared manipulation task
- Auto-labeled training data: Collects ~500 hours of multi-robot multi-operator teleoperated data, using a VLM to generate hindsight language instruction labels automatically
- Onboard deployment: Runs entirely on embedded low-power GPUs, enabling commercial deployment
Architecture / Method
┌──────────────────────────────────────────────────────────────┐
│ HELIX DUAL-SYSTEM VLA │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Camera 1 │ │ Camera 2 │ │ Language │ │
│ │ (head) │ │ (wrist) │ │ Instruction │ │
│ └────┬──────┘ └────┬─────┘ └───────┬──────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ System 2 — Slow Brain (7-9 Hz) │ │
│ │ 7B Vision-Language Model │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Scene Understanding + Language Ground.│ │ │
│ │ │ Task Planning + Subtask Sequencing │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └─────────────────────┬────────────────────────┘ │
│ │ Latent Semantic │
│ │ Representations │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ System 1 — Fast Brain (200 Hz) │ │
│ │ 80M Visuomotor Policy │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Continuous Joint Actions (35 DoF) │ │ │
│ │ │ Fingers + Wrists + Torso + Head │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └─────────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Joint Commands │ │
│ │ (5ms per action) │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────────────┘
The architecture is organized around two interacting systems:
System 2 (Slow Brain): A 7-billion-parameter VLM pre-trained on internet-scale vision-language data. Takes multi-camera images and natural language instructions as input. Produces rich latent semantic representations that encode the current scene state, object identities, spatial relationships, and task progress. Operates at 7-9 Hz due to computational cost.
System 2 provides:
- Scene understanding (what objects are present, their properties)
- Language grounding (mapping instructions to visual referents)
- Task-level planning (what subtask to execute next)
- Contextual embeddings passed to System 1
System 1 (Fast Brain): An 80-million-parameter visuomotor policy that takes System 2's latent representations plus raw visual input and produces continuous joint-level actions at 200 Hz. This high control frequency is essential for dexterous manipulation -- grasping, reorienting, and placing objects require sub-10 ms reactive control.
System 1 provides:
- Fast reactive motor control (200 Hz, 5 ms per action)
- Continuous action output across all 35 DoF
- Fine-grained force modulation for dexterous tasks
- Compliance and safety through high-rate feedback
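The report does not release code, so the sketch below only illustrates one common way such a dual-rate loop can be wired: a slow thread refreshes a shared latent at roughly 8 Hz while the fast loop reads whatever latent is newest at 200 Hz (about 22-29 System 1 actions per System 2 update). The names (`SharedLatent`, `system1_stub`, `system2_stub`), the threading layout, and the 512-dimensional latent are illustrative assumptions; stubs stand in for the actual models.

```python
# Minimal sketch (not Figure AI's code) of a dual-rate control loop in the
# spirit of the System 1 / System 2 split: a slow thread refreshes a shared
# latent at ~8 Hz while a fast thread reads the latest latent at 200 Hz.
import threading
import time

import numpy as np

NUM_DOF = 35   # upper-body joints reported for Helix
FAST_HZ = 200  # System 1 action rate (5 ms per action)
SLOW_HZ = 8    # System 2 refresh rate (report: 7-9 Hz)


class SharedLatent:
    """Most recent semantic latent from System 2, read by System 1."""

    def __init__(self, dim: int = 512):
        self._lock = threading.Lock()
        self._value = np.zeros(dim, dtype=np.float32)

    def write(self, value: np.ndarray) -> None:
        with self._lock:
            self._value = value

    def read(self) -> np.ndarray:
        with self._lock:
            return self._value


def system2_stub(images, instruction: str) -> np.ndarray:
    """Stand-in for the 7B VLM: cameras + language -> semantic latent."""
    return np.random.randn(512).astype(np.float32)


def system1_stub(latent: np.ndarray, images) -> np.ndarray:
    """Stand-in for the 80M policy: latent + vision -> 35-DoF joint targets."""
    return np.zeros(NUM_DOF, dtype=np.float32)


def slow_loop(shared: SharedLatent, instruction: str, stop: threading.Event) -> None:
    while not stop.is_set():
        images = None  # would be fresh head/wrist camera frames
        shared.write(system2_stub(images, instruction))
        time.sleep(1.0 / SLOW_HZ)


def fast_loop(shared: SharedLatent, stop: threading.Event, steps: int = 1000) -> None:
    for _ in range(steps):
        images = None  # freshest camera frames
        joint_targets = system1_stub(shared.read(), images)
        # joint_targets would be sent to the robot's joint controllers here
        time.sleep(1.0 / FAST_HZ)
    stop.set()


if __name__ == "__main__":
    stop, shared = threading.Event(), SharedLatent()
    t2 = threading.Thread(target=slow_loop, args=(shared, "put the groceries away", stop))
    t1 = threading.Thread(target=fast_loop, args=(shared, stop))
    t2.start(); t1.start()
    t1.join(); t2.join()
```

The property this illustrates is that System 1 never blocks on System 2: it always acts on the most recent latent, which is what lets a 7-9 Hz planner coexist with 5 ms control steps.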
Training Data Pipeline: A high-quality dataset of ~500 hours of diverse teleoperated behaviors is collected using multiple robots and multiple human operators. An auto-labeling VLM generates hindsight natural language instructions for each demonstration, creating language-conditioned training pairs without manual annotation.
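The report describes this labeling step only at a high level (a VLM writes, in hindsight, the instruction each demonstration satisfies), so the following is an assumed shape of such a pipeline; `vlm_caption`, the dataclasses, and the example caption are all hypothetical.

```python
# Sketch of hindsight auto-labeling for teleoperated demonstrations.
# `vlm_caption` is a hypothetical placeholder for the labeling VLM
# described in the report; the data layout is illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class Demonstration:
    frames: list          # camera frames recorded during teleoperation
    actions: np.ndarray   # 35-DoF joint trajectory from the human operator


@dataclass
class LabeledEpisode:
    frames: list
    actions: np.ndarray
    instruction: str      # hindsight language label produced by the VLM


def vlm_caption(frames) -> str:
    """Placeholder: ask a VLM 'what instruction does this clip accomplish?'"""
    return "Pick up the cup and place it on the shelf"


def auto_label(demos: list[Demonstration]) -> list[LabeledEpisode]:
    """Turn raw teleop demos into language-conditioned training pairs."""
    labeled = []
    for demo in demos:
        instruction = vlm_caption(demo.frames)
        labeled.append(LabeledEpisode(demo.frames, demo.actions, instruction))
    return labeled
```

The resulting (vision, instruction, action) pairs are the language-conditioned training data described above, produced without a manual annotation pass.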
Dual-Robot Operation: For multi-robot tasks, each robot has its own System 1 + System 2 stack, but they share task-level context through language instructions describing the joint task. The robots do not directly communicate actions but coordinate through shared semantic understanding.
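To make that data flow concrete, the toy sketch below runs two independent stacks whose only shared context is the per-robot language instruction describing the joint task. `HelixStack`, its `step` method, and the example instructions are illustrative assumptions, not an actual API.

```python
# Illustrative sketch of the dual-robot setup: two independent Helix stacks,
# coordinated only through complementary language instructions.
# `HelixStack` and its methods are hypothetical, not Figure AI's API.

class HelixStack:
    """One robot's full System 2 + System 1 pipeline."""

    def __init__(self, robot_id: str):
        self.robot_id = robot_id

    def step(self, instruction: str) -> str:
        # Real system: the VLM encodes instruction + cameras, the policy emits
        # 35-DoF actions at 200 Hz. Here we only report what each robot does.
        return f"{self.robot_id}: executing '{instruction}'"


# A shared task decomposed into per-robot instructions; the robots never
# exchange joint states or actions directly.
task = {
    "figure_02_left": "Hand the item to the robot on your right",
    "figure_02_right": "Take the item and place it in the open drawer",
}

robots = {name: HelixStack(name) for name in task}
for name, instruction in task.items():
    print(robots[name].step(instruction))
```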
Results
Helix demonstrates several qualitative capabilities that represent firsts for VLA models:
- Dexterous manipulation: Picks up, reorients, and places diverse objects using individual finger control
- Novel object generalization: Manipulates objects never seen during training, leveraging VLM pre-training for zero-shot recognition
- Long-horizon tasks: Completes multi-step manipulation sequences requiring task switching and error recovery
- Dual-robot coordination: Two Figure 02 robots collaboratively solve tasks (e.g., handing off objects) guided by shared language instructions
- Real-time onboard: All computation runs on the robot's embedded GPUs, with no cloud dependence
No standardized benchmark results are reported in the technical report; evaluation is primarily through real-world task demonstrations.
Limitations
- No standardized benchmark evaluation (e.g., CALVIN, MetaWorld), making quantitative comparison with other VLAs difficult
- System 2 VLM operates at only 7-9 Hz, creating potential latency in responding to novel situations
- The ~500-hour training dataset is proprietary and specific to Figure's robots, limiting reproducibility
- Dual-robot coordination relies on shared language context rather than explicit communication, which may not scale to more complex multi-agent scenarios
- Lower-body locomotion is not addressed; Helix controls only the upper body
Connections
- Directly extends the VLA paradigm from Rt 2 Vision Language Action Models Transfer Web Knowledge To Robotic Control (RT-2) to humanoid scale
- The System 1/System 2 architecture echoes the dual-system design in Groot N1 An Open Foundation Model For Generalist Humanoid Robots (GR00T N1), which also separates high-level reasoning from fast motor control
- Shares the VLM backbone approach of Gemini Robotics Bringing Ai Into The Physical World (Gemini Robotics) but targets humanoid rather than general manipulation
- Auto-labeled training data connects to the approach of Openvla An Open Source Vision Language Action Model (OpenVLA), which trains on diverse robot data
- The video-diffusion approach of Video Prediction Policy A Generalist Robot Policy With Predictive Visual Representations (VPP) offers an alternative to Helix's VLM-based backbone
- High-frequency control requirements connect to the adaptive reasoning in driving VLAs like Autovala Vision Language Action Model For End To End Autonomous Driving (AutoVLA)