# Open Questions: Vision-Language-Action Models

Stream-specific open questions for the VLA pillar. See Open Questions for the full tree across all streams.
## Architecture and scaling
- Dual-system generality: The dual-system pattern (a slow VLM at 7-10 Hz plus a fast motor policy at 120-200 Hz) emerged independently at Physical Intelligence (pi0), Google DeepMind (Gemini Robotics), NVIDIA (GR00T N1), and Figure AI (Helix). Is this the converged architecture, or will single-rate systems eventually dominate?
- Cross-embodiment scaling limits: CrossFormer trains one policy across 20+ embodiments, and HPT shows scaling laws for heterogeneous robot pretraining. But do these scaling laws hold across dramatically different action spaces (manipulation vs. driving vs. locomotion), or do they plateau?
- Action tokenization: FAST's DCT+BPE tokenizer achieves 5x faster VLA training, pi0 uses flow matching for continuous actions, and EMMA tokenizes trajectories as language. What is the right action representation: continuous, discrete vocabulary, or hybrid?
- VLM knowledge preservation: Knowledge Insulation shows that VLA training degrades VLM capabilities. How much VLM knowledge is actually needed for embodied control, and what is the optimal trade-off between language understanding and motor competence?
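The dual-rate pattern in the first question above can be sketched as a nested control loop: a slow planner refreshes a latent goal while a fast policy consumes whichever latent is current. Everything below (function names, the toy computations, the exact rates) is illustrative, not any lab's actual implementation.

```python
def slow_planner(observation):
    # Stand-in for a VLM forward pass producing a latent plan embedding.
    return {"latent": observation * 0.5}

def fast_policy(latent, proprio):
    # Stand-in for a small reactive policy producing a motor command.
    return latent["latent"] + 0.1 * proprio

def run_episode(steps=160, fast_hz=160, slow_hz=8):
    """Run `steps` fast ticks; the slow system fires every fast_hz/slow_hz ticks."""
    ratio = fast_hz // slow_hz          # fast ticks per slow tick (20 here)
    latent, commands = None, []
    for t in range(steps):
        if t % ratio == 0:              # slow system refreshes the latent
            latent = slow_planner(observation=float(t))
        commands.append(fast_policy(latent, proprio=1.0))
    return commands
```

The point of the decomposition is visible in the loop structure: the expensive call runs 8 times per simulated second while the cheap call runs 160 times, and the fast policy never blocks on the slow one within a tick.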
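FAST's recipe (a DCT over action chunks, then BPE over the quantized coefficients) can be approximated in a few lines. This sketch keeps only the low-frequency DCT coefficients and skips the BPE stage entirely, so it illustrates the compression idea rather than the actual tokenizer; the `keep` and `scale` parameters are arbitrary choices for the example.

```python
import numpy as np

def dct_ii(x):
    # DCT-II computed directly from its definition (unnormalized):
    # X_k = sum_n x_n * cos(pi * (n + 0.5) * k / N)
    n = np.arange(len(x))
    return np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / len(x)))
                     for k in range(len(x))])

def tokenize_chunk(actions, keep=4, scale=10.0):
    # Transform an action chunk, drop high frequencies, round to integer tokens.
    coeffs = dct_ii(np.asarray(actions, dtype=float))
    return np.round(coeffs[:keep] * scale).astype(int)
```

Smooth action chunks concentrate their energy in the first few coefficients, which is why a constant chunk compresses to a single nonzero token; jerky chunks would lose detail under the same truncation, which is the usual lossy-compression trade-off.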
## Training and improvement
- RL for VLAs: pi0.6 doubled task throughput via offline RL self-improvement, and DeepSeek-R1 showed that RL with rule-based rewards produces emergent reasoning. Can driving VLAs benefit from GRPO-style RL, or does the lack of clean reward functions (unlike math/code) make this fundamentally harder?
- Embodied chain-of-thought: ECoT increased VLA task success by 28% through embodied reasoning. Does chain-of-thought reasoning scale to 50+ Hz control, or is it inherently a slow-system capability?
- Open-world generalization: pi0.5 generalizes to unseen homes for 10-15 minute tasks. What is the failure mode: perception (novel objects), reasoning (novel situations), or motor control (novel physical interactions)?
## Robotics → driving transfer
- VLA transfer to driving: VoxPoser demonstrates that LLMs can compose spatial objectives for manipulation without robot-specific training. The "LLM writes code to define value maps" paradigm works for tabletop manipulation; does it transfer to high-speed, multi-agent driving?
- Action abstraction gap: Robotics VLAs operate on joint angles and end-effector poses; driving VLAs operate on trajectories, waypoints, or controls. How much of VLA progress transfers when the action abstraction is fundamentally different?
- Speed regime mismatch: Robotics VLAs can recover from errors at low speed, but driving at 60+ mph leaves no recovery margin. Does this fundamentally change the architecture requirements, or is it just a latency constraint?
## Partially answered
- Q1 (Dual-system): Four independent convergences strongly suggest this is the right architecture for now. But single-rate 50 Hz models (pi0 flow matching) show the boundary is soft.
- Q5 (RL for VLAs): pi0.6 proves offline RL works for manipulation. AlphaDrive applies GRPO to driving VLMs. But reward design for driving safety/comfort remains the bottleneck.
- Q8 (VoxPoser transfer): 3D value map composition is an alternative to E2E VLAs, avoiding task-specific training entirely. Whether it scales to driving's complexity and speed is untested.
## Key papers for this stream
| Paper | Relevance |
|---|---|
| pi0: A Vision-Language-Action Flow Model for General Robot Control | Reference VLA: flow matching, 7 robots, 68 tasks |
| pi0.6: A VLA That Learns from Experience | RL self-improvement for VLAs |
| OpenVLA: An Open-Source Vision-Language-Action Model | Open-source VLA baseline |
| Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion, and Aviation | Cross-embodiment scaling (CrossFormer) |
| HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers | Heterogeneous pretraining scaling laws |
| Octo: An Open-Source Generalist Robot Policy | First open generalist robot policy |
| ECoT: Embodied Chain-of-Thought Reasoning for Vision-Language-Action Models | Embodied chain-of-thought |
| GR00T N1: An Open Foundation Model for Generalist Humanoid Robots | Humanoid foundation model |
| Gemini Robotics: Bringing AI into the Physical World | Gemini for physical robotics |
| VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | LLM-composed spatial objectives |