Open Questions: Vision-Language-Action Models

Stream-specific open questions for the VLA pillar. See Open Questions for the full tree across all streams.

Architecture and scaling

  1. Dual-system generality: The dual-system pattern (slow VLM at 7-10 Hz + fast motor policy at 120-200 Hz) independently emerged at Physical Intelligence (pi0), Google DeepMind (Gemini Robotics), NVIDIA (GR00T N1), and Figure AI (Helix). Is this the converged architecture, or will single-rate systems eventually dominate? (A minimal control-loop sketch follows this list.)

  2. Cross-embodiment scaling limits: CrossFormer trains one policy across 20+ embodiments. HPT shows scaling laws for heterogeneous robot pretraining. But do these scaling laws hold across dramatically different action spaces (manipulation vs. driving vs. locomotion), or do they plateau?

  3. Action tokenization: FAST's DCT+BPE achieves 5x faster VLA training. pi0 uses flow matching for continuous actions. EMMA tokenizes trajectories as language. What is the right action representation — continuous, discrete vocabulary, or hybrid? (A DCT tokenization sketch follows this list.)

  4. VLM knowledge preservation: Knowledge Insulation shows VLA training degrades VLM capabilities. How much VLM knowledge is actually needed for embodied control, and what is the optimal trade-off between language understanding and motor competence?
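
As a concrete reading of question 1, here is a minimal sketch of the dual-rate pattern. SlowPlanner and FastPolicy are hypothetical stand-ins rather than any published model's API; only the 8 Hz / 200 Hz rate split mirrors the systems above.

```python
# Minimal dual-system control loop: a slow semantic planner feeding a
# fast reactive policy. All classes here are illustrative stand-ins.
import time

class SlowPlanner:
    """Stand-in for a VLM that emits a latent plan (~8 Hz)."""
    def plan(self, observation):
        return {"latent_goal": "reach-target"}  # placeholder latent

class FastPolicy:
    """Stand-in for a reactive motor policy (~200 Hz)."""
    def act(self, observation, latent):
        return [0.0] * 7  # placeholder 7-DoF action

def control_loop(env, steps=1000, fast_hz=200, slow_hz=8):
    planner, policy = SlowPlanner(), FastPolicy()
    ticks_per_plan = fast_hz // slow_hz   # slow system updates every 25 fast ticks
    latent, obs = None, env.reset()
    for t in range(steps):
        if t % ticks_per_plan == 0:
            latent = planner.plan(obs)    # slow path: semantic re-planning
        action = policy.act(obs, latent)  # fast path: reactive control
        obs = env.step(action)
        time.sleep(1.0 / fast_hz)
```

In deployed systems the two loops typically run asynchronously in separate processes; the synchronous tick counter above is only for readability.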
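
And for question 3, a sketch of DCT-based action tokenization in the spirit of FAST. The scale and rounding choices are illustrative, not FAST's hyperparameters, and the BPE stage FAST trains over the quantized coefficients is omitted.

```python
# Illustrative DCT-based action tokenizer: compact the chunk's energy into
# low-frequency coefficients, quantize, and flatten to an integer sequence.
import numpy as np
from scipy.fft import dct, idct

def tokenize_chunk(actions, scale=10.0):
    """actions: (T, D) chunk of continuous actions -> flat integer tokens."""
    coeffs = dct(actions, type=2, axis=0, norm="ortho")
    return np.round(coeffs * scale).astype(np.int64).flatten()  # BPE would run over this

def detokenize_chunk(tokens, T, D, scale=10.0):
    coeffs = tokens.reshape(T, D).astype(np.float64) / scale
    return idct(coeffs, type=2, axis=0, norm="ortho")

chunk = np.random.randn(50, 7)                 # 50 timesteps, 7 action dims
recon = detokenize_chunk(tokenize_chunk(chunk), 50, 7)
assert np.allclose(chunk, recon, atol=0.5)     # differs only by quantization error
```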

Training and improvement

  5. RL for VLAs: pi0.6 doubled task throughput via offline RL self-improvement. DeepSeek-R1 showed RL with rule-based rewards produces emergent reasoning. Can driving VLAs benefit from GRPO-style RL, or does the lack of clean reward functions (unlike math/code) make this fundamentally harder? (A GRPO advantage sketch follows this list.)

  6. Embodied chain-of-thought: ECoT increased VLA success by 28% through embodied reasoning. Does chain-of-thought reasoning scale to 50+ Hz control, or is it inherently a slow-system capability? (A latency back-of-envelope follows this list.)

  7. Open-world generalization: pi0.5 generalizes to unseen homes for 10-15 minute tasks. What is the failure mode — perception (novel objects), reasoning (novel situations), or motor control (novel physical interactions)?
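
For question 5, a minimal sketch of the GRPO group-relative advantage computation popularized by DeepSeek-R1 and applied to driving VLMs by AlphaDrive. The rewards and group size are invented; the open question above is precisely where such scalars come from in driving.

```python
# GRPO in one step: score G rollouts of the same prompt/scene, then
# normalize rewards within the group to get per-rollout advantages.
import numpy as np

def grpo_advantages(group_rewards):
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # better-than-group-mean -> positive

# One scene, G=4 sampled rollouts scored by some reward function:
print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))  # [ 1.51 -0.90  0.30 -0.90]
```

These advantages then weight a clipped policy-gradient loss over each rollout's tokens; no learned critic is needed, which is what makes the recipe attractive when rewards are rule-based.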
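
For question 6, the latency arithmetic behind the 50 Hz doubt. The chain length and decode speed are assumptions for illustration; real numbers depend on model and hardware.

```python
# Back-of-envelope: even a short reasoning chain overruns a 20 ms budget.
control_hz = 50
period_ms = 1000 / control_hz          # 20 ms per control step
chain_tokens = 20                      # assumed short embodied-reasoning chain
decode_tok_per_s = 50                  # assumed autoregressive decode speed
chain_ms = chain_tokens / decode_tok_per_s * 1000
print(period_ms, chain_ms)             # 20.0 ms budget vs 400.0 ms of reasoning
```

On these assumptions a single chain spans 20 control steps, which is the sense in which chain-of-thought looks like a slow-system capability.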

Robotics → driving transfer

  8. VLA transfer to driving: VoxPoser demonstrates that LLMs can compose spatial objectives for manipulation without robot-specific training. The "LLM writes code to define value maps" paradigm works for tabletop manipulation — does it transfer to high-speed multi-agent driving? (A toy value-map sketch follows this list.)

  9. Action abstraction gap: Robotics VLAs operate on joint angles/end-effector poses. Driving VLAs operate on trajectories/waypoints/controls. How much of VLA progress transfers when the action abstraction is fundamentally different?

  10. Speed regime mismatch: Robotics VLAs can recover from errors at low speed. Driving at 60+ mph leaves no recovery margin. Does this fundamentally change the architecture requirements, or is it just a latency constraint? (The arithmetic after this list makes the margins concrete.)
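
For question 8, a toy rendering of the "LLM writes code to define value maps" idea. The grid, cost functions, and scene are invented; VoxPoser's actual pipeline grounds entities with open-vocabulary perception and plans over the composed maps.

```python
# Language-specified objectives become composable cost fields over a voxel
# grid; a planner then seeks the minimum-cost region.
import numpy as np

xs, ys, zs = np.indices((40, 40, 40))  # 40^3 voxel grid (illustrative size)

def attract(target, weight=1.0):
    """Low cost near the target voxel."""
    return weight * np.sqrt((xs - target[0])**2 + (ys - target[1])**2 + (zs - target[2])**2)

def repel(obstacle, radius=5.0, weight=10.0):
    """High cost inside a sphere around the obstacle."""
    d = np.sqrt((xs - obstacle[0])**2 + (ys - obstacle[1])**2 + (zs - obstacle[2])**2)
    return weight * np.maximum(0.0, radius - d)

# An LLM might emit exactly this composition for "near the mug, away from
# the knife" (coordinates invented):
cost = attract(target=(30, 30, 10)) + repel(obstacle=(20, 20, 10))
goal_voxel = np.unravel_index(np.argmin(cost), cost.shape)
```

The driving question is whether such maps can be composed and re-grounded fast enough for multi-agent scenes at speed.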
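
And for question 10, the arithmetic that makes the regimes concrete (latency values assumed for illustration):

```python
# Distance traveled "blind" during end-to-end reaction latency.
MPH_TO_MS = 0.44704                       # miles/hour -> meters/second
speed = 60 * MPH_TO_MS                    # ~26.8 m/s at highway speed
for latency_s in (0.05, 0.1, 0.5):
    print(f"{latency_s * 1000:.0f} ms -> {speed * latency_s:.1f} m")
# 50 ms -> 1.3 m; 100 ms -> 2.7 m; 500 ms -> 13.4 m.
# A tabletop arm moving at 0.5 m/s covers 0.25 m in those same 500 ms.
```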

Partially answered

  • Q1 (Dual-system): Four independent convergences strongly suggest this is the right architecture for now, but single-rate 50 Hz models (pi0 flow matching) show the boundary is soft; a sketch of the flow-matching objective follows this list.
  • Q5 (RL for VLAs): pi0.6 proves offline RL works for manipulation. AlphaDrive applies GRPO to driving VLMs. But reward design for driving safety/comfort remains the bottleneck.
  • Q8 (VoxPoser transfer): 3D value map composition is an alternative to E2E VLAs, avoiding task-specific training entirely. Whether it scales to driving's complexity and speed is untested.
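
On the flow-matching point under Q1: the standard conditional flow-matching (rectified-flow) training step for a continuous action head, sketched generically. This is not Physical Intelligence's exact recipe; `policy` is a hypothetical network over an observation embedding, noisy actions, and the flow time.

```python
# Train the action head to predict the velocity field that transports
# noise to ground-truth action chunks along straight-line paths.
import torch

def flow_matching_loss(policy, obs_emb, actions):
    """actions: (B, T, D) ground-truth action chunk."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)   # per-sample flow time in [0, 1]
    x_t = (1 - t) * noise + t * actions      # point on the straight-line path
    target_v = actions - noise               # constant velocity of that path
    pred_v = policy(obs_emb, x_t, t)         # network predicts the flow field
    return torch.mean((pred_v - target_v) ** 2)

# Inference integrates the learned field from noise in a few Euler steps:
#   x <- x + (1 / K) * policy(obs_emb, x, t_k), k = 0..K-1
```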

Key papers for this stream

  • pi0: A Vision-Language-Action Flow Model for General Robot Control (reference VLA: flow matching, 7 robots, 68 tasks)
  • pi0.6: A VLA That Learns From Experience (RL self-improvement for VLAs)
  • OpenVLA: An Open-Source Vision-Language-Action Model (open-source VLA baseline)
  • Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation (cross-embodiment scaling)
  • HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers (heterogeneous pretraining scaling laws)
  • Octo: An Open-Source Generalist Robot Policy (first open generalist robot policy)
  • ECoT: Embodied Chain-of-Thought Reasoning for Vision-Language-Action Models (embodied chain-of-thought)
  • GR00T N1: An Open Foundation Model for Generalist Humanoid Robots (humanoid foundation model)
  • Gemini Robotics: Bringing AI into the Physical World (Gemini for physical robotics)
  • VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (LLM-composed spatial objectives)