
Open Questions: Foundation Models & Cross-Embodiment

Stream-specific open questions for foundation models, scaling, and cross-embodiment transfer. See Open Questions for the full tree across all streams.

Scaling and efficiency

  1. Compute-optimal scaling for embodied AI: The Kaplan scaling laws and Chinchilla established compute-optimal parameter-to-token ratios for language models. HPT demonstrates scaling laws for heterogeneous robot pretraining. But do these laws hold for multimodal embodied data (images + proprioception + actions), or does the data mixture change the optimal ratio?

  2. Open vs. closed model trajectory: Llama 2, Mistral 7B, OpenVLA, and Octo catalyzed more downstream work than closed counterparts. Qwen3 and Gemma 3 now match frontier closed models. Will open-source maintain this acceleration, or will proprietary data advantages (Waymo driving logs, Tesla fleet data) create an insurmountable moat for driving?

  3. Adaptation efficiency: LoRA (29K citations) and QLoRA made efficient adaptation standard. Prefix-Tuning pioneered continuous prompt optimization. Is parameter-efficient fine-tuning sufficient for embodied domains, or do physical tasks require more extensive adaptation than language tasks?

  4. Distillation as deployment strategy: Gemma 3's 4B model matches the prior generation's 27B via distillation. DeepSeek-R1 distills reasoning into models as small as 1.5B. DiMA distills LLM reasoning into a vision-based planner. Is train-large-distill-small the universal deployment pattern for safety-critical systems?
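
The compute-optimal question in Q1 can be made concrete with a minimal numerical sketch, assuming the two standard language-model approximations: training cost C ≈ 6·N·D FLOPs and the Chinchilla heuristic of roughly 20 training tokens per parameter. Whether this 20:1 ratio survives contact with mixed embodied data is exactly what the question asks; the function and budget below are illustrative, not results from any paper.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Compute-optimal model size N and token count D under two standard
    approximations: training cost C ~= 6*N*D FLOPs, and the Chinchilla
    heuristic D ~= 20*N tokens (a language-only result; it may not hold
    for multimodal embodied data).
    """
    # C = 6*N*D and D = k*N  =>  C = 6k * N^2  =>  N = sqrt(C / 6k)
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a hypothetical 1e21-FLOP robot-pretraining budget
n, d = chinchilla_optimal(1e21)
```

At 1e21 FLOPs this gives roughly a 2.9B-parameter model trained on about 58B tokens; shifting `tokens_per_param` is one way to express how an embodied data mixture might move the optimum.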
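
The parameter-efficiency premise of Q3 is concrete: LoRA freezes the pretrained weight W and learns a rank-r update ΔW = B·A, cutting trainable parameters from d_out·d_in to r·(d_out + d_in). A minimal numpy sketch follows; the shapes, zero-init of B, and the α/r scaling follow the LoRA paper, but the matrices here are random stand-ins, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 512, 8, 16   # rank r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus low-rank update, scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
y = lora_forward(x)                       # == W @ x until B is trained

full_params = W.size                      # 262,144
lora_params = A.size + B.size             # 8,192 (~3% of full)
```

Whether ~3% of the parameters is enough headroom for physical tasks, or embodied domains need fuller adaptation, is the open question.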

Cross-embodiment transfer

  1. Embodiment scaling laws: CrossFormer (20+ embodiments) and work on embodiment scaling laws show positive returns from embodiment diversity. Is there an optimal diversity-depth trade-off, i.e., how many embodiments versus how much data per embodiment?

  2. Action space universality: UniAct proposes universal action representations. FAST introduces DCT+BPE tokenization of continuous actions. Can a single action representation truly span manipulation, navigation, locomotion, and driving, or are domain-specific action spaces necessary?

  3. World model as universal simulator: Cosmos and UniSim train world models on internet-scale video for physical AI simulation. Can learned world models replace engineered simulators for training embodied agents, or do they introduce systematic biases?
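
The DCT+BPE idea behind Q2 can be shown in miniature: transform an action chunk with a discrete cosine transform, keep and quantize the low-frequency coefficients, and the resulting integer sequence is what a BPE tokenizer would then compress. A toy scipy sketch, with BPE omitted; the chunk length, kept-coefficient count, and quantization scale are illustrative choices, not FAST's actual hyperparameters.

```python
import numpy as np
from scipy.fft import dct, idct

# Toy 50-step, 1-DoF action chunk built from low-frequency DCT content,
# so truncation to the first few coefficients is near-lossless
true_coeffs = np.zeros(50)
true_coeffs[:5] = [3.0, 1.5, -0.8, 0.4, 0.2]
actions = idct(true_coeffs, norm="ortho")

# DCT concentrates smooth trajectories in the low frequencies
coeffs = dct(actions, norm="ortho")

# Keep the first k coefficients and quantize to integers ("token IDs")
k, scale = 10, 100.0
tokens = np.round(coeffs[:k] * scale).astype(int)

# Decode: dequantize, zero-pad the discarded tail, inverse DCT
recon_coeffs = np.zeros_like(coeffs)
recon_coeffs[:k] = tokens / scale
recon = idct(recon_coeffs, norm="ortho")

err = np.max(np.abs(recon - actions))   # small for smooth trajectories
```

The universality question is whether one such representation can serve jerky contact-rich manipulation and smooth highway driving with the same frequency budget.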

Multimodal reasoning

  1. Vision-language alignment quality: CLIP (58K citations) established contrastive vision-language alignment. LLaVA (13K+ citations) added instruction tuning. BLIP unified understanding + generation. Is current VL alignment sufficient for safety-critical spatial reasoning, or is there a fundamental "grounding gap" between CLIP-style alignment and true 3D spatial understanding?

  2. Emergent capabilities and risks: The Foundation Models report warned about emergence and homogenization risks. GPT-4 demonstrated surprising emergent capabilities. As driving models scale, will we see emergent driving capabilities (handling novel scenarios) or emergent failures (systematic blind spots)?

  3. Alignment for physical systems: InstructGPT and DPO align LLMs to human preferences. Driving alignment requires physical safety guarantees. Can RLHF/DPO techniques transfer to physical AI, or do we need fundamentally new alignment methods?
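
The CLIP-style alignment that Q1 interrogates reduces to a symmetric InfoNCE loss over a batch of image-text pairs: matched pairs are positives, everything else in the batch is a negative. A minimal numpy sketch with random stand-in embeddings; the batch size, dimension, and temperature are illustrative, not CLIP's training configuration.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the (i, i) pairs are positives, all other
    pairs in the batch serve as negatives."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix

    def xent(l):
        # Cross-entropy against the diagonal, numerically stabilized
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = clip_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
```

Note what the loss never sees: any 3D structure. It only ranks pairs within a batch, which is one way to state the suspected "grounding gap" for spatial reasoning.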
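
The DPO objective in Q3 is simple enough to write down directly: it scores a preferred versus a rejected response via log-probability ratios against a frozen reference policy. A scalar sketch, with illustrative β and log-prob values; the open question is whether a soft preference margin like this can encode hard physical-safety constraints at all.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l))).
    Log-probs are per-sequence sums, in nats."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# Loss falls as the policy prefers the chosen response more strongly
# than the reference does (values below are made up for illustration)
easy = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # policy strongly prefers chosen
hard = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # policy prefers rejected
```

Whatever the margin, nothing in this loss distinguishes "mildly dispreferred" from "catastrophically unsafe," which is the transfer worry for physical systems.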

Partially answered

  • Q2 (Open vs. closed): The open-source acceleration is clear empirically. Qwen3 and Gemma 3 matching frontier models in 2025 suggests convergence is accelerating. But driving-specific data remains a differentiator.
  • Q4 (Distillation): Strong evidence from Gemma 3, R1, and DiMA that distillation works across domains. The pattern is converging toward train-large-distill-small as standard practice.
  • Q8 (VL alignment): CLIP → LLaVA → SAM trajectory shows alignment is improving. But SAM's promptable segmentation and DINO's emergent properties suggest 2D alignment may not be enough for 3D spatial reasoning needed in driving.

Key papers for this stream

  • On the Opportunities and Risks of Foundation Models (foundational framework: emergence + homogenization)
  • Scaling Laws for Neural Language Models (scaling laws for language)
  • Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers (HPT; scaling laws for heterogeneous robots)
  • Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion, and Aviation (CrossFormer; one policy for 20+ embodiments)
  • LoRA: Low-Rank Adaptation of Large Language Models (dominant adaptation method)
  • Learning Transferable Visual Models From Natural Language Supervision (CLIP; vision-language alignment)
  • Cosmos World Foundation Model Platform for Physical AI (world foundation model platform)
  • Learning Interactive Real-World Simulators (UniSim; interactive real-world simulators)
  • Qwen3 Technical Report (open-weight reasoning models)
  • Gemma 3 Technical Report (distillation-driven efficiency)