
RoboVLMs: What Matters in Building Vision-Language-Action Models

📄 Read on arXiv

Overview

RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles for building effective Vision-Language-Action (VLA) models. While VLA models have emerged as a compelling paradigm for generalist robot control -- leveraging powerful vision-language representations from pretrained VLMs -- the field has lacked systematic understanding of which design choices actually matter. RoboVLMs addresses this gap through over 600 experiments spanning VLM backbone selection, VLA architectural formulations, action space design, and cross-embodiment data strategies.

The study introduces the RoboVLMs framework, a flexible codebase supporting systematic comparison across eight VLM backbones (3B to 9B parameters, including Flamingo, LLaVA, Qwen-VL, MoonDream, UForm, KosMos, and PaliGemma) and four VLA formulations (one-step continuous, one-step discrete, interleaved-continuous history, and policy-head-continuous history). Evaluation covers both simulation benchmarks (CALVIN, SimplerEnv) and real-world experiments on a 7-DoF Kinova Gen3 robot with 100 tasks and 74K trajectories.

The key findings provide concrete, actionable guidance: KosMos and PaliGemma backbones significantly outperform alternatives due to comprehensive vision-language pretraining; continuous action spaces consistently beat discrete autoregressive tokenization, especially on longer-horizon tasks; policy-head architectures for history integration outperform interleaved approaches; and a post-training strategy (pretrain on cross-embodiment data, fine-tune on target domain) is the most effective way to leverage heterogeneous robot datasets. The best-performing RoboVLM configuration achieves state-of-the-art results with a 12.6% absolute gain on single tasks and 30.3% improvement on 5 consecutive tasks on the CALVIN unseen split, with emergent self-correction behavior in real-world settings.

Key Contributions

  • Systematic VLM backbone comparison: Evaluated 8 VLM backbones (3B-9B) across all VLA formulations, finding KosMos and PaliGemma are consistently best due to rich vision-language pretraining -- settling a previously unclear design choice
  • Action space analysis: Demonstrated that continuous actions consistently outperform discrete autoregressive tokenization, particularly for long-horizon tasks where discretization errors compound
  • History integration architecture: Found that policy-head structures (external temporal aggregation modules) outperform interleaved token approaches, preserving the original VLM's vision-language fusion while adding temporal reasoning
  • Cross-embodiment data strategy: Established that post-training (cross-embodiment pretrain then target fine-tune) is the optimal strategy, vs. co-training or target-only training
  • Open-source framework: Released the RoboVLMs codebase enabling reproducible VLA research across backbone and architecture combinations

Architecture

┌──────────────────────────────────────────────────────────────┐
│              RoboVLMs Framework: 4 Formulations              │
│                                                              │
│  Images + Language ──► VLM Backbone (8 options tested)       │
│                        │                                     │
│         ┌──────────────┼────────────────────┐                │
│         │              │                    │                │
│     One-Step       One-Step          History Models          │
│    Continuous      Discrete          ┌──────┴───────┐        │
│         │              │             │              │        │
│         ▼              ▼             ▼              ▼        │
│    ┌─────────┐    ┌─────────┐  ┌───────────┐   ┌──────────┐  │
│    │MLP Head │    │LM Head  │  │Interleaved│   │Policy    │  │
│    │→ cont.  │    │→ action │  │tokens in  │   │Head      │  │
│    │7-DoF    │    │tokens   │  │VLM context│   │(separate │  │
│    │action   │    │(RT-2    │  │+ cont.    │   │temporal  │  │
│    │         │    │ style)  │  │action head│   │module)   │  │
│    └─────────┘    └─────────┘  └───────────┘   └──────────┘  │
│                                                              │
│  Best config: PaliGemma/KosMos + Policy-Head + Continuous    │
│  Data strategy: Pretrain cross-embodiment → fine-tune        │
└──────────────────────────────────────────────────────────────┘

Method

The RoboVLMs framework supports four VLA formulations built on a common VLM backbone:

One-step models process only the current observation:

  • Continuous-Action: VLM features are passed through an MLP action head that directly regresses continuous 7-DoF actions (6-DoF pose + gripper)
  • Discrete-Action: Actions are discretized into tokens and predicted autoregressively by the VLM's language head, following the RT-2 paradigm
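
A rough sketch of these two one-step decoders (not the RoboVLMs implementation) is shown below; it assumes the backbone exposes a pooled VLM feature vector, and the layer sizes, 256-bin count, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContinuousActionHead(nn.Module):
    """One-step continuous formulation: regress a 7-DoF action from pooled VLM features."""
    def __init__(self, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(),
            nn.Linear(hidden, 7),          # 6-DoF pose delta + 1 gripper logit
        )

    def forward(self, vlm_feat: torch.Tensor) -> torch.Tensor:
        out = self.mlp(vlm_feat)
        pose = torch.tanh(out[..., :6])    # continuous pose delta in [-1, 1]
        grip = out[..., 6:]                # gripper open/close logit (BCE target)
        return torch.cat([pose, grip], dim=-1)

def discretize_action(action: torch.Tensor, n_bins: int = 256) -> torch.Tensor:
    """One-step discrete formulation (RT-2 style): map each normalized action
    dimension in [-1, 1] to one of n_bins token ids, which the VLM's language
    head then predicts autoregressively, one token per dimension."""
    clipped = action.clamp(-1.0, 1.0)
    return ((clipped + 1.0) / 2.0 * (n_bins - 1)).round().long()
```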

History models incorporate temporal context from past observations:

  • Interleaved-Continuous: Multiple timesteps of images and proprioception are interleaved as tokens in the VLM's context window, with a continuous action head
  • Policy-Head-Continuous: The current observation is processed by the VLM, while a separate policy head (e.g., a small transformer or MLP) aggregates VLM features with historical embeddings to produce actions
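
A minimal sketch of the policy-head idea follows, assuming per-frame VLM features are cached into a rolling window; the 2-layer transformer, dimensions, and names are placeholders rather than the framework's actual module.

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """External temporal module: the VLM encodes each frame on its own, and a
    small transformer aggregates the last `horizon` feature vectors into one
    continuous 7-DoF action, leaving the VLM's vision-language fusion untouched."""
    def __init__(self, feat_dim: int = 2048, d_model: int = 512, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.action = nn.Linear(d_model, 7)

    def forward(self, history_feats: torch.Tensor) -> torch.Tensor:
        # history_feats: (batch, horizon, feat_dim), oldest frame first
        x = self.temporal(self.proj(history_feats))
        return self.action(x[:, -1])       # decode the action from the latest step

# Usage: keep a rolling buffer of per-frame VLM features at inference time.
head = PolicyHead()
cached_feats = torch.randn(1, 8, 2048)     # 8 cached VLM feature vectors
action = head(cached_feats)                # -> tensor of shape (1, 7)
```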

The system processes multi-camera RGB images and proprioceptive state, outputting 7-dimensional action vectors. Eight VLM backbones are tested: variants from the Flamingo family, LLaVA, Qwen-VL, MoonDream, UForm, KosMos, and PaliGemma.
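
To make that interface concrete, a hypothetical per-step data container might look like the following; the field names and camera setup are illustrative assumptions, not the framework's API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VLAStep:
    """One control step: multi-camera observations and language in, a 7-D action out."""
    rgb_static: np.ndarray    # (H, W, 3) third-person camera image
    rgb_gripper: np.ndarray   # (H, W, 3) wrist-mounted camera image
    proprio: np.ndarray       # (D,) proprioceptive state of the arm
    instruction: str          # natural-language task description
    action: np.ndarray        # (7,) 6-DoF end-effector delta + gripper open/close
```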

Key findings on architecture

Design Choice           Best Option          Why
VLM backbone            KosMos, PaliGemma    Comprehensive V-L pretraining on large-scale datasets
Action space            Continuous           Avoids discretization error accumulation over long horizons
History integration     Policy-Head          Preserves VLM fusion; adds temporal context externally
Cross-embodiment data   Post-training        Pre-train cross-embodiment, fine-tune on target domain
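
A schematic of the post-training row above, written as a short Python sketch: the function name, `train_step` method, and epoch counts are placeholders; only the two-stage order (cross-embodiment first, target second) reflects the paper's finding.

```python
def post_training(model, cross_embodiment_loader, target_loader,
                  pretrain_epochs: int = 1, finetune_epochs: int = 5):
    """Two-stage recipe: first absorb diverse cross-embodiment trajectories
    (an Open X-Embodiment-style mixture), then specialize on the target
    robot's own demonstrations. Mixing both in a single stage (co-training)
    was found to hurt target-domain performance."""
    for _ in range(pretrain_epochs):         # stage 1: cross-embodiment pretraining
        for batch in cross_embodiment_loader:
            model.train_step(batch)
    for _ in range(finetune_epochs):         # stage 2: target-domain fine-tuning
        for batch in target_loader:
            model.train_step(batch)
    return model
```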

Results

The best RoboVLM configuration achieves state-of-the-art performance on CALVIN:

Method            1 Task       2 Tasks   3 Tasks   4 Tasks   5 Tasks
RoboVLM (best)    +12.6% abs   —         —         —         +30.3% abs
Prior SOTA        baseline     —         —         —         baseline

(Absolute improvements over the prior SOTA on the CALVIN ABC->D unseen split)

Key ablation insights

  1. Backbone matters most: Switching from the weakest to strongest VLM backbone yields larger gains than any architectural change, underscoring that the quality of pretrained visual-linguistic representations is the dominant factor
  2. Continuous vs. discrete: The gap widens on longer-horizon tasks (5 consecutive tasks), where discrete tokenization accumulates quantization error across sequential action predictions (a back-of-the-envelope illustration follows this list)
  3. Post-training > co-training: Simply mixing cross-embodiment data during training hurts performance on the target domain due to distribution mismatch; the two-stage post-training approach resolves this
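
To illustrate point 2, here is a back-of-the-envelope calculation of worst-case quantization error, assuming 256 uniform bins over a normalized action range of [-1, 1]; the step count and resulting numbers are illustrative, not measured results.

```python
# Worst-case quantization error per action dimension with 256 uniform bins
# spanning the normalized range [-1, 1]: half the bin width.
n_bins = 256
bin_width = 2.0 / n_bins
per_step_err = bin_width / 2               # ~0.0039 in normalized units

# Over a long rollout the end-effector integrates many small deltas, so if the
# rounding errors do not cancel, the drift grows with the episode length.
steps = 500
worst_case_drift = per_step_err * steps    # ~1.95 normalized units
print(f"per-step error: {per_step_err:.4f}, worst-case drift: {worst_case_drift:.2f}")
```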

Real-world results

Real-world experiments on a 7-DoF Kinova Gen3 demonstrate:

  • Generalization to unseen distractors, backgrounds, and target objects
  • Robustness to novel natural-language skill descriptions
  • Emergent self-correction: the model recovers from intermediate failures without explicit recovery training

Limitations & Open Questions

  • Scale ceiling: All tested backbones are 3B-9B parameters; whether the findings (e.g., backbone ranking) hold at larger scales (30B+) is unknown
  • Action representation: The study compares continuous vs. discrete actions and includes flow matching as an evaluated training objective (finding no significant gain over MSE+BCE for short-horizon tasks), but does not evaluate structured alternatives like VQ codebooks (UniAct)
  • Single-arm focus: Experiments use a single 7-DoF manipulator; generalization to bimanual, mobile, or humanoid embodiments is untested
  • Real-time deployment: Inference speed and latency are not systematically benchmarked across configurations
  • Does backbone ranking transfer across tasks? KosMos and PaliGemma excel on manipulation -- unclear if this holds for navigation, locomotion, or driving

Connections

Related papers in the wiki:

  • Rt 2 Vision Language Action Models Transfer Web Knowledge To Robotic Control: RT-2 established the VLA paradigm that RoboVLMs systematically ablates; the discrete action tokenization approach tested here directly follows RT-2's design
  • Openvla An Open Source Vision Language Action Model: OpenVLA is one of the key baselines; RoboVLMs demonstrates that backbone and architecture choices can significantly outperform OpenVLA's fixed design
  • Pi0 A Vision Language Action Flow Model For General Robot Control: pi0 uses PaliGemma (one of RoboVLMs' top backbones) with flow-matching actions, validating the backbone finding while exploring a different action space
  • Ecot Embodied Chain Of Thought Reasoning For Vision Language Action Models: ECoT adds chain-of-thought reasoning to VLAs; RoboVLMs' history-integration findings are complementary
  • Uniact Universal Actions For Enhanced Embodied Foundation Models: UniAct's VQ codebook action space is an alternative to the continuous/discrete dichotomy studied here
  • Smolvla A Vision Language Action Model For Affordable Robotics: SmolVLA shows a 450M model can compete with 3B+ models, raising questions about whether RoboVLMs' backbone rankings hold at smaller scales
  • Fast Efficient Action Tokenization For Vision Language Action Models: FAST proposes DCT+BPE action tokenization as a third alternative to the continuous/discrete options evaluated here
  • Vision Language Action: broader VLA paradigm context
  • Robotics: robotics VLA landscape