Tags

239 tags across the wiki

paper (114), autonomous-driving (92), foundation-model (55), transformer (53), vla (49), planning (42), robotics (41), computer-vision (36), perception (32), ilya-30 (29), multimodal (29), nlp (26), end-to-end (24), language-modeling (24), llm (17), reasoning (17), imitation-learning (16), 3d-occupancy (15), vlm (15), bev (14), diffusion (13), e2e (12), reinforcement-learning (12), world-model (12), chain-of-thought (10), benchmark (9), scaling (9), cross-embodiment (7), driving (6), gaussian-splatting (6), generative-models (6), image-classification (6), information-theory (6), questions (6), self-supervised (6), sources (6), alignment (5), attention (5), cnn (5), foundation (5), knowledge-distillation (5), language-model (5), prediction (5), simulation (5), evaluation (4), image-generation (4), instruction-tuning (4), mixture-of-experts (4), rnn (4), sequence-to-sequence (4), sparse-representation (4), video-prediction (4), explainability (3), flow-matching (3), lstm (3), map (3), occupancy (3), open-source (3), semantic-segmentation (3), sequence-modeling (3), trajectory-prediction (3), vectorized-representation (3), 3d-detection (2), 3d-perception (2), 3d-reconstruction (2), action-representation (2), autonomy (2), autoregressive (2), bimanual (2), closed-loop (2), complexity-theory (2), dataset (2), deployment (2), distributed-training (2), efficient-inference (2), embodied (2), fine-tuning (2), foundation-models (2), foundational (2), gaussian-representation (2), generation (2), generative (2), human-interaction (2), humanoid (2), manipulation (2), memory-augmented-networks (2), ml (2), multi-camera (2), multilingual (2), object-detection (2), parameter-efficient-fine-tuning (2), prompting (2), real-time (2), regularization (2), relational-reasoning (2), residual-networks (2), rlhf (2), scaling-laws (2), segmentation (2), self-improvement (2), self-supervised-learning (2), state-space (2), systems (2), thermodynamics (2), vision-language-model (2), vision-transformer (2), visual-question-answering (2), zero-shot (2), 3d (1), 3d-scene (1), 3d-semantic-occupancy (1), agenda (1), agentic (1), agi (1), algorithmic-information-theory (1), algorithmic-randomness (1), asynchronous (1), attention-mechanism (1), batch (1), bayesian-inference (1), behavior-forecasting (1), camera-fusion (1), classifier-guidance (1), combinatorial-optimization (1), comparison (1), compression (1), computability (1), concept (1), contrastive-learning (1), control (1), convolutional-neural-networks (1), corpus (1), course (1), data-collection (1), decoupled (1), deep-learning (1), denoising (1), depth-estimation (1), dexterous-manipulation (1), differentiable-programming (1), diffusion-policy (1), diffusion-transformer (1), dilated-convolutions (1), dropout (1), efficient (1), embodied-ai (1), embodiment (1), emergent-abilities (1), end-to-end-learning (1), evaluation-metric (1), few-shot (1), few-shot-learning (1), foundations (1), frontend (1), gaussian (1), gaussian-rendering (1), generalist-agent (1), generalization (1), gpu-training (1), graph-neural-networks (1), grounding (1), grpo (1), hierarchical (1), high-frequency-control (1), hosting (1), ilya (1), image-captioning (1), image-text-retrieval (1), in-context-learning (1), inductive-bias (1), intelligence-measurement (1), interactive-annotation (1), interactive-segmentation (1), knowledge-preservation (1), kolmogorov-complexity (1), lanegcn (1), locomotion (1), machine-translation (1), mamba (1), mdl (1), message-passing (1), minimum-description-length (1), model-parallelism (1), model-predictive-control (1), model-selection (1), modular (1), molecular-property-prediction (1), multi-embodiment (1), multi-task (1), natural-language (1), neural-radiance-fields (1), neuro-symbolic (1), obsidian (1), open-world (1), optimization (1), orchestration (1), parallel-architecture (1), parameter-efficient (1), permutation-invariance (1), personalization (1), physical-ai (1), pipeline-parallelism (1), pointer-mechanism (1), privileged-supervision (1), probabilistic-planning (1), proprioception (1), quantization (1), queue (1), radar (1), recurrent-neural-networks (1), representation-learning (1), scene-understanding (1), search (1), seminal (1), sensor-fusion (1), set-modeling (1), siamese-networks (1), simulator (1), source (1), sparse-models (1), spatial-reasoning (1), speech-recognition (1), survey (1), synthesis (1), taxonomy (1), temporal (1), temporal-modeling (1), thesis (1), tokenization (1), tool-use (1), training (1), uniad (1), unified-stack (1), vanishing-gradients (1), variational-autoencoders (1), video-generation (1), video-understanding (1), visual-traces (1), vit (1)

Pages tagged vla

3D-VLA: A 3D Vision-Language-Action Generative World Model
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

A Generalist Agent
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2205.06175)** Reed et al., Transactions on Machine Learning Research (TMLR), 2022. Gato, developed by DeepMind, is a single transform…

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
source-summary

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Marco Pavone + 37 co-authors (NVIDIA), arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2511.00088)** Alpamayo-R1 is NVIDIA's production-grade Vision-Language-Action (…

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
source-summary

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang, arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2503.07608)** AlphaDrive is the first application of GRPO (Group Relative Policy Optimization) reinforc…

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2506.13757)** AutoVLA presents a unified approach to autonomous driving that integrates vision, language understanding, and action generation within a single autoregressive mode…

CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
source-summary

[Read on arXiv](https://arxiv.org/abs/2502.05855) DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
source-summary

**[Read on arXiv](https://arxiv.org/abs/2503.19757)** Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffu…

DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
source-summary

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao, IEEE Robotics and Automation Letters, 2024. 📄 **[Read on arXiv](https://arxiv.org/abs/2310.01412)** DriveGPT4 applie…

DriveLM: Driving with Graph Visual Question Answering
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.14150)** DriveLM formalizes driving reasoning as Graph Visual Question Answering (GVQA), where QA pairs are connected via logical dependencies forming a reasoning graph tha…

DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09245)** DriveMLM proposes using a multimodal LLM as a plug-and-play behavioral planning module within existing autonomous driving stacks (Apollo, Autoware), rather than re…

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.16278)** DriveMoE introduces a dual-level Mixture-of-Experts (MoE) architecture to driving Vision-Language-Action models. The key innovation is applying expert specializati…
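
The excerpt cuts off mid-sentence, but the mechanism it names, sparse expert routing, is easy to pin down. Below is a minimal top-k Mixture-of-Experts layer as a sketch; DriveMoE's dual-level (scene and action) routing is more involved, and every name and dimension here is illustrative:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer: route each token to its top-k experts
    and mix their outputs by the renormalized router weights."""

    def __init__(self, dim: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)                 # renormalize over the k chosen
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # dispatch each routing slot
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e                  # tokens whose slot chose expert e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out
```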

EMMA: End-to-End Multimodal Model for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…

End-to-end Driving via Conditional Imitation Learning
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1710.02410)** This paper introduces conditional imitation learning for end-to-end autonomous driving, where a neural network policy is conditioned on a discrete high-level comma…

FAST: Efficient Action Tokenization for Vision-Language-Action Models
source-summary

[Read on arXiv](https://arxiv.org/abs/2501.09747) FAST (Frequency-space Action Sequence Tokenization) introduces a novel action tokenizer for VLA models that leverages signal processing to dramatically compress robot ac…
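
The compression idea in the excerpt can be made concrete. A minimal sketch of frequency-space action tokenization, assuming a per-dimension DCT, low-frequency truncation, and uniform quantization; the full FAST pipeline additionally compresses the quantized coefficients (with byte-pair encoding), and every parameter below is illustrative:

```python
import numpy as np
from scipy.fft import dct, idct

def tokenize_actions(chunk: np.ndarray, n_coeffs: int = 8,
                     n_bins: int = 256, scale: float = 8.0) -> np.ndarray:
    """Compress a (T, D) action chunk: per-dimension DCT, keep the lowest
    n_coeffs frequencies, uniformly quantize to integer token ids."""
    coeffs = dct(chunk, axis=0, norm="ortho")[:n_coeffs]        # (n_coeffs, D)
    frac = np.clip(coeffs / scale, -1.0, 1.0)                   # scale: assumed coeff range
    return np.round((frac + 1.0) / 2.0 * (n_bins - 1)).astype(int).ravel()

def detokenize_actions(tokens: np.ndarray, horizon: int, dim: int,
                       n_coeffs: int = 8, n_bins: int = 256, scale: float = 8.0):
    """Undo the quantization, zero-pad the high frequencies, invert the DCT."""
    coeffs = (tokens.reshape(n_coeffs, dim) / (n_bins - 1) * 2.0 - 1.0) * scale
    padded = np.zeros((horizon, dim))
    padded[:n_coeffs] = coeffs
    return idct(padded, axis=0, norm="ortho")

chunk = np.tanh(np.cumsum(0.1 * np.random.randn(50, 7), axis=0))  # smooth 7-DoF chunk
tokens = tokenize_actions(chunk)
recon = detokenize_actions(tokens, horizon=50, dim=7)
print(f"{chunk.size} floats -> {tokens.size} tokens, max error {np.abs(chunk - recon).max():.3f}")
```

Smooth robot trajectories concentrate energy in low frequencies, which is why dropping the tail of the spectrum compresses well.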

GPT-Driver: Learning to Drive with GPT
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2310.01415)** GPT-Driver reformulates autonomous driving motion planning as a language modeling problem. Scene context (object positions, velocities, lane geometry) and ego vehi…
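
Since the excerpt names the core move, serializing scene context so that planning becomes next-token prediction, here is a minimal sketch of such a serialization; the field names and waypoint format are illustrative, not GPT-Driver's actual prompt schema:

```python
def scene_to_prompt(objects: list[dict], ego: dict, goal: str) -> str:
    """Render perception outputs as text so motion planning can be posed
    as language modeling (illustrative format only)."""
    lines = ["Objects:"]
    for o in objects:
        lines.append(f"- {o['type']} at ({o['x']:.1f}, {o['y']:.1f}), "
                     f"velocity ({o['vx']:.1f}, {o['vy']:.1f}) m/s")
    lines.append(f"Ego speed: {ego['v']:.1f} m/s, heading {ego['yaw']:.2f} rad")
    lines.append(f"Goal: {goal}")
    lines.append("Plan a 3 s trajectory as (x, y) waypoints at 0.5 s intervals:")
    return "\n".join(lines)

prompt = scene_to_prompt(
    objects=[{"type": "car", "x": 12.0, "y": -1.5, "vx": 4.2, "vy": 0.0}],
    ego={"v": 6.0, "yaw": 0.02},
    goal="turn left at the next intersection",
)
print(prompt)  # feed to any chat/completions LLM endpoint
```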

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14734)** GR00T N1 addresses the challenge of creating general-purpose humanoid robots through an innovative "data pyramid" approach. Rather than relying solely on expensive…

Helix: A Vision-Language-Action Model for Generalist Humanoid Control
source-summary

📄 **[Read at Figure AI](https://www.figure.ai/news/helix)** Helix (Figure AI, Technical Report February 2025) is the first vision-language-action model to achieve high-rate continuous control of an entire…

Knowledge Insulating Vision-Language-Action Models
source-summary

[Read on arXiv](https://arxiv.org/abs/2505.23705) This paper from Physical Intelligence identifies and addresses a critical problem in VLA training: gradient interference causes the pre-trained VLM backbone to degrade w…

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

LMDrive: Closed-Loop End-to-End Driving with Large Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07488)** LMDrive is the first system to demonstrate and benchmark LLM-based driving in closed-loop simulation, introducing the LangAuto benchmark with ~64K instruction-foll…

Open Questions: Vision-Language-Action Models
query

Stream-specific open questions for the VLA pillar. See wiki/queries/open-questions for the full tree across all streams. 1. **Dual-system generality:** The dual-system pattern (slow VLM at 7-10 Hz + fast motor policy at…
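
For question 1, the dual-system pattern is concrete enough to sketch: a slow vision-language backbone refreshes a latent plan at a few Hz while a fast motor policy reads the latest latent at control rate. The rates, names, and call signatures below are illustrative assumptions, not any one paper's design:

```python
import time

class DualSystemController:
    """Sketch of the dual-system VLA pattern: a slow VLM refreshes a latent
    plan at a few Hz; a fast policy consumes the most recent latent every
    control tick. In a real system the slow path runs in its own thread so
    it never stalls the fast loop; it is inlined here for brevity."""

    def __init__(self, vlm, policy, vlm_hz: float = 8.0, control_hz: float = 200.0):
        self.vlm, self.policy = vlm, policy
        self.vlm_period = 1.0 / vlm_hz
        self.control_period = 1.0 / control_hz  # caller sleeps this long between steps
        self.latent = None
        self.last_vlm_time = float("-inf")

    def step(self, image, instruction, proprio):
        now = time.monotonic()
        if now - self.last_vlm_time >= self.vlm_period:  # slow path (~8 Hz)
            self.latent = self.vlm(image, instruction)
            self.last_vlm_time = now
        return self.policy(self.latent, proprio)         # fast path (~200 Hz)
```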

OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision Language Action Model
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.23463)** OpenDriveVLA introduces a Vision-Language Action model specifically designed for end-to-end autonomous driving. Unlike previous approaches that use VLMs as supplem…

OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
source-summary

[Read on arXiv](https://arxiv.org/abs/2502.19645) OpenVLA-OFT presents a systematic empirical study of fine-tuning strategies for Vision-Language-Action models, identifying a recipe that boosts the original OpenVLA from…

OpenVLA: An Open-Source Vision-Language-Action Model
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2406.09246)** OpenVLA is a 7-billion parameter open-source vision-language-action model that demonstrates generalist robotic manipulation by fine-tuning a pretrained vision-lang…

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19755)** ORION bridges the reasoning-action gap in driving VLAs through a three-component architecture consisting of QT-Former (visual encoding), an LLM reasoning core, and…

pi*0.6: A VLA That Learns From Experience
source-summary

[Read on arXiv](https://arxiv.org/abs/2511.14759) pi*0.6 extends the pi0/pi0.5/pi0.6 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While prior models learn prim…

pi0.5: A Vision-Language-Action Model with Open-World Generalization
source-summary

[Read on arXiv](https://arxiv.org/abs/2504.16054) pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…

pi0: A Vision-Language-Action Flow Model for General Robot Control
source-summary

[Read on arXiv](https://arxiv.org/abs/2410.24164) pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…
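
The excerpt's key innovation is the flow-based action head; a generic conditional flow-matching training step and Euler sampler sketch the idea. Straight-line interpolation paths are assumed, this is not pi0's exact recipe, and `action_head` is a hypothetical network taking noisy actions, a time scalar, and observation embeddings:

```python
import torch

def flow_matching_loss(action_head, obs_emb, actions):
    """Train the head to predict the velocity that carries Gaussian noise
    to the expert action chunk along a straight-line path."""
    noise = torch.randn_like(actions)                     # actions: (B, H, D)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions                   # point on the path at time t
    pred_v = action_head(x_t, t.view(-1), obs_emb)
    return ((pred_v - (actions - noise)) ** 2).mean()     # constant-velocity target

@torch.no_grad()
def sample_actions(action_head, obs_emb, horizon, dim, steps=10):
    """Integrate the learned velocity field from noise (t=0) to actions (t=1)."""
    x = torch.randn(obs_emb.shape[0], horizon, dim, device=obs_emb.device)
    for i in range(steps):
        t = torch.full((x.shape[0],), i / steps, device=x.device)
        x = x + action_head(x, t, obs_emb) / steps        # Euler step of size 1/steps
    return x
```

A handful of integration steps suffices at inference, which is what makes flow heads attractive for real-time control.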

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.03661)** Reason2Drive provides the largest reasoning chain dataset for driving (>600K video-text pairs from nuScenes, Waymo, and ONCE) and introduces an aggregated evaluati…

RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

Robotic Control via Embodied Chain-of-Thought Reasoning
source-summary

[Read on arXiv](https://arxiv.org/abs/2407.08693) ECoT (UC Berkeley / Stanford / University of Warsaw, 2024) introduces Embodied Chain-of-Thought reasoning for Vision-Language-Action (VLA) models, demonstrating that gen…

RoboVLMs: What Matters in Building Vision-Language-Action Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

RT-1: Robotics Transformer for Real-World Control at Scale
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2307.15818)** RT-2 is the defining paper for the modern Vision-Language-Action (VLA) paradigm. It demonstrates that large vision-language models (VLMs) pretrained on internet-sc…
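
The paradigm the excerpt defines, emitting robot actions as ordinary text tokens, rests on a simple discretization. A sketch with uniform binning; the bin count and action ranges here are assumptions for illustration:

```python
import numpy as np

N_BINS = 256  # RT-2-style uniform discretization; bin count assumed here

def action_to_tokens(action, low, high):
    """Map a continuous action vector to integer bins so a VLM can emit it
    as reserved vocabulary tokens."""
    frac = (np.asarray(action) - low) / (high - low)
    return np.clip((frac * (N_BINS - 1)).round().astype(int), 0, N_BINS - 1)

def tokens_to_action(tokens, low, high):
    """Invert the binning back to (approximate) continuous values."""
    return low + (np.asarray(tokens) / (N_BINS - 1)) * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)  # 7-DoF arm, assumed normalized range
tokens = action_to_tokens(np.array([0.1, -0.4, 0.9, 0.0, 0.0, 0.2, -1.0]), low, high)
print(tokens, tokens_to_action(tokens, low, high))
```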

RT-H: Action Hierarchies Using Language
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

Self-Improving Embodied Foundation Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.22313)** Two dominant paradigms exist in autonomous driving: large vision-language models (LVLMs) with strong reasoning but poor trajectory precision, and end-to-end (E2E)…

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.09594)** Many driving VLM efforts improve language understanding (VQA, scene descriptions) but sacrifice actual driving performance. A model can correctly answer questions…

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
source-summary

**[Read on arXiv](https://arxiv.org/abs/2506.01844)** SmolVLA is a 450M-parameter open-source VLA model from Hugging Face that demonstrates competitive performance with models 10x larger while being trainable on a singl…

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2505.16805) SOLVE proposes a synergistic framework that combines a Vision-Language Model (VLM) reasoning branch (SOLVE-VLM) with an end-to-end (E2E) driving network (SOLVE-E2E), con…

SpatialVLA: Exploring Spatial Representations for VLA Models
source-summary

[Read on arXiv](https://arxiv.org/abs/2501.15830) SpatialVLA addresses a fundamental limitation of existing VLA models: they operate on 2D visual inputs despite robot manipulation requiring understanding of 3D spatial r…

Talk2Car: Taking Control of Your Self-Driving Car
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1909.10838)** For autonomous vehicles to be truly useful as personal transportation, passengers should be able to issue natural-language commands like "park behind that blue car…

Textual Explanations for Self-Driving Vehicles
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1807.11546)** End-to-end driving models produce control signals without any rationale, making them opaque and untrustworthy for safety-critical deployment. This paper by Kim et…

Vision Language Action
concept

This page tracks the bridge from multimodal understanding to action generation, informed by the AutoVLA corpus of 18 papers spanning 2018–2025. A VLA system consumes visual context and language-conditioned intent, then…
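
The truncated definition above reduces to a small contract shared by every page on this list: multimodal context in, an action chunk out. A sketch of that interface, with all field names illustrative:

```python
from dataclasses import dataclass
from typing import Protocol
import numpy as np

@dataclass
class Observation:
    images: list[np.ndarray]   # one array per camera
    instruction: str           # language-conditioned intent
    proprio: np.ndarray        # joint positions / ego state

class VLAPolicy(Protocol):
    """Common VLA contract: visual context plus language intent map to a
    chunk of future actions of shape (horizon, action_dim)."""
    def act(self, obs: Observation) -> np.ndarray: ...
```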

VLA and Driving
source-program

This queue spans general VLA foundations and driving-specific multimodal action papers. The AutoVLA corpus (18 papers, 2018–2025) provides the most comprehensive coverage of how language-vision models have been applied…

VLP: Vision Language Planning for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2401.05577)** VLP (Vision Language Planning) by Pan et al. (CVPR 2024) represents a fundamentally different approach to using language in autonomous driving compared to instruct…

WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01941)** End-to-end driving models typically output a single trajectory and trust it entirely, with no mechanism to evaluate whether the predicted path is safe before execu…
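
The mechanism the excerpt describes, checking candidate trajectories against a predictive model before committing, can be sketched as follows; `world_model` and `scorer` are hypothetical stand-ins for WoTE's BEV world model and its evaluation head:

```python
import torch

@torch.no_grad()
def select_trajectory(world_model, scorer, bev_state, candidates):
    """Online trajectory evaluation: roll the BEV world model forward for
    each candidate trajectory, score the predicted future, keep the best.
    All interfaces here are assumptions."""
    scores = []
    for traj in candidates:                       # list of (H, 2) waypoint tensors
        future_bev = world_model(bev_state, traj) # predicted future BEV state
        scores.append(scorer(future_bev, traj))   # e.g. collision risk + comfort
    return candidates[int(torch.stack(scores).argmax())]
```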