Pages tagged vlm
Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Marco Pavone + 37 co-authors (NVIDIA), arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2511.00088)** Alpamayo-R1 is NVIDIA's production-grade Vision-Language-Action (…
Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang, arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2503.07608)** AlphaDrive is the first application of GRPO (Group Relative Policy Optimization) reinforc…
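The core of GRPO is easy to state in code: rather than training a value-function critic, each sampled rollout is scored against the mean and standard deviation of its own sampling group. A minimal sketch of that normalization; the reward values and group size are illustrative, not AlphaDrive's actual reward design:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """GRPO advantage: normalize each rollout's reward by the mean and
    std of its own group, replacing a learned value-function critic."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Illustrative only: G=4 planning rollouts scored by some driving reward.
# The highest-reward rollout gets the largest positive advantage.
print(grpo_advantages([0.9, 0.4, 0.7, 0.1]))
```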
Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao, IEEE Robotics and Automation Letters, 2024. 📄 **[Read on arXiv](https://arxiv.org/abs/2310.01412)** DriveGPT4 applie…
📄 **[Read on arXiv](https://arxiv.org/abs/2312.14150)** DriveLM formalizes driving reasoning as Graph Visual Question Answering (GVQA), where QA pairs are connected via logical dependencies forming a reasoning graph tha…
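The GVQA idea lends itself to a small data-structure sketch: QA nodes linked by logical dependencies and answered in topological order, so perception questions are resolved before prediction, and prediction before planning. Field names below are illustrative, not DriveLM's actual schema:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class QANode:
    """One QA pair in a GVQA-style reasoning graph (illustrative schema)."""
    question: str
    answer: str
    depends_on: list = field(default_factory=list)  # parent QANodes

perception = QANode("What objects are ahead?", "A pedestrian at the crosswalk.")
prediction = QANode("What will the pedestrian do?", "Cross the road.",
                    depends_on=[perception])
planning = QANode("What should the ego vehicle do?", "Slow down and yield.",
                  depends_on=[prediction])

def answer_order(node, seen=None):
    """Topological walk: parents are answered before children."""
    seen = [] if seen is None else seen
    for parent in node.depends_on:
        answer_order(parent, seen)
    if node not in seen:
        seen.append(node)
    return seen

for n in answer_order(planning):
    print(n.question, "->", n.answer)
```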
📄 **[Read on arXiv](https://arxiv.org/abs/2402.12289)** DriveVLM proposes a hierarchical approach to integrating Vision-Language Models into autonomous driving, emphasizing scene understanding and multi-level planning r…
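As a rough picture of what "multi-level planning" means as an interface, the sketch below refines a scene description into coarse meta-actions, a language-level decision, and finally waypoints. Every name and the placeholder rule are assumptions for illustration, not DriveVLM's pipeline:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Plan:
    meta_actions: List[str]               # coarse, language-level actions
    decision: str                         # natural-language decision description
    waypoints: List[Tuple[float, float]]  # fine-grained ego-frame (x, y), meters

def hierarchical_plan(scene_description: str) -> Plan:
    # Placeholder rule standing in for the VLM's multi-level reasoning.
    if "pedestrian" in scene_description:
        return Plan(["decelerate", "yield"],
                    "Slow to a stop before the crosswalk.",
                    [(2.0, 0.0), (3.5, 0.0), (4.0, 0.0)])
    return Plan(["keep_lane"], "Maintain speed.", [(5.0, 0.0), (10.0, 0.0)])

print(hierarchical_plan("A pedestrian is entering the crosswalk."))
```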
📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…
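The "everything as language tokens" paradigm boils down to serialization: if waypoints round-trip through plain text, one autoregressive model can emit them like any other answer. The exact format below is an assumption, not EMMA's:

```python
# Serialize future waypoints as text so a multimodal LLM can emit them
# alongside answers; round-trip back to floats for the controller.

def waypoints_to_text(waypoints, precision=2):
    return " ".join(f"({x:.{precision}f},{y:.{precision}f})" for x, y in waypoints)

def text_to_waypoints(text):
    pairs = text.replace("(", "").split(")")
    return [tuple(map(float, p.split(","))) for p in pairs if p.strip()]

traj = [(1.25, 0.0), (2.5, 0.1), (3.8, 0.3)]
encoded = waypoints_to_text(traj)
print(encoded)                    # "(1.25,0.00) (2.50,0.10) (3.80,0.30)"
print(text_to_waypoints(encoded)) # round-trips back to float pairs
```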
📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…
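Architecturally, Flamingo conditions a frozen language model on vision through tanh-gated cross-attention layers whose gates are initialized to zero, so training starts from the unmodified LM. A minimal PyTorch sketch of that layer; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style tanh-gated cross-attention: text hidden states attend
    to visual features; a zero-initialized gate makes the layer an identity
    mapping at the start of training."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity

    def forward(self, text_hidden, visual_feats):
        attended, _ = self.attn(text_hidden, visual_feats, visual_feats)
        return text_hidden + torch.tanh(self.gate) * attended

x = torch.randn(2, 16, 512)  # (batch, text tokens, dim)
v = torch.randn(2, 64, 512)  # (batch, visual tokens, dim)
print(GatedCrossAttention(512)(x, v).shape)  # torch.Size([2, 16, 512])
```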
Foundation models -- large models pretrained on broad data and adapted to downstream tasks -- are reshaping autonomous driving. This page tracks how LLMs, VLMs, and diffusion models influence autonomy, and examines the…
📄 **[Read on arXiv](https://arxiv.org/abs/2503.19755)** ORION bridges the reasoning-action gap in driving VLAs through a three-component architecture consisting of QT-Former (visual encoding), an LLM reasoning core, and…
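To make the three-component flow concrete, here is a skeletal stand-in: learned queries cross-attend to image features (QT-Former-like compression), a small transformer stands in for the LLM reasoning core, and a linear head decodes a trajectory. Every module and dimension is an illustrative assumption, not ORION's implementation:

```python
import torch
import torch.nn as nn

class DrivingVLAPipeline(nn.Module):
    def __init__(self, dim=256, num_queries=32, action_dim=2, horizon=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learned queries
        self.vision_xattn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2)
        self.action_head = nn.Linear(dim, action_dim * horizon)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image_feats):
        q = self.queries.expand(image_feats.size(0), -1, -1)
        scene_tokens, _ = self.vision_xattn(q, image_feats, image_feats)
        reasoned = self.reasoner(scene_tokens)         # stand-in for the LLM core
        flat = self.action_head(reasoned.mean(dim=1))  # pooled -> trajectory
        return flat.view(-1, self.horizon, self.action_dim)

feats = torch.randn(2, 100, 256)  # (batch, image patches, dim)
print(DrivingVLAPipeline()(feats).shape)  # torch.Size([2, 8, 2])
```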
📄 **[Read on arXiv](https://arxiv.org/abs/2303.03378)** PaLM-E is a 562-billion parameter embodied multimodal language model created by Google that injects continuous sensor observations (images, point clouds, robot sta…
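The injection mechanism is simple to sketch: continuous sensor features are projected into the LM's token-embedding space and interleaved with word embeddings, so the LM consumes them like ordinary tokens. The projector and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

lm_dim = 512
image_feat_dim = 768
project = nn.Linear(image_feat_dim, lm_dim)  # sensor features -> "soft tokens"

word_embeds = torch.randn(1, 10, lm_dim)         # embeddings of a text prompt
image_feats = torch.randn(1, 4, image_feat_dim)  # e.g. 4 pooled image patches
soft_tokens = project(image_feats)

# Interleave: [text prefix][image tokens][text suffix] forms one sequence
# the language model consumes like ordinary token embeddings.
sequence = torch.cat([word_embeds[:, :5], soft_tokens, word_embeds[:, 5:]], dim=1)
print(sequence.shape)  # torch.Size([1, 14, 512])
```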
📄 **[Read on arXiv](https://arxiv.org/abs/2410.22313)** Two dominant paradigms exist in autonomous driving: large vision-language models (LVLMs) with strong reasoning but poor trajectory precision, and end-to-end (E2E)…
📄 **[Read on arXiv](https://arxiv.org/abs/2503.09594)** Many driving VLM efforts improve language understanding (VQA, scene descriptions) but sacrifice actual driving performance. A model can correctly answer questions…
This page tracks the bridge from multimodal understanding to action generation, informed by the AutoVLA corpus of 18 papers spanning 2018–2025. A VLA system consumes visual context and language-conditioned intent, then…
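As a reference point for the pages below, a minimal sketch of the VLA contract: visual context plus a language instruction in, an optional rationale and an action trajectory out. All names are illustrative, not any specific paper's API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VLAInput:
    camera_frames: list  # raw or encoded images
    instruction: str     # language-conditioned intent

@dataclass
class VLAOutput:
    rationale: str                         # optional reasoning text
    trajectory: List[Tuple[float, float]]  # future ego waypoints (x, y), meters

def vla_step(inp: VLAInput) -> VLAOutput:
    # Placeholder standing in for a trained vision-language-action model.
    return VLAOutput(rationale=f"Executing: {inp.instruction}",
                     trajectory=[(1.0, 0.0), (2.0, 0.5), (3.0, 1.2)])

out = vla_step(VLAInput(camera_frames=[], instruction="turn left at the light"))
print(out.rationale, out.trajectory)
```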
This queue spans general VLA foundations and driving-specific multimodal action papers. The AutoVLA corpus (18 papers, 2018–2025) provides the most comprehensive coverage of how language-vision models have been applied…
📄 **[Read on arXiv](https://arxiv.org/abs/2401.05577)** VLP (Vision Language Planning) by Pan et al. (CVPR 2024) represents a fundamentally different approach to using language in autonomous driving compared to instruct…