Tags
239 tags across the wiki
Pages tagged multimodal
📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…
📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…
📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…
📄 **[Read on arXiv](https://arxiv.org/abs/2501.09757)** DiMA addresses the core tension in autonomous driving between vision-based planners (efficient but fragile on rare scenarios) and LLM-based approaches (strong reas…
📄 **[Read on arXiv](https://arxiv.org/abs/2309.10228)** Drive as You Speak (DAYS) proposes a framework for enabling natural language interaction between human passengers and autonomous vehicles using large language mode…
📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…
📄 **[Read on arXiv](https://arxiv.org/abs/2507.06261)** Gemini 2.5 is Google's frontier multimodal model family, built on a sparse Mixture-of-Experts (MoE) Transformer architecture. It represents a major advance in reas…
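The Gemini 2.5 entry above names a sparse Mixture-of-Experts Transformer. A minimal sketch of token-level top-k expert routing, the general technique that term refers to, is below; the expert count, hidden sizes, and top-k here are illustrative placeholders, not Gemini's actual configuration.

```python
# Illustrative sparse MoE layer: each token is routed to its top-k experts
# and the expert outputs are combined with renormalized gate weights.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(SparseMoE()(tokens).shape)  # torch.Size([16, 64])
```

The point of the sparsity is that only k of the n expert MLPs run per token, so parameter count grows without a proportional increase in per-token compute.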
📄 **[Read on arXiv](https://arxiv.org/abs/2503.20020)** Gemini Robotics introduces a family of AI models built on Gemini 2.0 designed to extend advanced multimodal capabilities into physical robotics. The work addresses…
📄 **[Read on arXiv](https://arxiv.org/abs/2503.19786)** Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by…
📄 **[Read on arXiv](https://arxiv.org/abs/2303.08774)** GPT-4 is a large-scale multimodal Transformer model developed by OpenAI that accepts both image and text inputs and produces text outputs. It represents a major st…
📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…
📄 **[Read on arXiv](https://arxiv.org/abs/2204.06125)** DALL-E 2 (internally called unCLIP) introduces a hierarchical approach to text-conditional image generation that leverages CLIP's joint text-image embedding space…
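The unCLIP summary above describes a hierarchy built on CLIP's joint embedding space: a prior maps a text embedding to a plausible CLIP image embedding, and a decoder renders pixels from it. A minimal sketch of that two-stage composition follows; the encoder, prior, and decoder here are toy stand-ins (in the paper the prior is a diffusion or autoregressive model and the decoder is a diffusion model).

```python
# Two-stage text-to-image pipeline in the unCLIP style, with stand-in models.
import torch

def unclip_generate(caption, clip_text_encoder, prior, decoder):
    z_text = clip_text_encoder(caption)  # CLIP text embedding
    z_image = prior(z_text)              # predicted CLIP *image* embedding
    return decoder(z_image)              # pixels conditioned on the embedding

d = 512
image = unclip_generate(
    "a corgi playing a trumpet",
    clip_text_encoder=lambda s: torch.randn(d),
    prior=lambda z: z + 0.1 * torch.randn(d),      # toy prior
    decoder=lambda z: torch.rand(3, 64, 64),       # toy decoder
)
print(image.shape)  # torch.Size([3, 64, 64])
```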
📄 **[Read on arXiv](https://arxiv.org/abs/2406.06978)** Hydra-MDP addresses a fundamental limitation of imitation learning for autonomous driving: standard behavior cloning learns only to mimic human demo…
📄 **[Read on arXiv](https://arxiv.org/abs/2310.03026)** LanguageMPC addresses a fundamental limitation in autonomous driving: traditional planners (MPC, RL) struggle with complex scenarios that require high-level reason…
📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…
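The CLIP entry above describes jointly training an image and a text encoder from natural language supervision; the training signal is a symmetric contrastive (InfoNCE) loss over a batch of matched image-caption pairs. A minimal sketch of that loss is below, with random tensors standing in for encoder outputs.

```python
# CLIP-style symmetric contrastive loss: row i of the similarity matrix
# should match column i (the paired caption), in both directions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)  # cosine similarity via
    text_emb = F.normalize(text_emb, dim=-1)    # normalized dot products
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))      # diagonal = correct pairs
    loss_i = F.cross_entropy(logits, targets)       # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)   # text -> image
    return (loss_i + loss_t) / 2

img = torch.randn(8, 512)  # stand-ins for encoder outputs
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```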
📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…
📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…
📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…
📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…
📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…
📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…
📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…
📄 **[Read on arXiv](https://arxiv.org/abs/2505.24139)** S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-vi…
📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…
📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…
📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…
This page tracks the bridge from multimodal understanding to action generation, informed by the AutoVLA corpus of 18 papers spanning 2018–2025. A VLA system consumes visual context and language-conditioned intent, then…
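A minimal sketch of that observe-and-act loop is below. The class and field names are hypothetical, and the zero 7-DoF action is a placeholder; real VLA systems differ in how the VLM backbone tokenizes and decodes actions.

```python
# Skeleton of a VLA policy interface: visual context plus a language
# instruction in, a low-level action out, executed in a closed loop.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray    # camera frame, e.g. (224, 224, 3)
    instruction: str   # language-conditioned intent

class VLAPolicy:
    def act(self, obs: Observation) -> np.ndarray:
        # Stand-in: a real policy runs a VLM backbone and decodes action
        # tokens; here we return a zero 7-DoF command
        # (xyz delta, rpy delta, gripper).
        return np.zeros(7)

policy = VLAPolicy()
obs = Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                  instruction="pick up the red block")
action = policy.act(obs)  # execute, observe the new state, repeat
print(action.shape)       # (7,)
```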
📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…
📄 **[Read on arXiv](https://arxiv.org/abs/2307.05973)** VoxPoser addresses a fundamental bottleneck in robot manipulation: translating open-ended natural language instructions into precise physical actions without requi…
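VoxPoser's core move is to turn a language instruction into a 3D voxel value map that a motion planner can optimize over. A minimal sketch of such a value map is below; the grid resolution, workspace size, and distance-based cost are illustrative choices, not the paper's exact formulation.

```python
# Toy 3D value map: low cost marks where the gripper should go; a planner
# would search this grid for low-cost waypoints.
import numpy as np

def make_value_map(target_xyz, grid=(50, 50, 50), workspace=1.0):
    axes = [np.linspace(0, workspace, n) for n in grid]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    # Cost grows with distance from the language-specified target point.
    return np.sqrt((xs - target_xyz[0]) ** 2
                   + (ys - target_xyz[1]) ** 2
                   + (zs - target_xyz[2]) ** 2)

vmap = make_value_map(target_xyz=(0.5, 0.3, 0.2))
best = np.unravel_index(vmap.argmin(), vmap.shape)  # candidate waypoint
print(best)
```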