
Tags

239 tags across the wiki

paper 114 autonomous-driving 92 foundation-model 55 transformer 53 vla 49 planning 42 robotics 41 computer-vision 36 perception 32 ilya-30 29 multimodal 29 nlp 26 end-to-end 24 language-modeling 24 llm 17 reasoning 17 imitation-learning 16 3d-occupancy 15 vlm 15 bev 14 diffusion 13 e2e 12 reinforcement-learning 12 world-model 12 chain-of-thought 10 benchmark 9 scaling 9 cross-embodiment 7 driving 6 gaussian-splatting 6 generative-models 6 image-classification 6 information-theory 6 questions 6 self-supervised 6 sources 6 alignment 5 attention 5 cnn 5 foundation 5 knowledge-distillation 5 language-model 5 prediction 5 simulation 5 evaluation 4 image-generation 4 instruction-tuning 4 mixture-of-experts 4 rnn 4 sequence-to-sequence 4 sparse-representation 4 video-prediction 4 explainability 3 flow-matching 3 lstm 3 map 3 occupancy 3 open-source 3 semantic-segmentation 3 sequence-modeling 3 trajectory-prediction 3 vectorized-representation 3 3d-detection 2 3d-perception 2 3d-reconstruction 2 action-representation 2 autonomy 2 autoregressive 2 bimanual 2 closed-loop 2 complexity-theory 2 dataset 2 deployment 2 distributed-training 2 efficient-inference 2 embodied 2 fine-tuning 2 foundation-models 2 foundational 2 gaussian-representation 2 generation 2 generative 2 human-interaction 2 humanoid 2 manipulation 2 memory-augmented-networks 2 ml 2 multi-camera 2 multilingual 2 object-detection 2 parameter-efficient-fine-tuning 2 prompting 2 real-time 2 regularization 2 relational-reasoning 2 residual-networks 2 rlhf 2 scaling-laws 2 segmentation 2 self-improvement 2 self-supervised-learning 2 state-space 2 systems 2 thermodynamics 2 vision-language-model 2 vision-transformer 2 visual-question-answering 2 zero-shot 2 3d 1 3d-scene 1 3d-semantic-occupancy 1 agenda 1 agentic 1 agi 1 algorithmic-information-theory 1 algorithmic-randomness 1 asynchronous 1 attention-mechanism 1 batch 1 bayesian-inference 1 behavior-forecasting 1 camera-fusion 1 classifier-guidance 1 
combinatorial-optimization 1 comparison 1 compression 1 computability 1 concept 1 contrastive-learning 1 control 1 convolutional-neural-networks 1 corpus 1 course 1 data-collection 1 decoupled 1 deep-learning 1 denoising 1 depth-estimation 1 dexterous-manipulation 1 differentiable-programming 1 diffusion-policy 1 diffusion-transformer 1 dilated-convolutions 1 dropout 1 efficient 1 embodied-ai 1 embodiment 1 emergent-abilities 1 end-to-end-learning 1 evaluation-metric 1 few-shot 1 few-shot-learning 1 foundations 1 frontend 1 gaussian 1 gaussian-rendering 1 generalist-agent 1 generalization 1 gpu-training 1 graph-neural-networks 1 grounding 1 grpo 1 hierarchical 1 high-frequency-control 1 hosting 1 ilya 1 image-captioning 1 image-text-retrieval 1 in-context-learning 1 inductive-bias 1 intelligence-measurement 1 interactive-annotation 1 interactive-segmentation 1 knowledge-preservation 1 kolmogorov-complexity 1 lanegcn 1 locomotion 1 machine-translation 1 mamba 1 mdl 1 message-passing 1 minimum-description-length 1 model-parallelism 1 model-predictive-control 1 model-selection 1 modular 1 molecular-property-prediction 1 multi-embodiment 1 multi-task 1 natural-language 1 neural-radiance-fields 1 neuro-symbolic 1 obsidian 1 open-world 1 optimization 1 orchestration 1 parallel-architecture 1 parameter-efficient 1 permutation-invariance 1 personalization 1 physical-ai 1 pipeline-parallelism 1 pointer-mechanism 1 privileged-supervision 1 probabilistic-planning 1 proprioception 1 quantization 1 queue 1 radar 1 recurrent-neural-networks 1 representation-learning 1 scene-understanding 1 search 1 seminal 1 sensor-fusion 1 set-modeling 1 siamese-networks 1 simulator 1 source 1 sparse-models 1 spatial-reasoning 1 speech-recognition 1 survey 1 synthesis 1 taxonomy 1 temporal 1 temporal-modeling 1 thesis 1 tokenization 1 tool-use 1 training 1 uniad 1 unified-stack 1 vanishing-gradients 1 variational-autoencoders 1 video-generation 1 video-understanding 1 visual-traces 1 vit 1

Pages tagged multimodal

3D-VLA: A 3D Vision-Language-Action Generative World Model
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…

CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…

DiMA: Distilling Multi-Modal Large Language Models for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2501.09757)** DiMA addresses the core tension in autonomous driving between vision-based planners (efficient but fragile on rare scenarios) and LLM-based approaches (strong reas…

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2309.10228)** Drive as You Speak (DAYS) proposes a framework for enabling natural language interaction between human passengers and autonomous vehicles using large language mode…

Flamingo: a Visual Language Model for Few-Shot Learning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2507.06261)** Gemini 2.5 is Google's frontier multimodal model family, built on a sparse Mixture-of-Experts (MoE) Transformer architecture. It represents a major advance in reas…

Gemini Robotics: Bringing AI into the Physical World
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.20020)** Gemini Robotics introduces a family of AI models built on Gemini 2.0 designed to extend advanced multimodal capabilities into physical robotics. The work addresses…

Gemma 3 Technical Report
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19786)** Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by…

GPT-4 Technical Report
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2303.08774)** GPT-4 is a large-scale multimodal Transformer model developed by OpenAI that accepts both image and text inputs and produces text outputs. It represents a major st…

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.06125)** DALL-E 2 (internally called unCLIP) introduces a hierarchical approach to text-conditional image generation that leverages CLIP's joint text-image embedding space…

Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.06978)** Hydra-MDP addresses a fundamental limitation of imitation learning for autonomous driving: standard behavior cloning learns only to mimic human demo…

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2310.03026)** LanguageMPC addresses a fundamental limitation in autonomous driving: traditional planners (MPC, RL) struggle with complex scenarios that require high-level reason…

Learning Transferable Visual Models From Natural Language Supervision
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

Octo: An Open-Source Generalist Robot Policy
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…

On the Opportunities and Risks of Foundation Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…

RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

RoboVLMs: What Matters in Building Vision-Language-Action Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

RT-H: Action Hierarchies Using Language
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.24139)** S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-vi…

Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…

UniSim: Learning Interactive Real-World Simulators
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

Vision-Language-Action
concept

This page tracks the bridge from multimodal understanding to action generation, informed by the AutoVLA corpus of 18 papers spanning 2018–2025. A VLA system consumes visual context and language-conditioned intent, then…

Visual Instruction Tuning (LLaVA)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2307.05973)** VoxPoser addresses a fundamental bottleneck in robot manipulation: translating open-ended natural language instructions into precise physical actions without requi…