
Papers

198 paper summaries

SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
2026 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2603.29163)** SparseDriveV2 by Sun et al. (2026) pushes the performance boundary of scoring-based trajectory planning by demonstrating that "scoring is all you ne…

paper autonomous-driving end-to-end sparse-representation +1
DrivoR: Driving on Registers
2026 arXiv 3

📄 **[Read on arXiv](https://arxiv.org/abs/2601.05083)** DrivoR is a full-transformer autonomous driving architecture that uses camera-aware register tokens to compress multi-camera Vision Transformer features into a com…

paper autonomous-driving e2e perception +3
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2025 arXiv 81

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01941)** End-to-end driving models typically output a single trajectory and trust it entirely, with no mechanism to evaluate whether the predicted path is safe before execu…

paper autonomous-driving vla world-model +3
UniAct: Universal Actions for Enhanced Embodied Foundation Models
2025 CVPR 60

**[Read on arXiv](https://arxiv.org/abs/2501.10105)** UniAct addresses a critical challenge in embodied AI: robot action data suffers from severe heterogeneity across platforms, control interfaces, and physical embodime…

paper robotics foundation-model cross-embodiment +1
Towards Embodiment Scaling Laws in Robot Locomotion
2025 CoRL 10

**[Read on arXiv](https://arxiv.org/abs/2505.05753)** This paper investigates whether increasing robot diversity during training improves generalization to unseen robots, analogous to how data scaling improves language…

paper robotics scaling-laws locomotion +1
SpatialVLA: Exploring Spatial Representations for VLA Models
2025 arXiv 292

[Read on arXiv](https://arxiv.org/abs/2501.15830) SpatialVLA addresses a fundamental limitation of existing VLA models: they operate on 2D visual inputs despite robot manipulation requiring understanding of 3D spatial r…

paper robotics vla spatial-reasoning +1
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
2025 CVPR

[Read on arXiv](https://arxiv.org/abs/2505.16805) SOLVE proposes a synergistic framework that combines a Vision-Language Model (VLM) reasoning branch (SOLVE-VLM) with an end-to-end (E2E) driving network (SOLVE-E2E), con…

paper autonomous-driving vla chain-of-thought +1
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
2025 arXiv 224

**[Read on arXiv](https://arxiv.org/abs/2506.01844)** SmolVLA is a 450M-parameter open-source VLA model from Hugging Face that demonstrates competitive performance with models 10x larger while being trainable on a singl…

paper robotics vla efficient +1
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
2025 CVPR 89

📄 **[Read on arXiv](https://arxiv.org/abs/2503.09594)** Many driving VLM efforts improve language understanding (VQA, scene descriptions) but sacrifice actual driving performance. A model can correctly answer questions…

paper autonomous-driving vla vlm +3
Self-Improving Embodied Foundation Models
2025 NeurIPS 18

📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…

paper robotics foundation-model self-improvement +3
S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
2025 CVPR 16

📄 **[Read on arXiv](https://arxiv.org/abs/2505.24139)** S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-vi…

paper autonomous-driving self-supervised multimodal +3
RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
2025 ICLR

[Read on arXiv](https://arxiv.org/abs/2410.07864) RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation --…

paper robotics diffusion bimanual +1
Qwen3 Technical Report
2025 arXiv 3706

📄 **[Read on arXiv](https://arxiv.org/abs/2505.09388)** Qwen3, developed by the Qwen team at Alibaba, represents a major step forward in open-weight language models by offering a comprehensive family spanning both dense…

nlp language-modeling transformer mixture-of-experts +4
Pseudo-Simulation for Autonomous Driving (NAVSIM v2)
2025 CoRL 62

📄 **[Read on arXiv](https://arxiv.org/abs/2506.04218)** Pseudo-Simulation by Cao, Hallgarten et al. (Tübingen / Shanghai AI Lab / NVIDIA / Stanford, CoRL 2025) introduces a novel evaluation paradigm for a…

paper autonomous-driving benchmark simulation +1
pi0.5: A Vision-Language-Action Model with Open-World Generalization
2025 arXiv 681

[Read on arXiv](https://arxiv.org/abs/2504.16054) pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…

paper robotics vla foundation-model +2
pi*0.6: A VLA That Learns From Experience
2025 arXiv 93

[Read on arXiv](https://arxiv.org/abs/2511.14759) pi*0.6 extends the pi0/pi0.5/pi0.6 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While prior models learn prim…

paper robotics vla reinforcement-learning +1
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
2025 arXiv 100

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19755)** ORION bridges the reasoning-action gap in driving VLAs through a three-component architecture consisting of QT-Former (visual encoding), an LLM reasoning core, and…

paper autonomous-driving vla vlm +3
OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
2025 arXiv 364

[Read on arXiv](https://arxiv.org/abs/2502.19645) OpenVLA-OFT presents a systematic empirical study of fine-tuning strategies for Vision-Language-Action models, identifying a recipe that boosts the original OpenVLA from…

paper robotics vla fine-tuning +1
OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision Language Action Model
2025 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2503.23463)** OpenDriveVLA introduces a Vision-Language Action model specifically designed for end-to-end autonomous driving. Unlike previous approaches that use VLMs as supplem…

autonomous-driving vla end-to-end language-model
OccMamba: Semantic Occupancy Prediction with State Space Models
2025 CVPR 32

**[Read on arXiv](https://arxiv.org/abs/2408.09859)** OccMamba is the first Mamba-based network for semantic occupancy prediction, replacing transformer architectures' quadratic complexity with Mamba's linear complexity…

paper autonomous-driving perception 3d-occupancy +2
MomAD: Momentum-Aware Planning in End-to-End Autonomous Driving
2025 CVPR 60

📄 **[Read on arXiv](https://arxiv.org/abs/2503.03125)** End-to-end autonomous driving systems suffer from a critical limitation: temporal inconsistency. Current systems operate in a "one-shot" manner, making trajectory…

autonomous-driving planning end-to-end trajectory-prediction
LAW: Enhancing End-to-End Autonomous Driving with Latent World Model
2025 ICLR

[Read on arXiv](https://arxiv.org/abs/2406.08481) LAW (CASIA, ICLR 2025) introduces a self-supervised latent world model that enhances end-to-end autonomous driving by learning to predict future latent states of the dri…

paper autonomous-driving world-model self-supervised +1
Knowledge Insulating Vision-Language-Action Models
2025 arXiv preprint

[Read on arXiv](https://arxiv.org/abs/2505.23705) This paper from Physical Intelligence identifies and addresses a critical problem in VLA training: gradient interference causes the pre-trained VLM backbone to degrade w…

paper robotics vla knowledge-preservation +1
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
2025 arXiv 38

📄 **[Read on arXiv](https://arxiv.org/abs/2501.14729)** HERMES tackles a fundamental limitation in autonomous driving: existing systems treat 3D scene understanding and future scene generation as separate problems. Driv…

autonomous-driving world-model 3d-scene perception +1
Helix: A Vision-Language-Action Model for Generalist Humanoid Control
2025 Figure AI Technical Report

📄 **[Read at Figure AI](https://www.figure.ai/news/helix)** Helix (Figure AI, Technical Report February 2025) is the first vision-language-action model to achieve high-rate continuous control of an entire…

paper robotics vla humanoid +1
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
2025 arXiv 602

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14734)** GR00T N1 addresses the challenge of creating general-purpose humanoid robots through an innovative "data pyramid" approach. Rather than relying solely on expensive…

robotics foundation-model vla humanoid
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
2025 CVPR

[Read on arXiv](https://arxiv.org/abs/2503.05689) GoalFlow (Horizon Robotics / HKU, CVPR 2025) introduces a goal-driven flow matching framework for multimodal trajectory generation in autonomous driving. The method achi…

paper autonomous-driving flow-matching planning +1
Gemma 3 Technical Report
2025 arXiv 1120

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19786)** Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by…

transformer language-modeling multimodal foundation-model +4
Gemini Robotics: Bringing AI into the Physical World
2025 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2503.20020)** Gemini Robotics introduces a family of AI models built on Gemini 2.0 designed to extend advanced multimodal capabilities into physical robotics. The work addresses…

robotics foundation-model multimodal reasoning
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
2025 arXiv 1943

📄 **[Read on arXiv](https://arxiv.org/abs/2507.06261)** Gemini 2.5 is Google's frontier multimodal model family, built on a sparse Mixture-of-Experts (MoE) Transformer architecture. It represents a major advance in reas…

nlp multimodal foundation-model transformer +5
GaussRender: Learning 3D Occupancy with Gaussian Rendering
2025 ICCV 13

📄 **[Read on arXiv](https://arxiv.org/abs/2502.05040)** GaussRender by Chambon et al. (Valeo AI / Sorbonne, ICCV 2025) introduces a plug-and-play training-time module that improves 3D occupancy prediction…

paper autonomous-driving perception 3d-occupancy +1
GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
2025 CVPR 18

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01957)** Bird's-Eye View (BEV) perception faces a fundamental trade-off between accuracy and computational efficiency. High-performing 3D projection methods like BEVFormer…

paper autonomous-driving bev perception +2
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
2025 ICCV 19

📄 **[Read on arXiv](https://arxiv.org/abs/2502.17288)** GaussianFlowOcc (ICCV 2025) introduces a transformative approach to 3D semantic occupancy estimation for autonomous driving by replacing traditional…

paper autonomous-driving perception 3d-occupancy +2
FAST: Efficient Action Tokenization for Vision-Language-Action Models
2025 RSS 353

[Read on arXiv](https://arxiv.org/abs/2501.09747) FAST (Frequency-space Action Sequence Tokenization) introduces a novel action tokenizer for VLA models that leverages signal processing to dramatically compress robot ac…

paper robotics vla tokenization +1
EMMA: End-to-End Multimodal Model for Autonomous Driving
2025 TMLR 150

📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…

paper autonomous-driving vla vlm +3
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
2025 ICLR 91

📄 **[Read on arXiv](https://arxiv.org/abs/2503.07656)** DriveTransformer represents a fundamental departure from existing end-to-end autonomous driving approaches. Rather than following sequential perception-prediction-…

autonomous-driving transformer end-to-end planning
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
2025 arXiv 55

📄 **[Read on arXiv](https://arxiv.org/abs/2505.16278)** DriveMoE introduces a dual-level Mixture-of-Experts (MoE) architecture to driving Vision-Language-Action models. The key innovation is applying expert specializati…

paper autonomous-driving vla mixture-of-experts +3
DriveGPT: Scaling Autoregressive Behavior Models for Driving
2025 ICML

[Read on arXiv](https://arxiv.org/abs/2412.14415) DriveGPT (Cruise, ICML 2025) is the first work to systematically study scaling laws for autoregressive behavior models in autonomous driving. Drawing inspiration from th…

paper autonomous-driving scaling-laws autoregressive +1
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
2025 ICCV 54

**[Read on arXiv](https://arxiv.org/abs/2503.19757)** Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffu…

paper robotics vla diffusion-transformer +1
DiMA: Distilling Multi-Modal Large Language Models for Autonomous Driving
2025 CVPR 34

📄 **[Read on arXiv](https://arxiv.org/abs/2501.09757)** DiMA addresses the core tension in autonomous driving between vision-based planners (efficient but fragile on rare scenarios) and LLM-based approaches (strong reas…

autonomous-driving knowledge-distillation multimodal language-model
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
2025 CVPR

[Read on arXiv](https://arxiv.org/abs/2411.15139) DiffusionDrive (HUST/Horizon Robotics, CVPR 2025 Highlight) proposes a truncated diffusion model for end-to-end autonomous driving that achieves real-time inference whil…

paper autonomous-driving diffusion end-to-end +1
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
2025 arXiv 140

[Read on arXiv](https://arxiv.org/abs/2502.05855) DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

paper robotics vla diffusion +2
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
2025 arXiv 1920

📄 **[Read on arXiv](https://arxiv.org/abs/2501.12948)** DeepSeek-R1 demonstrates that sophisticated reasoning capabilities -- including self-verification, reflection, and extended chain-of-thought -- can emerge in large…

nlp reinforcement-learning language-modeling reasoning +4
Cosmos World Foundation Model Platform for Physical AI
2025 arXiv 515

📄 **[Read on arXiv](https://arxiv.org/abs/2501.03575)** The Cosmos World Foundation Model Platform addresses Physical AI's critical challenge: the scarcity of safe, high-quality training data. By providing high-fidelity…

world-model foundation-model simulation physical-ai
CarPlanner: Consistent Auto-regressive RL Planner for Autonomous Driving
2025 CVPR

[Read on arXiv](https://arxiv.org/abs/2502.19908) CarPlanner (Zhejiang University + Cainiao Network, CVPR 2025) introduces a consistent autoregressive reinforcement learning planner that is the first RL-based planner to…

paper autonomous-driving reinforcement-learning planning +1
BridgeAD: Bridging Past and Future End-to-End Autonomous Driving with Historical Prediction
2025 CVPR 22

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14182)** BridgeAD tackles a critical limitation in end-to-end autonomous driving: the ineffective utilization of historical temporal information. Current systems either agg…

paper autonomous-driving end-to-end prediction +2
BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
2025 CVPR 14

**[Read on arXiv](https://arxiv.org/abs/2502.19694)** BEVDiffuser addresses a fundamental but under-explored problem in BEV-based perception: the inherent noise in BEV feature maps caused by sensor limitations and the l…

paper autonomous-driving perception bev +2
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving
2025 arXiv 110

📄 **[Read on arXiv](https://arxiv.org/abs/2506.13757)** AutoVLA presents a unified approach to autonomous driving that integrates vision, language understanding, and action generation within a single autoregressive mode…

autonomous-driving vla reinforcement-learning end-to-end +1
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
2025 arXiv 75

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang, arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2503.07608)** AlphaDrive is the first application of GRPO (Group Relative Policy Optimization) reinforc…

paper autonomous-driving vla vlm +3
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
2025 arXiv 42

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Marco Pavone + 37 co-authors (NVIDIA), arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2511.00088)** Alpamayo-R1 is NVIDIA's production-grade Vision-Language-Action (…

paper autonomous-driving vla vlm +3
YOLOv10: Real-Time End-to-End Object Detection
2024 NeurIPS 5988

📄 **[Read on arXiv](https://arxiv.org/abs/2405.14458)** Real-time object detection is critical infrastructure for autonomous driving, robotics, and augmented reality, yet the dominant YOLO family has long relied on non-…

computer-vision object-detection cnn end-to-end +2
VLP: Vision Language Planning for Autonomous Driving
2024 CVPR 155

📄 **[Read on arXiv](https://arxiv.org/abs/2401.05577)** VLP (Vision Language Planning) by Pan et al. (CVPR 2024) represents a fundamentally different approach to using language in autonomous driving compared to instruct…

paper autonomous-driving vla vlm +2
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
2024 NeurIPS

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17398)** Vista (NeurIPS 2024) is a generalizable driving world model that achieves high-fidelity video prediction at 10 Hz and 576x1024 resolution with versatile multi-moda…

autonomous-driving world-model diffusion video-prediction +3
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
2024 ICML 2025 (Spotlight) 139

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14803)** Video Prediction Policy (VPP) by Hu, Guo et al. (ICML 2025 Spotlight) proposes that video diffusion models (VDMs) are not just generators of future…

paper robotics video-prediction foundation-model +1
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
2024 arXiv 140

📄 **[Read on arXiv](https://arxiv.org/abs/2402.13243)** VADv2 by Chen et al. (2024) is the successor to VAD, addressing a fundamental limitation of deterministic planners in autonomous driving: they output a single traj…

autonomous-driving end-to-end planning vectorized-representation +2
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2024 ICLR 150

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

robotics transformer imitation-learning multimodal +3
UniSim: Learning Interactive Real-World Simulators
2024 ICLR (Oral) 200

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…

world-model diffusion simulation robotics +4
Talk2Drive: Towards Personalized Autonomous Driving with Large Language Models
2024 IEEE ITSC 80

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09397)** Talk2Drive introduces an LLM-based framework for personalized autonomous driving through natural language interaction, demonstrated in real-world field experiments…

autonomous-driving llm planning nlp +2
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
2024 CVPR 50

📄 **[Read on arXiv](https://arxiv.org/abs/2404.09502)** Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- in…

autonomous-driving perception 3d-occupancy computer-vision +2
SparseOcc: Fully Sparse 3D Occupancy Prediction
2024 ECCV 80

📄 **[Read on arXiv](https://arxiv.org/abs/2312.17118)** 3D occupancy prediction has become a critical perception paradigm for autonomous driving, but existing methods process dense 3D volumes even though over 90% of vox…

autonomous-driving perception 3d-occupancy sparse-representation +3
SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
2024 ICRA 2025 181

📄 **[Read on arXiv](https://arxiv.org/abs/2405.19620)** SparseDrive by Sun et al. (ICRA 2025) proposes a paradigm shift from dense BEV-based end-to-end driving to fully sparse scene representations. The c…

paper autonomous-driving end-to-end sparse-representation +1
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
2024 arXiv 102

📄 **[Read on arXiv](https://arxiv.org/abs/2410.22313)** Two dominant paradigms exist in autonomous driving: large vision-language models (LVLMs) with strong reasoning but poor trajectory precision, and end-to-end (E2E)…

paper autonomous-driving vla vlm +3
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
2024 CVPR 60

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12754)** SelfOcc (Huang et al., Tsinghua University, CVPR 2024) introduces the first self-supervised framework for vision-based 3D occupancy prediction that works with mult…

autonomous-driving perception 3d-occupancy self-supervised +3
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
2024 CoRL Oral 100

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…

robotics transformer cross-embodiment imitation-learning +2
SAM 2: Segment Anything in Images and Videos
2024 arXiv (ECCV 2024 submission) 3925

📄 **[Read on arXiv](https://arxiv.org/abs/2408.00714)** SAM 2 extends the Segment Anything Model (SAM) from static image segmentation to unified promptable visual segmentation across both images and videos. Published by…

computer-vision segmentation foundation-model transformer +2
RT-H: Action Hierarchies Using Language
2024 RSS

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

robotics vla transformer imitation-learning +2
RoboVLMs: What Matters in Building Vision-Language-Action Models
2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

robotics vla transformer multimodal +2
Robotic Control via Embodied Chain-of-Thought Reasoning
2024 arXiv

[Read on arXiv](https://arxiv.org/abs/2407.08693) ECoT (UC Berkeley / Stanford / University of Warsaw, 2024) introduces Embodied Chain-of-Thought reasoning for Vision-Language-Action (VLA) models, demonstrating that gen…

paper robotics vla chain-of-thought +1
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
2024 ICLR 100

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

robotics vla imitation-learning multimodal +2
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
2024 CVPR 2025 15

📄 **[Read on arXiv](https://arxiv.org/abs/2412.12725)** RaCFormer by Chu et al. (USTC, CVPR 2025) addresses a fundamental problem in radar-camera fusion for 3D object detection: the image-to-BEV transform…

paper autonomous-driving perception radar +2
pi0: A Vision-Language-Action Flow Model for General Robot Control
2024 RSS 2025 1381

[Read on arXiv](https://arxiv.org/abs/2410.24164) pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…

paper robotics vla foundation-model +1
PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving
2024 CVPR

[Read on CVF Open Access](https://openaccess.thecvf.com/content/CVPR2024/html/Weng_PARA-Drive_Parallelized_Architecture_for_Real-time_Autonomous_Driving_CVPR_2024_paper.html) PARA-Drive (NVIDIA Research / USC / Stanford…

paper autonomous-driving end-to-end real-time +1
OpenVLA: An Open-Source Vision-Language-Action Model
2024 CoRL 1883

📄 **[Read on arXiv](https://arxiv.org/abs/2406.09246)** OpenVLA is a 7-billion parameter open-source vision-language-action model that demonstrates generalist robotic manipulation by fine-tuning a pretrained vision-lang…

paper robotics vla open-source
Octo: An Open-Source Generalist Robot Policy
2024 RSS 400

📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…

robotics transformer foundation-model open-source +3
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
2024 ECCV 198

📄 **[Read on arXiv](https://arxiv.org/abs/2311.16038)** OccWorld introduces a generative world model that operates in 3D semantic occupancy space, jointly forecasting future scene evolution and ego vehicle trajectories.…

autonomous-driving world-model 3d-occupancy planning
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
2024 ECCV 50

📄 **[Read on arXiv](https://arxiv.org/abs/2404.15014)** OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc,…

autonomous-driving perception 3d-occupancy diffusion +3
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
2024 NeurIPS 100

📄 **[Read on arXiv](https://arxiv.org/abs/2406.15349)** Autonomous vehicle evaluation has long been split between two unsatisfying extremes: open-loop metrics that replay logged trajectories and compare p…

autonomous-driving benchmark simulation evaluation +2
Mixtral of Experts
2024 arXiv 3089

📄 **[Read on arXiv](https://arxiv.org/abs/2401.04088)** Mixtral 8x7B, developed by Mistral AI, introduces a Sparse Mixture-of-Experts (SMoE) language model that achieves the quality of much larger dense models at a frac…

nlp language-modeling transformer mixture-of-experts +2
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
2024 COLM 9619

📄 **[Read on arXiv](https://arxiv.org/abs/2312.00752)** Transformers have dominated sequence modeling since 2017, but their quadratic-complexity self-attention mechanism creates a fundamental bottleneck for long sequenc…

nlp language-modeling state-space sequence-modeling +1
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
2024 CVPR 294

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07488)** LMDrive is the first system to demonstrate and benchmark LLM-based driving in closed-loop simulation, introducing the LangAuto benchmark with ~64K instruction-foll…

paper autonomous-driving llm e2e +2
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
2024 ICML Spotlight 200

📄 **[Read on arXiv](https://arxiv.org/abs/2402.01817)** This paper by Subbarao Kambhampati and colleagues at Arizona State University addresses one of the most important questions in modern AI: can large language models…

nlp planning reasoning llm +2
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
2024 CoRL

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

robotics vla multimodal imitation-learning +3
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
2024 CVPR

[Read on arXiv](https://arxiv.org/abs/2312.03031) This paper (CVPR 2024, NVIDIA / Nanjing University) delivers a "wake-up call" to the autonomous driving research community by demonstrating that simple baselines using o…

paper autonomous-driving evaluation benchmark +1
Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
2024 CVPR Autonomous Grand Challenge (1st place) 50

📄 **[Read on arXiv](https://arxiv.org/abs/2406.06978)** Hydra-MDP addresses a fundamental limitation of imitation learning for autonomous driving: standard behavior cloning learns only to mimic human demo…

autonomous-driving end-to-end planning knowledge-distillation +2
HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
2024 NeurIPS 134

📄 **[Read on arXiv](https://arxiv.org/abs/2409.20537)** HPT tackles the fundamental challenge of building generalist robot representations that work across heterogeneous embodiments with different sensor configurations,…

robotics foundation-model cross-embodiment proprioception
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

robotics vla transformer foundation-model +4
GenAD: Generative End-to-End Autonomous Driving
2024 ECCV

[Read on arXiv](https://arxiv.org/abs/2402.11502) GenAD (ECCV 2024) reframes end-to-end autonomous driving as a generative modeling problem, simultaneously generating future trajectories for all traffic participants rat…

paper autonomous-driving end-to-end generative +1
GenAD: Generalized Predictive Model for Autonomous Driving
2024 CVPR Highlight

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09630)** > **Note:** This is the CVPR 2024 Highlight paper on large-scale video prediction for driving, NOT the ECCV 2024 paper wiki/sources/papers/genad-generative-end-to-…

autonomous-driving video-prediction diffusion foundation-model +2
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
2024 arXiv 41

**[Read on arXiv](https://arxiv.org/abs/2412.13193)** GaussTR is a Gaussian-based Transformer framework that achieves zero-shot semantic occupancy prediction without any 3D annotations. The key idea is to combine sparse…

paper autonomous-driving perception 3d-occupancy +2
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
2024 arXiv 59

📄 **[Read on arXiv](https://arxiv.org/abs/2412.10373)** GaussianWorld introduces a world model paradigm for 3D occupancy prediction that explicitly models scene evolution over time, rather than treating frames as indepe…

autonomous-driving world-model 3d-occupancy gaussian-splatting +1
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
2024 arXiv 47

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11447)** GaussianOcc by Gan et al. (University of Tokyo / RIKEN / South China University of Technology / SIAT-CAS) is a systematic method that applies Gaussi…

paper autonomous-driving perception 3d-occupancy +2
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
2024 arXiv 57

**[Read on arXiv](https://arxiv.org/abs/2412.04384)** GaussianFormer-2 addresses 3D semantic occupancy prediction for vision-centric autonomous driving by rethinking how 3D Gaussians represent occupied space. The origin…

paper autonomous-driving perception 3d-occupancy +1
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
2024 ECCV 128

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17429)** GaussianFormer introduces a fundamentally different scene representation for 3D semantic occupancy prediction: instead of dense voxel grids, scenes are modeled as…

autonomous-driving perception 3d-occupancy gaussian-representation
GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation
2024 arXiv 20

📄 **[Read on arXiv](https://arxiv.org/abs/2407.14108)** Bird's-eye view (BEV) semantic segmentation from multi-camera images is a core perception task in autonomous driving, but existing image-to-BEV transformation meth…

autonomous-driving perception bev gaussian-splatting +2
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
2024 ICRA

[Read on arXiv](https://arxiv.org/abs/2310.01957) Driving with LLMs (Wayve, ICRA 2024) is one of the first concrete demonstrations of using a large language model as the decision-making "brain" for autonomous driving. T…

paper autonomous-driving language-model explainability +1
DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Driving Scenes
2024 CVPR 398

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07920)** DrivingGaussian addresses photorealistic 3D scene reconstruction for dynamic autonomous driving environments using Gaussian splatting. The core challenge is that d…

autonomous-driving 3d-reconstruction gaussian-splatting simulation
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
2024 arXiv 416

📄 **[Read on arXiv](https://arxiv.org/abs/2402.12289)** DriveVLM proposes a hierarchical approach to integrating Vision-Language Models into autonomous driving, emphasizing scene understanding and multi-level planning r…

paper autonomous-driving vlm planning
DriveLM: Driving with Graph Visual Question Answering
2024 ECCV 448

📄 **[Read on arXiv](https://arxiv.org/abs/2312.14150)** DriveLM formalizes driving reasoning as Graph Visual Question Answering (GVQA), where QA pairs are connected via logical dependencies forming a reasoning graph tha…

paper autonomous-driving vlm reasoning +2
DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
2024 IEEE RA-L 576

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao, IEEE Robotics and Automation Letters, 2024. 📄 **[Read on arXiv](https://arxiv.org/abs/2310.01412)** DriveGPT4 applie…

paper autonomous-driving vla vlm +2
DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving
2024 ECCV

[Read on arXiv](https://arxiv.org/abs/2309.09777) DriveDreamer (ECCV 2024) is the first world model built entirely from real-world driving data, addressing fundamental limitations of prior approaches that relied on simu…

paper autonomous-driving world-model generation +1
Drive-OccWorld: Driving in the Occupancy World
2024 AAAI 2025 49

📄 **[Read on arXiv](https://arxiv.org/abs/2408.14197)** Drive-OccWorld introduces a vision-centric 4D occupancy forecasting world model that directly integrates with end-to-end planning. The core premise is that current…

paper autonomous-driving world-model 3d-occupancy +3
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
2024 WACV 2025 Oral 30

📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…

autonomous-driving vla multimodal dataset +3
BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
2024 CVPR 80

📄 [arXiv:2312.01696](https://arxiv.org/abs/2312.01696) BEVNeXt revives dense BEV (bird's-eye-view) frameworks for camera-based 3D object detection, demonstrating that with the right design choices, dense approaches can…

autonomous-driving perception bev transformer +2
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
2024 arXiv 110

📄 **[Read on arXiv](https://arxiv.org/abs/2401.12963)** AutoRT addresses the critical data scarcity problem in robotics by using foundation models not as end-effectors but as intelligent orchestrators of large-scale rob…

robotics foundation-model orchestration data-collection
AsyncDriver: Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
2024 ECCV 41

📄 **[Read on arXiv](https://arxiv.org/abs/2406.14556)** AsyncDriver addresses the practical deployment problem of LLM-enhanced driving planners: LLMs are too slow for frame-by-frame planning. The key insight is that hig…

autonomous-driving language-model planning asynchronous
Agent-Driver: A Language Agent for Autonomous Driving
2024 COLM 140

📄 **[Read on arXiv](https://arxiv.org/abs/2311.10813)** Agent-Driver reframes autonomous driving as a cognitive agent problem, positioning a large language model as the central reasoning and planning engine rather than…

paper autonomous-driving llm planning +3
3D-VLA: A 3D Vision-Language-Action Generative World Model
2024 ICML 140

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

robotics vla multimodal 3d-perception +3
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
2023 CoRL 450

📄 **[Read on arXiv](https://arxiv.org/abs/2307.05973)** VoxPoser addresses a fundamental bottleneck in robot manipulation: translating open-ended natural language instructions into precise physical actions without requi…

robotics manipulation language-modeling multimodal +2
Visual Instruction Tuning (LLaVA)
2023 NeurIPS 13533

📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…

multimodal vision-language-model instruction-tuning transformer +3
VAD: Vectorized Scene Representation for Efficient Autonomous Driving
2023 ICCV 567

📄 **[Read on arXiv](https://arxiv.org/abs/2303.12077)** VAD (Vectorized Scene Representation for Efficient Autonomous Driving) by Jiang et al. (ICCV 2023) is a pivotal paper in the shift from dense rasterized scene repr…

paper autonomous-driving planning vectorized-representation
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
2023 NeurIPS 3561

📄 **[Read on arXiv](https://arxiv.org/abs/2305.10601)** Language models are typically used in a left-to-right token-generation mode, which limits their ability to explore alternative reasoning paths or backtrack from mi…

paper nlp reasoning language-modeling +4
Toolformer: Language Models Can Teach Themselves to Use Tools
2023 NeurIPS 3994

📄 **[Read on arXiv](https://arxiv.org/abs/2302.04761)** Large language models exhibit remarkable in-context learning abilities but paradoxically struggle with tasks that are trivial for simple external tools -- arithmet…

nlp language-modeling tool-use transformer +2
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
2023 CVPR 180

📄 **[Read on arXiv](https://arxiv.org/abs/2305.06242)** Think Twice (Jia et al., 2023) addresses a fundamental imbalance in end-to-end autonomous driving: while the community has invested heavily in sophisticated encode…

autonomous-driving end-to-end planning imitation-learning +2
SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
2023 ICCV

📄 **[Read on arXiv](https://arxiv.org/abs/2303.09551)** SurroundOcc addresses the problem of dense 3D semantic occupancy prediction from multi-camera images for autonomous driving. Unlike 3D object detection, which repr…

autonomous-driving perception occupancy 3d-reconstruction +3
Segment Anything
2023 ICCV 19692

📄 **[Read on arXiv](https://arxiv.org/abs/2304.02643)** Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmen…

computer-vision foundation-model segmentation transformer +2
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
2023 arXiv 2686

📄 **[Read on arXiv](https://arxiv.org/abs/2307.15818)** RT-2 is the defining paper for the modern Vision-Language-Action (VLA) paradigm. It demonstrates that large vision-language models (VLMs) pretrained on internet-sc…

paper robotics vla embodied
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
2023 TMLR

📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…

robotics transformer imitation-learning multimodal +2
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
2023 ECCV 107

📄 **[Read on arXiv](https://arxiv.org/abs/2312.03661)** Reason2Drive provides the largest reasoning chain dataset for driving (>600K video-text pairs from nuScenes, Waymo, and ONCE) and introduces an aggregated evaluati…

paper autonomous-driving vla reasoning +2
ReAct: Synergizing Reasoning and Acting in Language Models
2023 ICLR 8533

📄 **[Read on arXiv](https://arxiv.org/abs/2210.03629)** Large language models had demonstrated two powerful capabilities in isolation: chain-of-thought reasoning for multi-step problem solving, and action generation for…

paper nlp reasoning language-modeling +3
QLoRA: Efficient Finetuning of Quantized LLMs
2023 NeurIPS 5975

📄 **[Read on arXiv](https://arxiv.org/abs/2305.14314)** Full fine-tuning of large language models requires enormous GPU memory -- a 65B-parameter model in 16-bit precision needs over 780 GB of GPU memory for parameters…

nlp transformer language-modeling foundation-model +2
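The memory savings sketched above come from storing the frozen weights in a low-bit format. A minimal sketch of blockwise absmax quantization, a simplified stand-in for QLoRA's NF4 data type (not the actual implementation):

```python
def quantize_block(xs, bits=4):
    """Blockwise absmax quantization: one float scale per block plus
    low-bit signed integer codes (a simplified stand-in for NF4)."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit signed codes
    scale = max(abs(x) for x in xs) / qmax
    if scale == 0:
        scale = 1.0                      # all-zero block
    codes = [round(x / scale) for x in xs]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights; per-weight error is at most scale/2."""
    return [scale * c for c in codes]

scale, codes = quantize_block([0.1, -0.5, 0.33, 0.02])
approx = dequantize_block(scale, codes)
```

Each code fits in 4 bits, so a block of 64 weights costs 32 bytes of codes plus one scale, versus 128 bytes in 16-bit precision.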
Planning-oriented Autonomous Driving
2023 CVPR 1201

📄 **[Read on arXiv](https://arxiv.org/abs/2212.10156)** UniAD (Unified Autonomous Driving) is a planning-oriented end-to-end framework that unifies perception, prediction, and planning into a single differentiable netwo…

paper autonomous-driving uniad planning +1
PaLM-E: An Embodied Multimodal Language Model
2023 ICML 2491

📄 **[Read on arXiv](https://arxiv.org/abs/2303.03378)** PaLM-E is a 562-billion parameter embodied multimodal language model created by Google that injects continuous sensor observations (images, point clouds, robot sta…

paper robotics vlm embodied
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
2023 ICCV 280

📄 **[Read on arXiv](https://arxiv.org/abs/2304.05316)** Vision-based 3D semantic occupancy prediction aims to predict the semantic class and occupancy status of every voxel in a 3D volume surrounding the ego vehicle, us…

autonomous-driving perception transformer computer-vision +3
Mistral 7B
2023 arXiv 4052

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06825)** Mistral 7B (Jiang et al., Mistral AI, 2023) challenged the prevailing assumption that larger language models are always better by demonstrating that a carefully de…

nlp language-modeling transformer foundation-model +1
Llama 2: Open Foundation and Fine-Tuned Chat Models
2023 arXiv 22411

📄 **[Read on arXiv](https://arxiv.org/abs/2307.09288)** Llama 2 (Touvron et al., Meta AI, 2023) addresses the gap between open-source pretrained language models and polished, closed-source "product" LLMs like ChatGPT. W…

llm transformer foundation-model language-modeling +2
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
2023 arXiv 100

📄 **[Read on arXiv](https://arxiv.org/abs/2310.03026)** LanguageMPC addresses a fundamental limitation in autonomous driving: traditional planners (MPC, RL) struggle with complex scenarios that require high-level reason…

autonomous-driving llm planning nlp +3
GPT-Driver: Learning to Drive with GPT
2023 NeurIPS FMDM Workshop 396

📄 **[Read on arXiv](https://arxiv.org/abs/2310.01415)** GPT-Driver reformulates autonomous driving motion planning as a language modeling problem. Scene context (object positions, velocities, lane geometry) and ego vehi…

paper autonomous-driving vla llm +2
GPT-4 Technical Report
2023 arXiv 26297

📄 **[Read on arXiv](https://arxiv.org/abs/2303.08774)** GPT-4 is a large-scale multimodal Transformer model developed by OpenAI that accepts both image and text inputs and produces text outputs. It represents a major st…

nlp language-modeling transformer foundation-model +3
FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
2023 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12058)** Occupancy prediction has emerged as a powerful perception paradigm for autonomous driving, predicting per-voxel semantic labels in 3D space to handle arbitrary obj…

autonomous-driving perception 3d-occupancy bev +3
FB-BEV: BEV Representation from Forward-Backward View Transformations
2023 ICCV 150

📄 **[Read on arXiv](https://arxiv.org/abs/2308.02236)** FB-BEV addresses a fundamental tension in camera-based BEV perception for autonomous driving: **forward projection** methods (like Lift-Splat-Shoot) generate BEV f…

autonomous-driving perception bev transformer +1
DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
2023 arXiv 241

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09245)** DriveMLM proposes using a multimodal LLM as a plug-and-play behavioral planning module within existing autonomous driving stacks (Apollo, Autoware), rather than re…

paper autonomous-driving vla llm +2
DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
2023 ICCV

📄 **[Read on arXiv](https://arxiv.org/abs/2308.00398)** DriveAdapter (Jia et al., ICCV 2023) identifies and addresses a fundamental structural problem in end-to-end autonomous driving: the tight coupling between percept…

autonomous-driving end-to-end planning imitation-learning +2
Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
2023 arXiv

📄 **[Read on arXiv](https://arxiv.org/abs/2309.10228)** Drive as You Speak (DAYS) proposes a framework for enabling natural language interaction between human passengers and autonomous vehicles using large language mode…

paper autonomous-driving llm planning +3
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
2023 NeurIPS 8520

📄 **[Read on arXiv](https://arxiv.org/abs/2305.18290)** Aligning large language models (LLMs) with human preferences has traditionally required reinforcement learning from human feedback (RLHF), a complex multi-stage pi…

nlp reinforcement-learning language-modeling alignment +2
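The simplification DPO offers can be made concrete with its single-pair objective, -log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]). A minimal sketch, assuming sequence log-probabilities are computed elsewhere:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-ratio over the reference
    for the chosen (w) vs. rejected (l) response."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks when the policy favors the chosen response more strongly
# than the frozen reference model does.
good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
bad = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
```

No reward model and no sampling loop appear anywhere: the preference data is consumed by a plain classification-style loss.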
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
2023 CVPR

📄 **[Read on arXiv](https://arxiv.org/abs/2211.10439)** BEVFormer v2 addresses a critical bottleneck in camera-based 3D perception for autonomous driving: the inability to leverage powerful modern 2D image backbones (e.…

autonomous-driving perception bev transformer +2
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
2022 IEEE TPAMI 2023 600

📄 **[Read on arXiv](https://arxiv.org/abs/2205.15997)** TransFuser (Chitta et al., 2022) is a foundational paper for transformer-based sensor fusion in end-to-end autonomous driving. The key problem it addresses is how…

paper autonomous-driving e2e transformer +1
Training Language Models to Follow Instructions with Human Feedback
2022 NeurIPS 24355

📄 **[Read on arXiv](https://arxiv.org/abs/2203.02155)** Large language models like GPT-3 are trained on vast internet corpora to predict the next token, but this objective is fundamentally misaligned with the goal of fo…

nlp reinforcement-learning language-modeling alignment +2
Training Compute-Optimal Large Language Models
2022 arXiv 4116

📄 **[Read on arXiv](https://arxiv.org/abs/2203.15556)** The Chinchilla paper (Hoffmann et al., DeepMind, 2022) is one of the most consequential papers in the LLM era because it corrected the field's scaling intuition. K…

nlp language-modeling transformer foundation-model +1
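The corrected intuition can be sketched with the common approximations C ≈ 6·N·D (training FLOPs for N parameters on D tokens) and the Chinchilla rule-of-thumb of roughly 20 tokens per parameter:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal allocation under C = 6*N*D with D = 20*N:
    substituting gives C = 120 * N**2, so N and D both grow as sqrt(C)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
```

This is why the paper argues that models like GPT-3 were undertrained: at fixed compute, a smaller model on many more tokens reaches lower loss.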
Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)
2022 JMLR 2024 3987

📄 **[Read on arXiv](https://arxiv.org/abs/2210.11416)** Large language models exhibit strong few-shot capabilities, but their ability to follow instructions and generalize to unseen tasks remains limited without targete…

nlp transformer instruction-tuning chain-of-thought +4
RT-1: Robotics Transformer for Real-World Control at Scale
2022 arXiv 2019

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

paper robotics vla transformer
PaLM: Scaling Language Modeling with Pathways
2022 JMLR 9058

📄 **[Read on arXiv](https://arxiv.org/abs/2204.02311)** PaLM (Pathways Language Model) is a 540-billion parameter dense decoder-only Transformer language model trained by Google using the Pathways distributed training s…

transformer language-modeling scaling foundation-model +3
LoRA: Low-Rank Adaptation of Large Language Models
2022 ICLR 29175

📄 **[Read on arXiv](https://arxiv.org/abs/2106.09685)** As pretrained language models grow to hundreds of billions of parameters, full fine-tuning -- updating every weight for each downstream task -- becomes prohibitive…

nlp transformer language-modeling foundation-model +1
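LoRA's trick fits in a few lines: the frozen weight W receives an additive low-rank update (α/r)·B·A, with B initialized to zero so training starts exactly at the pretrained behavior. A dependency-free sketch with nested lists:

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_weight(W, A, B, alpha=16.0, r=2):
    """Effective weight W' = W + (alpha / r) * B @ A, where only the
    small factors A (r x d_in) and B (d_out x r) are trained."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
A = [[0.5, 0.5]]               # r = 1, d_in = 2
B = [[0.0], [0.0]]             # zero-init: no change before training
```

Only r·(d_in + d_out) parameters are trained per adapted matrix, and the update can be merged into W at inference time with no added latency.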
High-Resolution Image Synthesis with Latent Diffusion Models
2022 CVPR 31987

📄 **[Read on arXiv](https://arxiv.org/abs/2112.10752)** Latent Diffusion Models (LDMs), the architecture behind Stable Diffusion, address the prohibitive computational cost of applying diffusion models directly in pixel…

diffusion generative-models computer-vision image-generation +2
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
2022 arXiv 8653

📄 **[Read on arXiv](https://arxiv.org/abs/2204.06125)** DALL-E 2 (internally called unCLIP) introduces a hierarchical approach to text-conditional image generation that leverages CLIP's joint text-image embedding space…

computer-vision diffusion multimodal foundation-model +2
Flamingo: a Visual Language Model for Few-Shot Learning
2022 NeurIPS 7824

📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…

multimodal foundation-model computer-vision nlp +3
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
2022 NeurIPS 16871

📄 **[Read on arXiv](https://arxiv.org/abs/2201.11903)** Wei et al., arXiv 2201.11903, 2022 (NeurIPS 2022). - [Paper](https://arxiv.org/abs/2201.11903) Chain-of-thought (CoT) prompting demonstrates that including interme…

paper ilya-30 llm prompting +2
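The technique is purely a prompting change: few-shot exemplars whose answers spell out intermediate steps before the final result. The string below uses the paper's well-known tennis-ball exemplar:

```python
# A CoT few-shot prompt: the exemplar answer shows intermediate
# arithmetic, so the model is nudged to reason step by step.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\nA:"
)
```

Without the worked-out reasoning in the exemplar, the same model is far more likely to emit a wrong single-token answer on the second question.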
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
2022 ICML 8650

📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…

multimodal foundation-model computer-vision nlp +4
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
2022 ECCV 1826

📄 **[Read on arXiv](https://arxiv.org/abs/2203.17270)** Li, Wang, Li, Xie, Sima, Lu, Yu, Dai (Shanghai AI Lab / Nanjing University / HKU), ECCV, 2022. - [Paper](https://arxiv.org/abs/2203.17270) BEVFormer generates a un…

paper autonomous-driving perception bev +1
A Generalist Agent
2022 TMLR 1018

📄 **[Read on arXiv](https://arxiv.org/abs/2205.06175)** Reed et al., Transactions on Machine Learning Research (TMLR), 2022. - [Paper](https://arxiv.org/abs/2205.06175) Gato, developed by DeepMind, is a single transform…

paper robotics vla generalist-agent
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
2021 ICCV 44596

📄 **[Read on arXiv](https://arxiv.org/abs/2103.14030)** Vision Transformers (ViT) demonstrated that pure transformer architectures could match or exceed CNNs on image classification, but ViT's design introduced two fund…

computer-vision transformer image-classification object-detection +3
Prefix-Tuning: Optimizing Continuous Prompts for Generation
2021 ACL 6753

📄 **[Read on arXiv](https://arxiv.org/abs/2101.00190)** Large pretrained language models like GPT-2 and BART achieve strong performance on generation tasks, but full fine-tuning requires storing a separate copy of all m…

nlp transformer parameter-efficient language-modeling +1
On The Opportunities And Risks Of Foundation Models
2021 arXiv (Stanford HAI) 6057

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

foundation-model nlp computer-vision robotics +3
Learning Transferable Visual Models From Natural Language Supervision
2021 ICML 57987

📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…

computer-vision multimodal foundation-model transformer +3
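The training objective behind that joint embedding can be sketched directly: symmetric cross-entropy over a batch similarity matrix whose diagonal holds the matched image-text pairs. A dependency-free sketch (temperature scaling omitted):

```python
import math

def clip_loss(sim):
    """Symmetric InfoNCE on sim[i][j] = similarity(image_i, text_j):
    cross-entropy with the diagonal as the correct pairing, averaged
    over the image-to-text and text-to-image directions."""
    n = len(sim)

    def ce(rows):
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)
            logz = mx + math.log(sum(math.exp(s - mx) for s in row))
            total += logz - row[i]          # -log softmax(row)[i]
        return total / n

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (ce(sim) + ce(sim_t))
```

A sharply diagonal similarity matrix drives the loss toward zero; an uninformative one leaves it at log(batch size).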
Exploring Simple Siamese Representation Learning
2021 CVPR 6444

📄 **[Read on arXiv](https://arxiv.org/abs/2011.10566)** SimSiam (Simple Siamese) demonstrates that self-supervised visual representation learning can be dramatically simplified while maintaining competitive performance.…

computer-vision self-supervised-learning representation-learning siamese-networks +1
Emerging Properties in Self-Supervised Vision Transformers (DINO)
2021 ICCV 10798

📄 **[Read on arXiv](https://arxiv.org/abs/2104.14294)** DINO (self-DIstillation with NO labels) demonstrates that self-supervised learning with Vision Transformers produces features with remarkable emergent properties t…

computer-vision self-supervised-learning transformer vision-transformer +3
Diffusion Models Beat GANs on Image Synthesis
2021 NeurIPS 13548

📄 **[Read on arXiv](https://arxiv.org/abs/2105.05233)** This paper by Dhariwal and Nichol (OpenAI, 2021) demonstrates that diffusion models can surpass GANs on image synthesis for the first time, achieving state-of-the-…

computer-vision diffusion generative-models image-generation +1
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2021 ICLR 91128

📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR, 2021. - [Paper](https://arxiv.org/abs/2010.11929) The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…

ilya-30 vision-transformer computer-vision transformer +2
VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
2020 CVPR 1035

📄 **[Read on arXiv](https://arxiv.org/abs/2005.04259)** VectorNet (Gao et al., Waymo/Google, CVPR 2020) is a foundational paper that moved motion prediction and map encoding away from rasterized image-based representati…

paper autonomous-driving prediction vectorized-representation
Scaling Laws for Neural Language Models
2020 arXiv 7436

📄 **[Read on arXiv](https://arxiv.org/abs/2001.08361)** This is the canonical early scaling-law paper for language models, authored by Kaplan et al. at OpenAI. It demonstrated that neural language model cross-entropy lo…

paper ilya-30 llm scaling +1
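The headline result is a smooth power law in each resource; for the parameter-limited regime the paper fits L(N) = (N_c / N)^α_N with α_N ≈ 0.076 and N_c ≈ 8.8e13 (constants as reported for their setup):

```python
def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Kaplan-style power law L(N) = (N_c / N) ** alpha_N: cross-entropy
    loss falls smoothly and predictably as parameter count grows."""
    return (n_c / n_params) ** alpha_n

# Each 10x in parameters shaves the same constant factor off the loss.
ratio = loss_vs_params(1e9) / loss_vs_params(1e10)
```

The practical payoff is extrapolation: small training runs pin down the exponent, which then predicts the loss of much larger runs.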
nuScenes: A Multimodal Dataset for Autonomous Driving
2020 CVPR 7791

📄 **[Read on arXiv](https://arxiv.org/abs/1903.11027)** nuScenes is a large-scale multimodal dataset for autonomous driving that provides synchronized data from 6 cameras (360-degree coverage), 1 LiDAR, 5 radars, GPS, a…

paper autonomous-driving benchmark dataset
Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
2020 ECCV 1510

📄 **[Read on arXiv](https://arxiv.org/abs/2008.05711)** Lift, Splat, Shoot (LSS) introduced a differentiable pipeline for transforming multi-camera images into a unified bird's-eye view (BEV) representation without requ…

paper autonomous-driving perception bev
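The "lift" step can be sketched per pixel: the network predicts a categorical distribution over candidate depths, and the pixel's feature is placed at every depth weighted by that distribution (an outer product), yielding a frustum of 3D features that is later "splatted" onto the BEV grid. A minimal sketch of just that step:

```python
def lift_pixel(feature, depth_probs):
    """LSS 'lift' for one pixel: outer product of the feature vector with
    its predicted depth distribution, one weighted copy per depth bin."""
    return [[p * f for f in feature] for p in depth_probs]

frustum = lift_pixel(feature=[1.0, 2.0], depth_probs=[0.25, 0.75])
```

Because the depth weights sum to one, the feature's total mass is preserved; the distribution only decides where along the ray it lands.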
Learning Lane Graph Representations for Motion Forecasting
2020 ECCV 750

📄 **[Read on arXiv](https://arxiv.org/abs/2007.13732)** LaneGCN introduces a graph neural network architecture for motion forecasting in autonomous driving that operates directly on the lane graph structure of HD maps.…

paper autonomous-driving prediction lanegcn
Language Models are Few-Shot Learners
2020 NeurIPS 56138

📄 **[Read on arXiv](https://arxiv.org/abs/2005.14165)** GPT-3 is a 175 billion parameter autoregressive language model that demonstrated a remarkable emergent capability: in-context learning, where the model performs ne…

paper llm in-context-learning foundation
Denoising Diffusion Probabilistic Models
2020 NeurIPS 28939

📄 **[Read on arXiv](https://arxiv.org/abs/2006.11239)** Ho, Jain, and Abbeel, NeurIPS, 2020. - [Paper](https://arxiv.org/abs/2006.11239) Denoising Diffusion Probabilistic Models (DDPM) demonstrates that high-quality ima…

paper ilya-30 generative-models diffusion +1
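A key identity in the paper is the closed-form forward process: x_t can be sampled directly from x_0 as sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t = ∏_s (1−β_s). A minimal sketch of that sampling step:

```python
import math
import random

def q_sample(x0, t, betas, rng=random):
    """Closed-form forward diffusion: x_t ~ N(sqrt(a_bar_t) * x0,
    (1 - a_bar_t) * I), with a_bar_t the cumulative product of (1 - beta)."""
    a_bar = 1.0
    for s in range(t + 1):
        a_bar *= 1.0 - betas[s]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for x in x0]
```

This is what makes training cheap: any timestep can be reached in one jump, so the denoiser sees random (x_t, t, ε) triples without simulating the whole chain.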
Talk2Car: Taking Control of Your Self-Driving Car
2019 EMNLP-IJCNLP 182

📄 **[Read on arXiv](https://arxiv.org/abs/1909.10838)** For autonomous vehicles to be truly useful as personal transportation, passengers should be able to issue natural-language commands like "park behind that blue car…

paper autonomous-driving vla grounding +2
Learning by Cheating
2019 CoRL 632

📄 **[Read on arXiv](https://arxiv.org/abs/1912.12294)** Learning by Cheating introduces a two-stage training paradigm for end-to-end autonomous driving that has become one of the most influential design patterns in the…

paper autonomous-driving imitation-learning privileged-supervision
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
2019 NeurIPS 2100

📄 **[Read on arXiv](https://arxiv.org/abs/1811.06965)** GPipe introduces micro-batch pipeline parallelism as a practical method for training neural networks too large to fit on a single accelerator. The core idea is to…

paper ilya-30 distributed-training pipeline-parallelism +2
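The efficiency argument can be made quantitative: with K pipeline stages and M micro-batches per mini-batch, the fill-and-drain "bubble" occupies a (K−1)/(M+K−1) fraction of the schedule, so splitting the batch finer amortizes the idle time:

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of the GPipe schedule: (K - 1) / (M + K - 1)
    for K pipeline stages and M micro-batches."""
    return (stages - 1) / (microbatches + stages - 1)

# One micro-batch leaves a 4-stage pipeline idle 75% of the time;
# 32 micro-batches cut that to under 10%.
```

This is why the paper recommends M substantially larger than K, with activation recomputation keeping the extra micro-batch memory in check.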
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
2019 RSS 844

📄 **[Read on arXiv](https://arxiv.org/abs/1812.03079)** Bansal, Krizhevsky, Ogale (Waymo Research), RSS, 2019. - [Paper](https://arxiv.org/abs/1812.03079) ChauffeurNet is Waymo's mid-level imitation learning system that…

paper autonomous-driving imitation-learning planning
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019 NAACL 112487

📄 **[Read on arXiv](https://arxiv.org/abs/1810.04805)** Devlin, Chang, Lee, Toutanova (Google AI Language), NAACL, 2019. - [Paper](https://aclanthology.org/N19-1423/) - [arXiv](https://arxiv.org/abs/1810.04805) BERT (Bi…

paper llm transformer foundation
Textual Explanations for Self-Driving Vehicles
2018 ECCV 427

📄 **[Read on arXiv](https://arxiv.org/abs/1807.11546)** End-to-end driving models produce control signals without any rationale, making them opaque and untrustworthy for safety-critical deployment. This paper by Kim et…

paper autonomous-driving vla explainability +2
Relational Recurrent Neural Networks
2018 NeurIPS 220

📄 **[Read on arXiv](https://arxiv.org/abs/1806.01822)** Traditional RNNs (LSTMs, GRUs) compress all sequential information into a single fixed-size hidden vector, which fundamentally limits their ability to store and re…

paper ilya-30 recurrent-neural-networks relational-reasoning +1
End-to-end Driving via Conditional Imitation Learning
2018 ICRA 1227

📄 **[Read on arXiv](https://arxiv.org/abs/1710.02410)** This paper introduces conditional imitation learning for end-to-end autonomous driving, where a neural network policy is conditioned on a discrete high-level comma…

paper autonomous-driving imitation-learning e2e +1
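The conditioning mechanism can be sketched as command-selected branching: a shared perception backbone feeds one output head per high-level command, and the navigation command routes features to the matching head. A toy sketch in which the lambdas stand in for learned sub-networks:

```python
def conditional_policy(features, command, branch_heads):
    """The high-level command acts as a switch that routes shared
    perception features to a command-specific control head."""
    return branch_heads[command](features)

# Hypothetical stand-in heads; the real branches are trained networks.
heads = {
    "left": lambda f: {"steer": -0.5, "throttle": f["speed_gap"]},
    "straight": lambda f: {"steer": 0.0, "throttle": f["speed_gap"]},
}
out = conditional_policy({"speed_gap": 0.3}, "left", heads)
```

Branching resolves the ambiguity of intersections: the same camera view maps to different controls depending on where the route planner wants to go.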
Neural Message Passing for Quantum Chemistry
2017 ICML 8754

📄 **[Read on arXiv](https://arxiv.org/abs/1704.01212)** This paper provided the conceptual unification that the graph neural network field needed. By showing that seemingly different architectures -- GCN, GraphSAGE, Gat…

paper ilya-30 graph-neural-networks molecular-property-prediction +1
Kolmogorov Complexity and Algorithmic Randomness
2017 AMS Mathematical Surveys and Monographs 106

📄 **[AMS Book Page](https://bookstore.ams.org/surv-220)** This monograph by Shen, Uspensky, and Vereshchagin is the definitive modern reference on algorithmic information theory. The central concept is Kolmogorov comple…

paper ilya-30 information-theory kolmogorov-complexity +2
CARLA: An Open Urban Driving Simulator
2017 CoRL 6490

Dosovitskiy, Ros, Codevilla, Lopez, Koltun (Intel Labs / Toyota Research Institute / CVC Barcelona), CoRL, 2017. 📄 **[Read on arXiv](https://arxiv.org/abs/1711.03938)** CARLA (Car Learning to Act) is an open-source simu…

paper autonomous-driving benchmark simulator
Attention Is All You Need
2017 NeurIPS 171783

📄 **[Read on arXiv](https://arxiv.org/abs/1706.03762)** Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS, 2017. - [Paper](https://arxiv.org/abs/1706.03762) - [The Annotated Transformer](htt…

paper ilya-30 llm transformer +3
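The paper's core operation is scaled dot-product attention, softmax(QKᵀ/√d_k)·V. A dependency-free sketch, one query row at a time:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: each query's similarity to every key
    (scaled by sqrt(d_k)) becomes softmax mixing weights over V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# A query aligned with the first key retrieves the first value row.
out = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Every output position attends over every input in parallel, which is what lets the architecture drop recurrence entirely.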
A Simple Neural Network Module for Relational Reasoning
2017 NeurIPS 1679

Santoro, Raposo, Barrett, Malinowski, Pascanu, Battaglia, Lillicrap (DeepMind), NeurIPS, 2017. 📄 **[Read on arXiv](https://arxiv.org/abs/1706.01427)** Relation Networks (RNs) are a simple neural network module for relat…

paper ilya-30 relational-reasoning visual-question-answering +1
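The module's form is RN(O) = f(Σ_{i,j} g(o_i, o_j)): score every ordered pair of objects with a shared function g, sum, then apply a readout f. A minimal sketch with pluggable g and f (scalar-valued here for brevity; the paper uses small MLPs):

```python
def relation_network(objects, g, f):
    """RN(O) = f( sum over all ordered pairs (i, j) of g(o_i, o_j) ).
    Summing over pairs makes the module invariant to object order."""
    total = 0.0
    for oi in objects:
        for oj in objects:
            total += g(oi, oj)
    return f(total)

# Toy g and f: pairwise products summed, then halved.
score = relation_network([1.0, 2.0, 3.0],
                         g=lambda a, b: a * b,
                         f=lambda t: t / 2.0)
```

Because g is shared across all pairs, the parameter count is independent of the number of objects.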
Variational Lossy Autoencoder
2016 ICLR 2017 700

📄 **[Read on arXiv](https://arxiv.org/abs/1611.02731)** The Variational Lossy Autoencoder (VLAE) by Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Sutskever, and Abbeel (2016) addresses the fundamental tension in VAE…

paper ilya-30 generative-models variational-autoencoders +1
Order Matters: Sequence to Sequence for Sets
2016 ICLR 1018

📄 **[Read on arXiv](https://arxiv.org/abs/1511.06391)** This paper by Samy Bengio, Oriol Vinyals, and Manjunath Kudlur challenges a core assumption in sequence modeling: that the order of input and output data is merely…

paper ilya-30 sequence-to-sequence set-modeling +2
Identity Mappings in Deep Residual Networks
2016 ECCV 2016 11060

📄 **[Read on arXiv](https://arxiv.org/abs/1603.05027)** This paper, a follow-up to the original ResNet work, provides both theoretical analysis and empirical evidence that the arrangement of operations within residual b…

paper ilya-30 residual-networks computer-vision +1
End to End Learning for Self-Driving Cars
2016 arXiv 4537

📄 **[Read on arXiv](https://arxiv.org/abs/1604.07316)** This paper from NVIDIA, commonly known as "DAVE-2" or the "NVIDIA end-to-end driving paper," demonstrates that a single convolutional neural network can learn to m…

paper autonomous-driving e2e
Understanding LSTM Networks
2015 Blog Post (colah.github.io)

📄 **[Read Blog Post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)** Christopher Olah's 2015 blog post is a widely used pedagogical reference for understanding LSTM internals. The post explains why vanilla…

paper ilya-30 lstm rnn +2
The Unreasonable Effectiveness of Recurrent Neural Networks
2015 Blog Post (karpathy.github.io)

📄 **[Read Blog Post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)** Andrej Karpathy's 2015 blog post offers a vivid qualitative demonstration that character-level recurrent neural networks with LSTM cells c…

paper ilya-30 rnn lstm +2
Pointer Networks
2015 NeurIPS 3380

📄 **[Read on arXiv](https://arxiv.org/abs/1506.03134)** Pointer Networks repurpose the attention mechanism as an output distribution, replacing the fixed output vocabulary of sequence-to-sequence models with attention w…

paper ilya-30 attention sequence-to-sequence +2
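The core trick, sketched minimally in numpy (dot-product scores stand in for the paper's additive attention; dimensions are illustrative): the attention weights over input positions are themselves the output distribution, so the output space grows with the input length.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pointer_distribution(query, encoder_states):
    """Attention-as-output: a distribution over input positions.

    Instead of projecting onto a fixed output vocabulary, the decoder
    'points' at one of the encoder positions.
    """
    scores = encoder_states @ query   # one score per input position
    return softmax(scores)            # normalized pointer distribution

rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 4))  # 5 input positions, hidden dim 4
q = rng.standard_normal(4)         # decoder query vector
p = pointer_distribution(q, enc)
# p has one entry per input position, not per vocabulary item.
```

This is why Pointer Networks handle outputs like convex hulls or tours, where each output token must be an index into the variable-length input.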
Multi-Scale Context Aggregation by Dilated Convolutions
2015 ICLR 2016 9295

📄 **[Read on arXiv](https://arxiv.org/abs/1511.07122)** This paper introduced dilated (atrous) convolutions as a principled alternative to the downsample-then-upsample paradigm for dense prediction tasks. By inserting g…

paper ilya-30 computer-vision semantic-segmentation +1
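The mechanism is easy to see in one dimension. A minimal numpy sketch (valid padding, no deep-learning framework): spacing the kernel taps `dilation` samples apart widens the receptive field without downsampling or adding parameters.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated ('atrous') convolution with valid padding."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of this layer
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=2)
# Each output sums x[i], x[i+2], x[i+4]:
# → [ 6.  9. 12. 15. 18. 21.]
```

Stacking layers with dilations 1, 2, 4, 8, … grows the receptive field exponentially with depth, which is the paper's context-aggregation module.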
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
2015 ICML 2016 3131

Amodei et al., ICML, 2016. 📄 **[Read on arXiv](https://arxiv.org/abs/1512.02595)** Deep Speech 2 is an end-to-end speech recognition system where a single RNN trained with CTC loss on spectrograms replaces the entire tr…

paper ilya-30 speech-recognition end-to-end-learning +1
Deep Residual Learning for Image Recognition
2015 CVPR 2016 224592

📄 **[Read on arXiv](https://arxiv.org/abs/1512.03385)** He, Zhang, Ren, Sun (Microsoft Research), CVPR, 2016. - [Paper](https://arxiv.org/abs/1512.03385) Deep Residual Learning introduces skip connections that add the i…

paper ilya-30 computer-vision residual-networks +1
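The skip connection can be sketched in a few lines of numpy (linear layers stand in for the paper's convolutions; shapes and initialization are illustrative): the block computes a residual function F(x) and adds the input back, so gradients always have a direct path through the identity shortcut.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Minimal residual block: y = relu(F(x) + x).

    F is two linear maps with a ReLU in between; the identity shortcut
    means the block only needs to learn the *residual* correction.
    """
    f = relu(x @ w1) @ w2  # residual function F(x)
    return relu(f + x)     # add the identity shortcut, then nonlinearity

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.01  # near-zero init
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)
# With near-zero weights F(x) ≈ 0, so the block starts close to relu(x):
# extra depth cannot hurt before training, which is the key to very deep nets.
```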
CS231n: Deep Learning for Computer Vision
2015 Stanford University Course

📄 **[Course Website](https://cs231n.stanford.edu/)** Li, Karpathy, and Johnson, Stanford University, 2015 (ongoing). - [Course](https://cs231n.stanford.edu/) CS231n is a widely used Stanford deep learning for computer v…

paper ilya-30 computer-vision convolutional-neural-networks +2
Recurrent Neural Network Regularization
2014 arXiv (1409.2329) 2986

📄 **[Read on arXiv](https://arxiv.org/abs/1409.2329)** This paper discovered that dropout can be successfully applied to LSTMs if it is restricted to non-recurrent (feedforward) connections only, preserving the LSTM's a…

paper ilya-30 rnn lstm +3
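The restriction is easy to show with a plain tanh RNN in numpy (a simplification of the paper's LSTM; dimensions and dropout rate are illustrative): dropout masks the feedforward input at each timestep, while the recurrent path from h_{t-1} to h_t is never masked, so the memory carried across timesteps stays intact.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p):
    """Inverted dropout: zero units with prob p, rescale survivors."""
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

def run_rnn(xs, W_x, W_h, p=0.5):
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        x_t = dropout(x_t, p)                # dropout on the input connection only
        h = np.tanh(x_t @ W_x + h @ W_h)     # recurrent connection left untouched
    return h

xs = rng.standard_normal((4, 3))     # 4 timesteps, input dim 3
W_x = rng.standard_normal((3, 6)) * 0.1
W_h = rng.standard_normal((6, 6)) * 0.1
h = run_rnn(xs, W_x, W_h)
```

Applying the same mask to h inside the loop would corrupt the hidden state anew at every step, which is exactly the failure mode this paper identified.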
Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
2014 arXiv 26

📄 **[Read on arXiv](https://arxiv.org/abs/1405.6903)** This paper bridges thermodynamics and computational complexity to formalize a deep intuition: mixing cream into coffee produces increasingly complex patterns (swirl…

paper ilya-30 complexity-theory information-theory +1
Neural Turing Machines
2014 arXiv (presented at NIPS 2014 workshop) 2505

📄 **[Read on arXiv](https://arxiv.org/abs/1410.5401)** Neural Turing Machines (NTMs) augment neural networks with a differentiable external memory matrix and soft attention-based read/write heads, enabling them to learn…

paper ilya-30 memory-augmented-networks attention +1
Neural Machine Translation by Jointly Learning to Align and Translate
2014 ICLR 2015 29150

📄 **[Read on arXiv](https://arxiv.org/abs/1409.0473)** This paper introduced the attention mechanism to deep learning, arguably the single most influential architectural innovation leading to modern transformers and LLM…

paper ilya-30 attention-mechanism machine-translation +1
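The additive attention step can be sketched in numpy (dimensions are illustrative): a score e_j = vᵀ tanh(W_s s_{t-1} + W_h h_j) is computed for every source position, softmax-normalized into alignment weights, and used to form a fresh context vector at each decoder step, instead of squeezing the whole source sentence into one fixed vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s_prev, h_enc, W_s, W_h, v):
    """Bahdanau-style additive attention for one decoder step."""
    e = np.tanh(s_prev @ W_s + h_enc @ W_h) @ v  # one score per source position
    alpha = softmax(e)                           # alignment weights
    context = alpha @ h_enc                      # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(0)
h_enc = rng.standard_normal((7, 4))  # 7 source positions, encoder dim 4
s_prev = rng.standard_normal(3)      # previous decoder state, dim 3
W_s = rng.standard_normal((3, 5))
W_h = rng.standard_normal((4, 5))
v = rng.standard_normal(5)
context, alpha = additive_attention(s_prev, h_enc, W_s, W_h, v)
# alpha is a distribution over source positions; context is their weighted sum.
```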
ImageNet Classification with Deep Convolutional Neural Networks
2012 NeurIPS 2012 127906

📄 **[Read Paper](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)** AlexNet, as this paper's architecture came to be known, is a deep convolutional neural network trained on GPUs th…

paper ilya-30 cnn computer-vision +3
The First Law of Complexodynamics
2011 Blog Post (Shtetl-Optimized)

📄 **[Read Blog Post](https://scottaaronson.blog/?p=762)** Scott Aaronson's blog post highlights an asymmetry between entropy and complexity as a way of thinking about structure formation in physical and computational sy…

paper ilya-30 complexity-theory information-theory +1
Machine Super Intelligence
2008 PhD Thesis, University of Lugano 63

📄 **[Read Thesis](https://www.vetta.org/documents/Machine_Super_Intelligence.pdf)** Shane Legg's 2008 PhD thesis provides perhaps the most rigorous mathematical definition of general intelligence, grounding informal int…

paper ilya-30 agi intelligence-measurement +1
A Tutorial Introduction to the Minimum Description Length Principle
2004 arXiv / MIT Press 381

📄 **[Read on arXiv](https://arxiv.org/abs/math/0406077)** Grünwald, arXiv math/0406077 / MIT Press, 2004. - [Paper](https://arxiv.org/abs/math/0406077) The Minimum Description Length (MDL) principle formalizes Occam's r…

paper ilya-30 information-theory model-selection +2
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
1993 COLT 1279

📄 **[Read Paper](https://www.cs.toronto.edu/~hinton/absps/colt93.pdf)** This paper by Hinton and van Camp bridges information theory and neural network generalization by proposing that model complexity should be measure…

paper ilya-30 regularization information-theory +3