
Tags

239 tags across the wiki

paper 114 autonomous-driving 92 foundation-model 55 transformer 53 vla 49 planning 42 robotics 41 computer-vision 36 perception 32 ilya-30 29 multimodal 29 nlp 26 end-to-end 24 language-modeling 24 llm 17 reasoning 17 imitation-learning 16 3d-occupancy 15 vlm 15 bev 14 diffusion 13 e2e 12 reinforcement-learning 12 world-model 12 chain-of-thought 10 benchmark 9 scaling 9 cross-embodiment 7 driving 6 gaussian-splatting 6 generative-models 6 image-classification 6 information-theory 6 questions 6 self-supervised 6 sources 6 alignment 5 attention 5 cnn 5 foundation 5 knowledge-distillation 5 language-model 5 prediction 5 simulation 5 evaluation 4 image-generation 4 instruction-tuning 4 mixture-of-experts 4 rnn 4 sequence-to-sequence 4 sparse-representation 4 video-prediction 4 explainability 3 flow-matching 3 lstm 3 map 3 occupancy 3 open-source 3 semantic-segmentation 3 sequence-modeling 3 trajectory-prediction 3 vectorized-representation 3 3d-detection 2 3d-perception 2 3d-reconstruction 2 action-representation 2 autonomy 2 autoregressive 2 bimanual 2 closed-loop 2 complexity-theory 2 dataset 2 deployment 2 distributed-training 2 efficient-inference 2 embodied 2 fine-tuning 2 foundation-models 2 foundational 2 gaussian-representation 2 generation 2 generative 2 human-interaction 2 humanoid 2 manipulation 2 memory-augmented-networks 2 ml 2 multi-camera 2 multilingual 2 object-detection 2 parameter-efficient-fine-tuning 2 prompting 2 real-time 2 regularization 2 relational-reasoning 2 residual-networks 2 rlhf 2 scaling-laws 2 segmentation 2 self-improvement 2 self-supervised-learning 2 state-space 2 systems 2 thermodynamics 2 vision-language-model 2 vision-transformer 2 visual-question-answering 2 zero-shot 2 3d 1 3d-scene 1 3d-semantic-occupancy 1 agenda 1 agentic 1 agi 1 algorithmic-information-theory 1 algorithmic-randomness 1 asynchronous 1 attention-mechanism 1 batch 1 bayesian-inference 1 behavior-forecasting 1 camera-fusion 1 classifier-guidance 1 
combinatorial-optimization 1 comparison 1 compression 1 computability 1 concept 1 contrastive-learning 1 control 1 convolutional-neural-networks 1 corpus 1 course 1 data-collection 1 decoupled 1 deep-learning 1 denoising 1 depth-estimation 1 dexterous-manipulation 1 differentiable-programming 1 diffusion-policy 1 diffusion-transformer 1 dilated-convolutions 1 dropout 1 efficient 1 embodied-ai 1 embodiment 1 emergent-abilities 1 end-to-end-learning 1 evaluation-metric 1 few-shot 1 few-shot-learning 1 foundations 1 frontend 1 gaussian 1 gaussian-rendering 1 generalist-agent 1 generalization 1 gpu-training 1 graph-neural-networks 1 grounding 1 grpo 1 hierarchical 1 high-frequency-control 1 hosting 1 ilya 1 image-captioning 1 image-text-retrieval 1 in-context-learning 1 inductive-bias 1 intelligence-measurement 1 interactive-annotation 1 interactive-segmentation 1 knowledge-preservation 1 kolmogorov-complexity 1 lanegcn 1 locomotion 1 machine-translation 1 mamba 1 mdl 1 message-passing 1 minimum-description-length 1 model-parallelism 1 model-predictive-control 1 model-selection 1 modular 1 molecular-property-prediction 1 multi-embodiment 1 multi-task 1 natural-language 1 neural-radiance-fields 1 neuro-symbolic 1 obsidian 1 open-world 1 optimization 1 orchestration 1 parallel-architecture 1 parameter-efficient 1 permutation-invariance 1 personalization 1 physical-ai 1 pipeline-parallelism 1 pointer-mechanism 1 privileged-supervision 1 probabilistic-planning 1 proprioception 1 quantization 1 queue 1 radar 1 recurrent-neural-networks 1 representation-learning 1 scene-understanding 1 search 1 seminal 1 sensor-fusion 1 set-modeling 1 siamese-networks 1 simulator 1 source 1 sparse-models 1 spatial-reasoning 1 speech-recognition 1 survey 1 synthesis 1 taxonomy 1 temporal 1 temporal-modeling 1 thesis 1 tokenization 1 tool-use 1 training 1 uniad 1 unified-stack 1 vanishing-gradients 1 variational-autoencoders 1 video-generation 1 video-understanding 1 visual-traces 1 vit 1

Pages tagged paper

A Generalist Agent
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2205.06175)** Reed et al., Transactions on Machine Learning Research (TMLR), 2022. - [Paper](https://arxiv.org/abs/2205.06175) Gato, developed by DeepMind, is a single transform…

A Simple Neural Network Module for Relational Reasoning
source-summary

Santoro, Raposo, Barrett, Malinowski, Pascanu, Battaglia, Lillicrap (DeepMind), NeurIPS, 2017. 📄 **[Read on arXiv](https://arxiv.org/abs/1706.01427)** Relation Networks (RNs) are a simple neural network module for relat…

A Tutorial Introduction to the Minimum Description Length Principle
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/math/0406077)** Grünwald, arXiv math/0406077 / MIT Press, 2004. - [Paper](https://arxiv.org/abs/math/0406077) The Minimum Description Length (MDL) principle formalizes Occam's r…
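The two-part code at the heart of MDL can be made concrete with a toy model-selection sketch. This is an illustration, not Grünwald's formalism: the bit counts below are hypothetical, and the data-code term assumes Gaussian residuals (up to a constant shared by all candidate models).

```python
import math

def total_description_length(model_bits: float, n_points: int,
                             residual_variance: float) -> float:
    """Two-part MDL code: bits to describe the hypothesis, plus bits to
    describe the data given the hypothesis (Gaussian residual coding,
    dropping constants common to all models)."""
    data_bits = 0.5 * n_points * math.log2(2 * math.pi * math.e * residual_variance)
    return model_bits + data_bits

# Toy comparison: the complex model fits slightly better (lower residual
# variance) but costs far more bits to state; MDL trades the two off.
simple = total_description_length(model_bits=32.0, n_points=100,
                                  residual_variance=1.0)
complex_ = total_description_length(model_bits=512.0, n_points=100,
                                    residual_variance=0.9)
best = "simple" if simple < complex_ else "complex"
```

Here Occam's razor falls out of the arithmetic: the marginal fit improvement does not pay for the extra model bits, so the simpler hypothesis wins.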

Agent-Driver: A Language Agent for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2311.10813)** Agent-Driver reframes autonomous driving as a cognitive agent problem, positioning a large language model as the central reasoning and planning engine rather than…

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
source-summary

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Marco Pavone + 37 co-authors (NVIDIA), arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2511.00088)** Alpamayo-R1 is NVIDIA's production-grade Vision-Language-Action (…

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
source-summary

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang, arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2503.07608)** AlphaDrive is the first application of GRPO (Group Relative Policy Optimization) reinforc…

Attention Is All You Need
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1706.03762)** Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS, 2017. - [Paper](https://arxiv.org/abs/1706.03762) - [The Annotated Transformer](htt…
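The core operation of the paper, scaled dot-product attention, fits in a few lines. A minimal single-query sketch in pure Python (no batching, no multi-head projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(query, keys, values):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, for one query."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]

# A query aligned with the second key attends almost exclusively to it,
# so the output is essentially the second value vector.
out = scaled_dot_product_attention(
    query=[0.0, 10.0],
    keys=[[10.0, 0.0], [0.0, 10.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

The 1/sqrt(d_k) scaling is the paper's fix for dot products growing with dimension and pushing the softmax into regions of tiny gradient.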

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1810.04805)** Devlin, Chang, Lee, Toutanova (Google AI Language), NAACL, 2019. - [Paper](https://aclanthology.org/N19-1423/) - [arXiv](https://arxiv.org/abs/1810.04805) BERT (Bi…

BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
source-summary

**[Read on arXiv](https://arxiv.org/abs/2502.19694)** BEVDiffuser addresses a fundamental but under-explored problem in BEV-based perception: the inherent noise in BEV feature maps caused by sensor limitations and the l…

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2203.17270)** Li, Wang, Li, Xie, Sima, Lu, Yu, Dai (Shanghai AI Lab / Nanjing University / HKU), ECCV, 2022. - [Paper](https://arxiv.org/abs/2203.17270) BEVFormer generates a un…

BridgeAD: Bridging Past and Future End-to-End Autonomous Driving with Historical Prediction
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14182)** BridgeAD tackles a critical limitation in end-to-end autonomous driving: the ineffective utilization of historical temporal information. Current systems either agg…

CARLA: An Open Urban Driving Simulator
source-summary

Dosovitskiy, Ros, Codevilla, Lopez, Koltun (Intel Labs / Toyota Research Institute / CVC Barcelona), CoRL, 2017. 📄 **[Read on arXiv](https://arxiv.org/abs/1711.03938)** CARLA (Car Learning to Act) is an open-source simu…

CarPlanner: Consistent Auto-regressive RL Planner for Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2502.19908) CarPlanner (Zhejiang University + Cainiao Network, CVPR 2025) introduces a consistent autoregressive reinforcement learning planner that is the first RL-based planner to…

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2201.11903)** Wei et al., arXiv 2201.11903, 2022 (NeurIPS 2022). - [Paper](https://arxiv.org/abs/2201.11903) Chain-of-thought (CoT) prompting demonstrates that including interme…

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1812.03079)** Bansal, Krizhevsky, Ogale (Waymo Research), RSS, 2019. - [Paper](https://arxiv.org/abs/1812.03079) ChauffeurNet is Waymo's mid-level imitation learning system that…

CS231n: Deep Learning for Computer Vision
source-summary

📄 **[Course Website](https://cs231n.stanford.edu/)** Li, Karpathy, and Johnson, Stanford University, 2015 (ongoing). - [Course](https://cs231n.stanford.edu/) CS231n is a widely used Stanford deep learning for computer v…

Deep Residual Learning for Image Recognition
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1512.03385)** He, Zhang, Ren, Sun (Microsoft Research), CVPR, 2016. - [Paper](https://arxiv.org/abs/1512.03385) Deep Residual Learning introduces skip connections that add the i…
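The skip-connection idea is small enough to sketch directly. A toy residual block (pure Python, with a stand-in `transform` playing the role of the learned layers F):

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def residual_block(x, transform):
    """y = F(x) + x: the identity shortcut adds the input back, so the
    layers only need to learn the residual F(x) = H(x) - x."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]

# With all-zero weights the transform outputs zeros and the block is an
# exact identity mapping -- the reason stacking many such blocks does
# not degrade the signal the way plain deep stacks do.
zero_transform = lambda xs: relu([0.0 * v for v in xs])
passed_through = residual_block([1.0, -2.0, 3.0], zero_transform)
```

Learning to do nothing (output zero) is easy; learning an exact identity through nonlinear layers is not, which is the paper's explanation for why residual nets train at depths where plain nets fail.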

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
source-summary

Amodei et al., ICML, 2016. 📄 **[Read on arXiv](https://arxiv.org/abs/1512.02595)** Deep Speech 2 is an end-to-end speech recognition system where a single RNN trained with CTC loss on spectrograms replaces the entire tr…

Denoising Diffusion Probabilistic Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2006.11239)** Ho, Jain, and Abbeel, NeurIPS, 2020. - [Paper](https://arxiv.org/abs/2006.11239) Denoising Diffusion Probabilistic Models (DDPM) demonstrates that high-quality ima…
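The forward (noising) process has a closed form that a short sketch makes concrete. The linear beta schedule below uses the paper's reported range (1e-4 to 0.02 over 1000 steps); the tiny example data are arbitrary.

```python
import math
import random

def alpha_bar_schedule(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) under a linear beta schedule,
    i.e. alpha_bar_t, which controls how much signal survives at step t."""
    alpha_bar, out = 1.0, []
    for t in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * t / (timesteps - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Closed-form forward noising:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for x in x0]

bars = alpha_bar_schedule()
rng = random.Random(0)
noised = q_sample([1.0, 1.0, 1.0], 999, bars, rng)  # near-pure Gaussian noise
```

Because alpha_bar_t decays toward zero, any x_t can be sampled in one step from x_0, which is what makes the simple noise-prediction training objective tractable.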

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
source-summary

[Read on arXiv](https://arxiv.org/abs/2502.05855) DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2411.15139) DiffusionDrive (HUST/Horizon Robotics, CVPR 2025 Highlight) proposes a truncated diffusion model for end-to-end autonomous driving that achieves real-time inference whil…

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
source-summary

**[Read on arXiv](https://arxiv.org/abs/2503.19757)** Dita introduces a scalable framework that leverages full Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffu…

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2309.10228)** Drive as You Speak (DAYS) proposes a framework for enabling natural language interaction between human passengers and autonomous vehicles using large language mode…

Drive-OccWorld: Driving in the Occupancy World
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2408.14197)** Drive-OccWorld introduces a vision-centric 4D occupancy forecasting world model that directly integrates with end-to-end planning. The core premise is that current…

DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2309.09777) DriveDreamer (ECCV 2024) is the first world model built entirely from real-world driving data, addressing fundamental limitations of prior approaches that relied on simu…

DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
source-summary

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao, IEEE Robotics and Automation Letters, 2024. 📄 **[Read on arXiv](https://arxiv.org/abs/2310.01412)** DriveGPT4 applie…

DriveGPT: Scaling Autoregressive Behavior Models for Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2412.14415) DriveGPT (Cruise, ICML 2025) is the first work to systematically study scaling laws for autoregressive behavior models in autonomous driving. Drawing inspiration from th…

DriveLM: Driving with Graph Visual Question Answering
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.14150)** DriveLM formalizes driving reasoning as Graph Visual Question Answering (GVQA), where QA pairs are connected via logical dependencies forming a reasoning graph tha…

DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09245)** DriveMLM proposes using a multimodal LLM as a plug-and-play behavioral planning module within existing autonomous driving stacks (Apollo, Autoware), rather than re…

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.16278)** DriveMoE introduces a dual-level Mixture-of-Experts (MoE) architecture to driving Vision-Language-Action models. The key innovation is applying expert specializati…

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2402.12289)** DriveVLM proposes a hierarchical approach to integrating Vision-Language Models into autonomous driving, emphasizing scene understanding and multi-level planning r…

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2310.01957) Driving with LLMs (Wayve, ICRA 2024) is one of the first concrete demonstrations of using a large language model as the decision-making "brain" for autonomous driving. T…

DrivoR: Driving on Registers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2601.05083)** DrivoR is a full-transformer autonomous driving architecture that uses camera-aware register tokens to compress multi-camera Vision Transformer features into a com…

EMMA: End-to-End Multimodal Model for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…

End to End Learning for Self-Driving Cars
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1604.07316)** This paper from NVIDIA, commonly known as "DAVE-2" or the "NVIDIA end-to-end driving paper," demonstrates that a single convolutional neural network can learn to m…

End-to-end Driving via Conditional Imitation Learning
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1710.02410)** This paper introduces conditional imitation learning for end-to-end autonomous driving, where a neural network policy is conditioned on a discrete high-level comma…

FAST: Efficient Action Tokenization for Vision-Language-Action Models
source-summary

[Read on arXiv](https://arxiv.org/abs/2501.09747) FAST (Frequency-space Action Sequence Tokenization) introduces a novel action tokenizer for VLA models that leverages signal processing to dramatically compress robot ac…

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.17288)** GaussianFlowOcc (ICCV 2025) introduces a transformative approach to 3D semantic occupancy estimation for autonomous driving by replacing traditional…

GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
source-summary

**[Read on arXiv](https://arxiv.org/abs/2412.04384)** GaussianFormer-2 addresses 3D semantic occupancy prediction for vision-centric autonomous driving by rethinking how 3D Gaussians represent occupied space. The origin…

GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01957)** Bird's-Eye View (BEV) perception faces a fundamental trade-off between accuracy and computational efficiency. High-performing 3D projection methods like BEVFormer…

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11447)** GaussianOcc by Gan et al. (University of Tokyo / RIKEN / South China University of Technology / SIAT-CAS) is a systematic method that applies Gaussi…

GaussRender: Learning 3D Occupancy with Gaussian Rendering
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.05040)** GaussRender by Chambon et al. (Valeo AI / Sorbonne, ICCV 2025) introduces a plug-and-play training-time module that improves 3D occupancy prediction…

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
source-summary

**[Read on arXiv](https://arxiv.org/abs/2412.13193)** GaussTR is a Gaussian-based Transformer framework that achieves zero-shot semantic occupancy prediction without any 3D annotations. The key idea is to combine sparse…

GenAD: Generative End-to-End Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2402.11502) GenAD (ECCV 2024) reframes end-to-end autonomous driving as a generative modeling problem, simultaneously generating future trajectories for all traffic participants rat…

GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
source-summary

[Read on arXiv](https://arxiv.org/abs/2503.05689) GoalFlow (Horizon Robotics / HKU, CVPR 2025) introduces a goal-driven flow matching framework for multimodal trajectory generation in autonomous driving. The method achi…

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1811.06965)** GPipe introduces micro-batch pipeline parallelism as a practical method for training neural networks too large to fit on a single accelerator. The core idea is to…
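GPipe's central trade-off, that more micro-batches shrink the pipeline "bubble" of idle accelerator time, reduces to one formula, sketched below (a schedule-level approximation, ignoring per-stage load imbalance and communication).

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule: with K stages and M
    micro-batches, bubble overhead is O((K - 1) / (M + K - 1))."""
    k, m = num_stages, num_microbatches
    return (k - 1) / (m + k - 1)

# Splitting each mini-batch into more micro-batches keeps the pipeline
# full: 4 stages go from ~43% idle at M=4 to ~9% idle at M=32.
bubble_small = pipeline_bubble_fraction(4, 4)    # 3/7  ~ 0.43
bubble_large = pipeline_bubble_fraction(4, 32)   # 3/35 ~ 0.086
```

This is why GPipe pairs pipeline parallelism with micro-batching (and re-materialization for memory) rather than simply chaining stages.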

GPT-Driver: Learning to Drive with GPT
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2310.01415)** GPT-Driver reformulates autonomous driving motion planning as a language modeling problem. Scene context (object positions, velocities, lane geometry) and ego vehi…
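The "planning as language modeling" move amounts to serializing waypoints into text the model can generate token by token. A minimal round-trip sketch; the specific token format here is an assumption for illustration, not GPT-Driver's exact prompt scheme.

```python
def trajectory_to_text(waypoints, precision=2):
    """Serialize (x, y) waypoints into a token string, illustrating how a
    language-model planner can emit a trajectory as text."""
    return " ".join(f"({x:.{precision}f},{y:.{precision}f})" for x, y in waypoints)

def text_to_trajectory(text):
    """Inverse: parse generated tokens back into numeric waypoints."""
    points = []
    for tok in text.split():
        x, y = tok.strip("()").split(",")
        points.append((float(x), float(y)))
    return points

plan = [(0.0, 0.0), (0.52, 2.01), (1.13, 4.05)]
encoded = trajectory_to_text(plan)   # "(0.00,0.00) (0.52,2.01) (1.13,4.05)"
decoded = text_to_trajectory(encoded)
```

The fixed decimal precision doubles as a crude tokenizer-friendly discretization, which is part of why coordinate formatting choices matter in this line of work.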

Helix: A Vision-Language-Action Model for Generalist Humanoid Control
source-summary

📄 **[Read at Figure AI](https://www.figure.ai/news/helix)** Helix (Figure AI, Technical Report February 2025) is the first vision-language-action model to achieve high-rate continuous control of an entire…

Identity Mappings in Deep Residual Networks
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1603.05027)** This paper, a follow-up to the original ResNet work, provides both theoretical analysis and empirical evidence that the arrangement of operations within residual b…

ImageNet Classification with Deep Convolutional Neural Networks
source-summary

📄 **[Read Paper](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)** AlexNet, as this paper's architecture came to be known, is a deep convolutional neural network trained on GPUs th…

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
source-summary

[Read on arXiv](https://arxiv.org/abs/2312.03031) This paper (CVPR 2024, NVIDIA / Nanjing University) delivers a "wake-up call" to the autonomous driving research community by demonstrating that simple baselines using o…

Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
source-summary

📄 **[Read Paper](https://www.cs.toronto.edu/~hinton/absps/colt93.pdf)** This paper by Hinton and van Camp bridges information theory and neural network generalization by proposing that model complexity should be measure…

Knowledge Insulating Vision-Language-Action Models
source-summary

[Read on arXiv](https://arxiv.org/abs/2505.23705) This paper from Physical Intelligence identifies and addresses a critical problem in VLA training: gradient interference causes the pre-trained VLM backbone to degrade w…

Kolmogorov Complexity and Algorithmic Randomness
source-summary

📄 **[AMS Book Page](https://bookstore.ams.org/surv-220)** This monograph by Shen, Uspensky, and Vereshchagin is the definitive modern reference on algorithmic information theory. The central concept is Kolmogorov comple…

Language Models are Few-Shot Learners
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2005.14165)** GPT-3 is a 175 billion parameter autoregressive language model that demonstrated a remarkable emergent capability: in-context learning, where the model performs ne…

LAW: Enhancing End-to-End Autonomous Driving with Latent World Model
source-summary

[Read on arXiv](https://arxiv.org/abs/2406.08481) LAW (CASIA, ICLR 2025) introduces a self-supervised latent world model that enhances end-to-end autonomous driving by learning to predict future latent states of the dri…

Learning by Cheating
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1912.12294)** Learning by Cheating introduces a two-stage training paradigm for end-to-end autonomous driving that has become one of the most influential design patterns in the…

Learning Lane Graph Representations for Motion Forecasting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2007.13732)** LaneGCN introduces a graph neural network architecture for motion forecasting in autonomous driving that operates directly on the lane graph structure of HD maps.…

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2008.05711)** Lift, Splat, Shoot (LSS) introduced a differentiable pipeline for transforming multi-camera images into a unified bird's-eye view (BEV) representation without requ…

LMDrive: Closed-Loop End-to-End Driving with Large Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07488)** LMDrive is the first system to demonstrate and benchmark LLM-based driving in closed-loop simulation, introducing the LangAuto benchmark with ~64K instruction-foll…

Machine Super Intelligence
source-summary

📄 **[Read Thesis](https://www.vetta.org/documents/Machine_Super_Intelligence.pdf)** Shane Legg's 2008 PhD thesis provides perhaps the most rigorous mathematical definition of general intelligence, grounding informal int…

Multi-Scale Context Aggregation by Dilated Convolutions
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1511.07122)** This paper introduced dilated (atrous) convolutions as a principled alternative to the downsample-then-upsample paradigm for dense prediction tasks. By inserting g…
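The exponential receptive-field growth is simple arithmetic worth seeing once. A 1-D sketch of how stacked dilated 3x3 (here 3-tap) convolutions expand coverage without any downsampling:

```python
def receptive_field(dilations, kernel_size=3):
    """1-D receptive field of stacked dilated convolutions: each layer
    adds (kernel_size - 1) * dilation positions to the field."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Doubling dilations give exponential receptive-field growth with a
# linear number of layers, at full resolution throughout:
rf_plain = receptive_field([1, 1, 1, 1])      # 4 ordinary layers -> 9
rf_dilated = receptive_field([1, 2, 4, 8])    # 4 dilated layers  -> 31
```

This is the arithmetic behind the paper's context module: wide context for dense prediction without the resolution loss of pooling.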

Neural Machine Translation by Jointly Learning to Align and Translate
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1409.0473)** This paper introduced the attention mechanism to deep learning, arguably the single most influential architectural innovation leading to modern transformers and LLM…

Neural Message Passing For Quantum Chemistry
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1704.01212)** This paper provided the conceptual unification that the graph neural network field needed. By showing that seemingly different architectures -- GCN, GraphSAGE, Gat…

Neural Turing Machines
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1410.5401)** Neural Turing Machines (NTMs) augment neural networks with a differentiable external memory matrix and soft attention-based read/write heads, enabling them to learn…

nuScenes: A Multimodal Dataset for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1903.11027)** nuScenes is a large-scale multimodal dataset for autonomous driving that provides synchronized data from 6 cameras (360-degree coverage), 1 LiDAR, 5 radars, GPS, a…

OccMamba: Semantic Occupancy Prediction with State Space Models
source-summary

**[Read on arXiv](https://arxiv.org/abs/2408.09859)** OccMamba is the first Mamba-based network for semantic occupancy prediction, replacing transformer architectures' quadratic complexity with Mamba's linear complexity…

OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
source-summary

[Read on arXiv](https://arxiv.org/abs/2502.19645) OpenVLA-OFT presents a systematic empirical study of fine-tuning strategies for Vision-Language-Action models, identifying a recipe that boosts the original OpenVLA from…

OpenVLA: An Open-Source Vision-Language-Action Model
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2406.09246)** OpenVLA is a 7-billion parameter open-source vision-language-action model that demonstrates generalist robotic manipulation by fine-tuning a pretrained vision-lang…

Order Matters: Sequence to Sequence for Sets
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1511.06391)** This paper by Samy Bengio, Oriol Vinyals, and Manjunath Kudlur challenges a core assumption in sequence modeling: that the order of input and output data is merely…

ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19755)** ORION bridges the reasoning-action gap in driving VLAs through a three-component architecture consisting of QT-Former (visual encoding), an LLM reasoning core, and…

PaLM-E: An Embodied Multimodal Language Model
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2303.03378)** PaLM-E is a 562-billion parameter embodied multimodal language model created by Google that injects continuous sensor observations (images, point clouds, robot sta…

PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving
source-summary

[Read on CVF Open Access](https://openaccess.thecvf.com/content/CVPR2024/html/Weng_PARA-Drive_Parallelized_Architecture_for_Real-time_Autonomous_Driving_CVPR_2024_paper.html) PARA-Drive (NVIDIA Research / USC / Stanford…

pi*0.6: A VLA That Learns From Experience
source-summary

[Read on arXiv](https://arxiv.org/abs/2511.14759) pi*0.6 extends the pi0/pi0.5/pi0.6 VLA family with the ability to learn from autonomous deployment experience using reinforcement learning. While prior models learn prim…

pi0.5: A Vision-Language-Action Model with Open-World Generalization
source-summary

[Read on arXiv](https://arxiv.org/abs/2504.16054) pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…

pi0: A Vision-Language-Action Flow Model for General Robot Control
source-summary

[Read on arXiv](https://arxiv.org/abs/2410.24164) pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…

Planning-oriented Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2212.10156)** UniAD (Unified Autonomous Driving) is a planning-oriented end-to-end framework that unifies perception, prediction, and planning into a single differentiable netwo…

Pointer Networks
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1506.03134)** Pointer Networks repurpose the attention mechanism as an output distribution, replacing the fixed output vocabulary of sequence-to-sequence models with attention w…
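The mechanism is just a softmax over input positions instead of a fixed vocabulary. A toy single-step sketch (the attention scores here are made up; in the model they come from comparing the decoder state with each encoder state):

```python
import math

def pointer_distribution(scores):
    """Softmax over attention scores, one per INPUT position: the output
    'vocabulary' is the input itself, so it scales with input length."""
    m = max(scores)  # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy decoding step over a 5-element input sequence: the decoder "emits"
# input index 3 rather than a token from a fixed output vocabulary.
probs = pointer_distribution([0.1, -1.2, 0.3, 2.4, 0.0])
picked = max(range(len(probs)), key=probs.__getitem__)
```

Because the output space is the input positions, the same trained model handles variable-length instances of problems like convex hull or TSP tours, which fixed-vocabulary seq2seq cannot.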

Pseudo-Simulation for Autonomous Driving (NAVSIM v2)
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2506.04218)** Pseudo-Simulation by Cao, Hallgarten et al. (Tübingen / Shanghai AI Lab / NVIDIA / Stanford, CoRL 2025) introduces a novel evaluation paradigm for a…

Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1405.6903)** This paper bridges thermodynamics and computational complexity to formalize a deep intuition: mixing cream into coffee produces increasingly complex patterns (swirl…

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2412.12725)** RaCFormer by Chu et al. (USTC, CVPR 2025) addresses a fundamental problem in radar-camera fusion for 3D object detection: the image-to-BEV transform…

RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
source-summary

[Read on arXiv](https://arxiv.org/abs/2410.07864) RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation --…

ReAct: Synergizing Reasoning and Acting in Language Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2210.03629)** Large language models had demonstrated two powerful capabilities in isolation: chain-of-thought reasoning for multi-step problem solving, and action generation for…

Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.03661)** Reason2Drive provides the largest reasoning chain dataset for driving (>600K video-text pairs from nuScenes, Waymo, and ONCE) and introduces an aggregated evaluati…

Recurrent Neural Network Regularization
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1409.2329)** This paper discovered that dropout can be successfully applied to LSTMs if it is restricted to non-recurrent (feedforward) connections only, preserving the LSTM's a…

Relational Recurrent Neural Networks
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1806.01822)** Traditional RNNs (LSTMs, GRUs) compress all sequential information into a single fixed-size hidden vector, which fundamentally limits their ability to store and re…

Robotic Control via Embodied Chain-of-Thought Reasoning
source-summary

[Read on arXiv](https://arxiv.org/abs/2407.08693) ECoT (UC Berkeley / Stanford / University of Warsaw, 2024) introduces Embodied Chain-of-Thought reasoning for Vision-Language-Action (VLA) models, demonstrating that gen…

RT-1: Robotics Transformer for Real-World Control at Scale
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2307.15818)** RT-2 is the defining paper for the modern Vision-Language-Action (VLA) paradigm. It demonstrates that large vision-language models (VLMs) pretrained on internet-sc…

S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.24139)** S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-vi…

Scaling Laws for Neural Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2001.08361)** This is the canonical early scaling-law paper for language models, authored by Kaplan et al. at OpenAI. It demonstrated that neural language model cross-entropy lo…
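The parameter-count law has a one-line form, L(N) = (N_c / N)^alpha. The sketch below uses the constants the paper reports for non-embedding parameters (N_c ≈ 8.8e13, alpha ≈ 0.076); treat them as illustrative fits, not exact values.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan-style parameter scaling law L(N) = (N_c / N)^alpha.
    Constants are the paper's reported fit; illustrative only."""
    return (n_c / n_params) ** alpha

# The power law means every 10x in parameters multiplies loss by the
# same constant factor, 10**(-alpha) ~ 0.84 -- smooth, predictable gains.
small = power_law_loss(1e8)
large = power_law_loss(1e9)
ratio = large / small
```

The practical upshot the paper draws from curves like this: because improvement is smooth in N, D, and compute, one can extrapolate from cheap small-scale runs to plan large training budgets.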

Self-Improving Embodied Foundation Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.22313)** Two dominant paradigms exist in autonomous driving: large vision-language models (LVLMs) with strong reasoning but poor trajectory precision, and end-to-end (E2E)…

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.09594)** Many driving VLM efforts improve language understanding (VQA, scene descriptions) but sacrifice actual driving performance. A model can correctly answer questions…

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2506.01844)** SmolVLA is a 450M-parameter open-source VLA model from Hugging Face that demonstrates competitive performance with models 10x larger while being trainable on a singl…

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.16805)** SOLVE proposes a synergistic framework that combines a Vision-Language Model (VLM) reasoning branch (SOLVE-VLM) with an end-to-end (E2E) driving network (SOLVE-E2E), con…

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2405.19620)** SparseDrive by Sun et al. (ICRA 2025) proposes a paradigm shift from dense BEV-based end-to-end driving to fully sparse scene representations. The c…

SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2603.29163)** SparseDriveV2 by Sun et al. (2026) pushes the performance boundary of scoring-based trajectory planning by demonstrating that "scoring is all you ne…

SpatialVLA: Exploring Spatial Representations for VLA Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2501.15830)** SpatialVLA addresses a fundamental limitation of existing VLA models: they operate on 2D visual inputs despite robot manipulation requiring understanding of 3D spatial r…

Talk2Car: Taking Control of Your Self-Driving Car
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1909.10838)** For autonomous vehicles to be truly useful as personal transportation, passengers should be able to issue natural-language commands like "park behind that blue car…

Textual Explanations for Self-Driving Vehicles
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1807.11546)** End-to-end driving models produce control signals without any rationale, making them opaque and untrustworthy for safety-critical deployment. This paper by Kim et…

The First Law of Complexodynamics
source-summary

📄 **[Read Blog Post](https://scottaaronson.blog/?p=762)** Scott Aaronson's blog post highlights an asymmetry between entropy and complexity as a way of thinking about structure formation in physical and computational sy…

The Unreasonable Effectiveness of Recurrent Neural Networks
source-summary

📄 **[Read Blog Post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)** Andrej Karpathy's 2015 blog post offers a vivid qualitative demonstration that character-level recurrent neural networks with LSTM cells c…
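The generation procedure the post demonstrates boils down to repeatedly sampling the next character from a temperature-scaled softmax over the RNN's output logits. A minimal sketch of that sampling step (not Karpathy's code; the logits here stand in for an RNN's output):

```python
import numpy as np

def sample_char(logits, temperature=1.0, rng=None):
    """Sample one character index from raw logits, char-RNN style.

    Lower temperature sharpens the distribution (more conservative text);
    higher temperature flattens it (more surprising, error-prone text).
    """
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax over the character vocabulary
    return rng.choice(len(p), p=p)
```

In a full generation loop, the sampled index is fed back as the next input, which is how the post's Shakespeare, C code, and LaTeX samples are produced one character at a time.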

Towards Embodiment Scaling Laws in Robot Locomotion
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.05753)** This paper investigates whether increasing robot diversity during training improves generalization to unseen robots, analogous to how data scaling improves language…

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2205.15997)** TransFuser (Chitta et al., 2022) is a foundational paper for transformer-based sensor fusion in end-to-end autonomous driving. The key problem it addresses is how…

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2305.10601)** Language models are typically used in a left-to-right token-generation mode, which limits their ability to explore alternative reasoning paths or backtrack from mi…
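The core idea can be sketched as a breadth-first search over partial "thoughts" with a scoring heuristic pruning the frontier. In the paper both expansion and scoring are LLM calls; here they are plain functions on a toy string-assembly task, so this is an illustration of the search skeleton only:

```python
def tot_bfs(initial, expand, score, steps, beam=3):
    """Toy Tree-of-Thoughts BFS: expand each partial thought, score the
    candidates, and keep only the best few (the beam) at every depth."""
    frontier = [initial]
    for _ in range(steps):
        candidates = [c for s in frontier for c in expand(s)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)

# Toy task: assemble the word "tree" one letter at a time.
target = "tree"
expand = lambda s: [s + ch for ch in "abcdefghijklmnopqrstuvwxyz"]
score = lambda s: sum(a == b for a, b in zip(s, target))
print(tot_bfs("", expand, score, steps=len(target)))  # → tree
```

Unlike left-to-right decoding, low-scoring branches are dropped at each depth, which is the mechanism that lets the method recover from early missteps.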

Understanding LSTM Networks
source-summary

📄 **[Read Blog Post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)** Christopher Olah's 2015 blog post is a widely used pedagogical reference for understanding LSTM internals. The post explains why vanilla…
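The gate equations the post walks through can be condensed into a single step function. This is a minimal numpy sketch of the standard LSTM cell, not code from the post; the gate ordering (input, forget, candidate, output) is an assumption here and varies between libraries:

```python
import numpy as np

def lstm_step(x, h, c, W, b):
    """One standard LSTM step: gates decide what to forget from the cell
    state, what new information to write, and what to expose as output.

    W has shape (4*H, X+H) stacking all four gates; b has shape (4*H,).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # forget old, add new
    h_new = sigmoid(o) * np.tanh(c_new)               # gated hidden output
    return h_new, c_new

# Shapes: 3-dim input, 2-dim hidden/cell state.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((8, 5)), np.zeros(8)
h, c = lstm_step(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, b)
```

The additive cell-state update `sigmoid(f) * c + ...` is the detail the post emphasizes: gradients flow through it without repeated squashing, which is what mitigates the vanishing-gradient problem of vanilla RNNs.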

UniAct: Universal Actions for Enhanced Embodied Foundation Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2501.10105)** UniAct addresses a critical challenge in embodied AI: robot action data suffers from severe heterogeneity across platforms, control interfaces, and physical embodime…

VAD: Vectorized Scene Representation for Efficient Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2303.12077)** VAD (Vectorized Scene Representation for Efficient Autonomous Driving) by Jiang et al. (ICCV 2023) is a pivotal paper in the shift from dense rasterized scene repr…

Variational Lossy Autoencoder
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1611.02731)** The Variational Lossy Autoencoder (VLAE) by Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Sutskever, and Abbeel (2016) addresses the fundamental tension in VAE…

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2005.04259)** VectorNet (Gao et al., Waymo/Google, CVPR 2020) is a foundational paper that moved motion prediction and map encoding away from rasterized image-based representati…
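The shift away from rasterization can be seen in how a map element is encoded: instead of drawing a lane onto an image, each polyline becomes a small set of vectors. A simplified sketch of that encoding (the feature layout start-xy, end-xy, polyline-id is an illustrative reduction of the paper's richer attribute set):

```python
import numpy as np

def polyline_to_vectors(points, poly_id):
    """VectorNet-style encoding: a polyline becomes the set of vectors
    between consecutive points, each tagged with its polyline id so a
    graph network can group vectors belonging to the same map element."""
    pts = np.asarray(points, dtype=float)
    starts, ends = pts[:-1], pts[1:]
    ids = np.full((len(starts), 1), poly_id, dtype=float)
    return np.hstack([starts, ends, ids])  # shape (n_points - 1, 5)

lane = polyline_to_vectors([(0, 0), (1, 0), (2, 1)], poly_id=7)
print(lane.shape)  # → (2, 5)
```

Because the representation is a set of feature vectors rather than pixels, the downstream network can operate directly on map geometry at full precision instead of on a lossy rendered image.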

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14803)** Video Prediction Policy (VPP) by Hu, Guo et al. (ICML 2025 Spotlight) proposes that video diffusion models (VDMs) are not just generators of future…

VLP: Vision Language Planning for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2401.05577)** VLP (Vision Language Planning) by Pan et al. (CVPR 2024) represents a fundamentally different approach to using language in autonomous driving compared to instruct…

WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01941)** End-to-end driving models typically output a single trajectory and trust it entirely, with no mechanism to evaluate whether the predicted path is safe before execu…