Research Timeline
Publications grouped by research direction and year.
VLA / Driving
2018
End-to-end Driving via Conditional Imitation Learning
Textual Explanations for Self-Driving Vehicles
2019
Talk2Car: Taking Control of Your Self-Driving Car
2023
DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
GPT-Driver: Learning to Drive with GPT
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
2024
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
DriveLM: Driving with Graph Visual Question Answering
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
VLP: Vision Language Planning for Autonomous Driving
2025
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
EMMA: End-to-End Multimodal Model for Autonomous Driving
OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision-Language-Action Model
ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
End-to-End
2016
End to End Learning for Self-Driving Cars
2018
End-to-end Driving via Conditional Imitation Learning
2019
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
Learning by Cheating
2022
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
2023
DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
2024
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Octo: An Open-Source Generalist Robot Policy
RT-H: Action Hierarchies Using Language
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
RoboVLMs: What Matters in Building Vision-Language-Action Models
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2025
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2026
DrivoR: Driving on Registers
Perception
2020
Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
2022
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
2023
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
FB-BEV: BEV Representation from Forward-Backward View Transformations
FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
2024
BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
Drive-OccWorld: Driving in the Occupancy World
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
SparseOcc: Fully Sparse 3D Occupancy Prediction
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
VLP: Vision Language Planning for Autonomous Driving
YOLOv10: Real-Time End-to-End Object Detection
2025
BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
EMMA: End-to-End Multimodal Model for Autonomous Driving
GaussRender: Learning 3D Occupancy with Gaussian Rendering
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
OccMamba: Semantic Occupancy Prediction with State Space Models
S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2026
DrivoR: Driving on Registers
Planning
2019
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
2023
Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
GPT-Driver: Learning to Drive with GPT
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Planning-oriented Autonomous Driving
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
VAD: Vectorized Scene Representation for Efficient Autonomous Driving
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
2024
Agent-Driver: A Language Agent for Autonomous Driving
AsyncDriver: Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
Drive-OccWorld: Driving in the Occupancy World
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
Talk2Drive: Towards Personalized Autonomous Driving with Large Language Models
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
VLP: Vision Language Planning for Autonomous Driving
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
2025
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
BridgeAD: Bridging Past and Future End-to-End Autonomous Driving with Historical Prediction
CarPlanner: Consistent Auto-regressive RL Planner for Autonomous Driving
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
EMMA: End-to-End Multimodal Model for Autonomous Driving
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
MomAD: Momentum-Aware Planning in End-to-End Autonomous Driving
ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2026
DrivoR: Driving on Registers
SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
Prediction
Foundation Models
2017
Attention Is All You Need
2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
2020
Language Models are Few-Shot Learners
Scaling Laws for Neural Language Models
2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Emerging Properties in Self-Supervised Vision Transformers (DINO)
Learning Transferable Visual Models From Natural Language Supervision
On the Opportunities and Risks of Foundation Models
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Flamingo: a Visual Language Model for Few-Shot Learning
High-Resolution Image Synthesis with Latent Diffusion Models
LoRA: Low-Rank Adaptation of Large Language Models
PaLM: Scaling Language Modeling with Pathways
RT-1: Robotics Transformer for Real-World Control at Scale
Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)
Training Compute-Optimal Large Language Models
Training Language Models to Follow Instructions with Human Feedback
2023
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
GPT-4 Technical Report
Llama 2: Open Foundation and Fine-Tuned Chat Models
Mistral 7B
QLoRA: Efficient Finetuning of Quantized LLMs
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Segment Anything
Toolformer: Language Models Can Teach Themselves to Use Tools
Visual Instruction Tuning (LLaVA)
2024
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
Mixtral of Experts
Octo: An Open-Source Generalist Robot Policy
RT-H: Action Hierarchies Using Language
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
RoboVLMs: What Matters in Building Vision-Language-Action Models
SAM 2: Segment Anything in Images and Videos
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemma 3 Technical Report
Qwen3 Technical Report
Robotics
2021
On the Opportunities and Risks of Foundation Models
2022
A Generalist Agent
RT-1: Robotics Transformer for Real-World Control at Scale
2023
PaLM-E: An Embodied Multimodal Language Model
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
2024
3D-VLA: A 3D Vision-Language-Action Generative World Model
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Octo: An Open-Source Generalist Robot Policy
OpenVLA: An Open-Source Vision-Language-Action Model
RT-H: Action Hierarchies Using Language
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
RoboVLMs: What Matters in Building Vision-Language-Action Models
Robotic Control via Embodied Chain-of-Thought Reasoning
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
UniSim: Learning Interactive Real-World Simulators
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
pi0: A Vision-Language-Action Flow Model for General Robot Control
2025
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Gemini Robotics: Bringing AI into the Physical World
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Knowledge Insulating Vision-Language-Action Models
OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
Self-Improving Embodied Foundation Models
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SpatialVLA: Exploring Spatial Representations for VLA Models
Towards Embodiment Scaling Laws in Robot Locomotion
UniAct: Universal Actions for Enhanced Embodied Foundation Models
pi*0.6: A VLA That Learns From Experience
pi0.5: A Vision-Language-Action Model with Open-World Generalization
Ilya Sutskever's Top 30
1993
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
2004
A Tutorial Introduction to the Minimum Description Length Principle
2008
Machine Super Intelligence
2011
The First Law of Complexodynamics
2012
ImageNet Classification with Deep Convolutional Neural Networks
2014
Neural Machine Translation by Jointly Learning to Align and Translate
Neural Turing Machines
Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
Recurrent Neural Network Regularization
2015
CS231n: Deep Learning for Computer Vision
Deep Residual Learning for Image Recognition
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Multi-Scale Context Aggregation by Dilated Convolutions
Pointer Networks
The Unreasonable Effectiveness of Recurrent Neural Networks
Understanding LSTM Networks
2016
Identity Mappings in Deep Residual Networks
Order Matters: Sequence to Sequence for Sets
Variational Lossy Autoencoder
2017
A Simple Neural Network Module for Relational Reasoning
Attention Is All You Need
Kolmogorov Complexity and Algorithmic Randomness
Neural Message Passing for Quantum Chemistry
2018
Relational Recurrent Neural Networks
2019
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
2020
Denoising Diffusion Probabilistic Models
Scaling Laws for Neural Language Models
2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models