Research Timeline
Publications grouped by research direction and year.
VLA / Driving
2018
End-to-end Driving via Conditional Imitation Learning
Textual Explanations for Self-Driving Vehicles
2019
Talk2Car: Taking Control of Your Self-Driving Car
2023
DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
GPT-Driver: Learning to Drive with GPT
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
2024
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
DriveLM: Driving with Graph Visual Question Answering
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
VLP: Vision Language Planning for Autonomous Driving
2025
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
EMMA: End-to-End Multimodal Model for Autonomous Driving
OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision-Language-Action Model
ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
End-to-End
2016
End to End Learning for Self-Driving Cars
2018
End-to-end Driving via Conditional Imitation Learning
2019
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
Learning by Cheating
2022
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
2023
DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
2024
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Octo: An Open-Source Generalist Robot Policy
RT-H: Action Hierarchies Using Language
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
RoboVLMs: What Matters in Building Vision-Language-Action Models
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2025
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2026
DrivoR: Driving on Registers
Perception
2020
Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
2022
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
2023
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
FB-BEV: BEV Representation from Forward-Backward View Transformations
FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
2024
BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
Drive-OccWorld: Driving in the Occupancy World
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
SparseOcc: Fully Sparse 3D Occupancy Prediction
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
VLP: Vision Language Planning for Autonomous Driving
YOLOv10: Real-Time End-to-End Object Detection
2025
BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
EMMA: End-to-End Multimodal Model for Autonomous Driving
GaussRender: Learning 3D Occupancy with Gaussian Rendering
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
OccMamba: Semantic Occupancy Prediction with State Space Models
S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2026
DrivoR: Driving on Registers
Planning
2019
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
2023
Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
GPT-Driver: Learning to Drive with GPT
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Planning-oriented Autonomous Driving
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
VAD: Vectorized Scene Representation for Efficient Autonomous Driving
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
2024
Agent-Driver: A Language Agent for Autonomous Driving
AsyncDriver: Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
Drive-OccWorld: Driving in the Occupancy World
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
Talk2Drive: Towards Personalized Autonomous Driving with Large Language Models
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
VLP: Vision Language Planning for Autonomous Driving
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
2025
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
BridgeAD: Bridging Past and Future End-to-End Autonomous Driving with Historical Prediction
CarPlanner: Consistent Auto-regressive RL Planner for Autonomous Driving
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
EMMA: End-to-End Multimodal Model for Autonomous Driving
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
MomAD: Momentum-Aware Planning in End-to-End Autonomous Driving
ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
2026
DrivoR: Driving on Registers
SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
Prediction
Foundation Models
2017
Attention Is All You Need
2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
2020
Language Models are Few-Shot Learners
Scaling Laws for Neural Language Models
2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Emerging Properties in Self-Supervised Vision Transformers (DINO)
Learning Transferable Visual Models From Natural Language Supervision
On the Opportunities and Risks of Foundation Models
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Flamingo: a Visual Language Model for Few-Shot Learning
High-Resolution Image Synthesis with Latent Diffusion Models
LoRA: Low-Rank Adaptation of Large Language Models
PaLM: Scaling Language Modeling with Pathways
RT-1: Robotics Transformer for Real-World Control at Scale
Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)
Training Compute-Optimal Large Language Models
Training Language Models to Follow Instructions with Human Feedback
2023
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
GPT-4 Technical Report
Llama 2: Open Foundation and Fine-Tuned Chat Models
Mistral 7B
QLoRA: Efficient Finetuning of Quantized LLMs
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Segment Anything
Toolformer: Language Models Can Teach Themselves to Use Tools
Visual Instruction Tuning (LLaVA)
2024
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
Mixtral of Experts
Octo: An Open-Source Generalist Robot Policy
RT-H: Action Hierarchies Using Language
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
RoboVLMs: What Matters in Building Vision-Language-Action Models
SAM 2: Segment Anything in Images and Videos
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemma 3 Technical Report
Qwen3 Technical Report
Robotics
2021
On the Opportunities and Risks of Foundation Models
2022
A Generalist Agent
RT-1: Robotics Transformer for Real-World Control at Scale
2023
PaLM-E: An Embodied Multimodal Language Model
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
2024
3D-VLA: A 3D Vision-Language-Action Generative World Model
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Octo: An Open-Source Generalist Robot Policy
OpenVLA: An Open-Source Vision-Language-Action Model
RT-H: Action Hierarchies Using Language
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
RoboVLMs: What Matters in Building Vision-Language-Action Models
Robotic Control via Embodied Chain-of-Thought Reasoning
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
UniSim: Learning Interactive Real-World Simulators
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
pi0: A Vision-Language-Action Flow Model for General Robot Control
2025
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Gemini Robotics: Bringing AI into the Physical World
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Knowledge Insulating Vision-Language-Action Models
OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
Self-Improving Embodied Foundation Models
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SpatialVLA: Exploring Spatial Representations for VLA Models
Towards Embodiment Scaling Laws in Robot Locomotion
UniAct: Universal Actions for Enhanced Embodied Foundation Models
pi*0.6: A VLA That Learns From Experience
pi0.5: A Vision-Language-Action Model with Open-World Generalization
Ilya Sutskever's Top 30
1993
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
2004
A Tutorial Introduction to the Minimum Description Length Principle
2008
Machine Super Intelligence
2011
The First Law of Complexodynamics
2012
ImageNet Classification with Deep Convolutional Neural Networks
2014
Neural Machine Translation by Jointly Learning to Align and Translate
Neural Turing Machines
Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
Recurrent Neural Network Regularization
2015
CS231n: Deep Learning for Computer Vision
Deep Residual Learning for Image Recognition
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Multi-Scale Context Aggregation by Dilated Convolutions
Pointer Networks
The Unreasonable Effectiveness of Recurrent Neural Networks
Understanding LSTM Networks
2016
Identity Mappings in Deep Residual Networks
Order Matters: Sequence to Sequence for Sets
Variational Lossy Autoencoder
2017
A Simple Neural Network Module for Relational Reasoning
Attention Is All You Need
Kolmogorov Complexity and Algorithmic Randomness
Neural Message Passing for Quantum Chemistry
2018
Relational Recurrent Neural Networks
2019
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
2020
Denoising Diffusion Probabilistic Models
Scaling Laws for Neural Language Models
2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models