Tags

239 tags across the wiki

paper 114 autonomous-driving 92 foundation-model 55 transformer 53 vla 49 planning 42 robotics 41 computer-vision 36 perception 32 ilya-30 29 multimodal 29 nlp 26 end-to-end 24 language-modeling 24 llm 17 reasoning 17 imitation-learning 16 3d-occupancy 15 vlm 15 bev 14 diffusion 13 e2e 12 reinforcement-learning 12 world-model 12 chain-of-thought 10 benchmark 9 scaling 9 cross-embodiment 7 driving 6 gaussian-splatting 6 generative-models 6 image-classification 6 information-theory 6 questions 6 self-supervised 6 sources 6 alignment 5 attention 5 cnn 5 foundation 5 knowledge-distillation 5 language-model 5 prediction 5 simulation 5 evaluation 4 image-generation 4 instruction-tuning 4 mixture-of-experts 4 rnn 4 sequence-to-sequence 4 sparse-representation 4 video-prediction 4 explainability 3 flow-matching 3 lstm 3 map 3 occupancy 3 open-source 3 semantic-segmentation 3 sequence-modeling 3 trajectory-prediction 3 vectorized-representation 3 3d-detection 2 3d-perception 2 3d-reconstruction 2 action-representation 2 autonomy 2 autoregressive 2 bimanual 2 closed-loop 2 complexity-theory 2 dataset 2 deployment 2 distributed-training 2 efficient-inference 2 embodied 2 fine-tuning 2 foundation-models 2 foundational 2 gaussian-representation 2 generation 2 generative 2 human-interaction 2 humanoid 2 manipulation 2 memory-augmented-networks 2 ml 2 multi-camera 2 multilingual 2 object-detection 2 parameter-efficient-fine-tuning 2 prompting 2 real-time 2 regularization 2 relational-reasoning 2 residual-networks 2 rlhf 2 scaling-laws 2 segmentation 2 self-improvement 2 self-supervised-learning 2 state-space 2 systems 2 thermodynamics 2 vision-language-model 2 vision-transformer 2 visual-question-answering 2 zero-shot 2 3d 1 3d-scene 1 3d-semantic-occupancy 1 agenda 1 agentic 1 agi 1 algorithmic-information-theory 1 algorithmic-randomness 1 asynchronous 1 attention-mechanism 1 batch 1 bayesian-inference 1 behavior-forecasting 1 camera-fusion 1 classifier-guidance 1 
combinatorial-optimization 1 comparison 1 compression 1 computability 1 concept 1 contrastive-learning 1 control 1 convolutional-neural-networks 1 corpus 1 course 1 data-collection 1 decoupled 1 deep-learning 1 denoising 1 depth-estimation 1 dexterous-manipulation 1 differentiable-programming 1 diffusion-policy 1 diffusion-transformer 1 dilated-convolutions 1 dropout 1 efficient 1 embodied-ai 1 embodiment 1 emergent-abilities 1 end-to-end-learning 1 evaluation-metric 1 few-shot 1 few-shot-learning 1 foundations 1 frontend 1 gaussian 1 gaussian-rendering 1 generalist-agent 1 generalization 1 gpu-training 1 graph-neural-networks 1 grounding 1 grpo 1 hierarchical 1 high-frequency-control 1 hosting 1 ilya 1 image-captioning 1 image-text-retrieval 1 in-context-learning 1 inductive-bias 1 intelligence-measurement 1 interactive-annotation 1 interactive-segmentation 1 knowledge-preservation 1 kolmogorov-complexity 1 lanegcn 1 locomotion 1 machine-translation 1 mamba 1 mdl 1 message-passing 1 minimum-description-length 1 model-parallelism 1 model-predictive-control 1 model-selection 1 modular 1 molecular-property-prediction 1 multi-embodiment 1 multi-task 1 natural-language 1 neural-radiance-fields 1 neuro-symbolic 1 obsidian 1 open-world 1 optimization 1 orchestration 1 parallel-architecture 1 parameter-efficient 1 permutation-invariance 1 personalization 1 physical-ai 1 pipeline-parallelism 1 pointer-mechanism 1 privileged-supervision 1 probabilistic-planning 1 proprioception 1 quantization 1 queue 1 radar 1 recurrent-neural-networks 1 representation-learning 1 scene-understanding 1 search 1 seminal 1 sensor-fusion 1 set-modeling 1 siamese-networks 1 simulator 1 source 1 sparse-models 1 spatial-reasoning 1 speech-recognition 1 survey 1 synthesis 1 taxonomy 1 temporal 1 temporal-modeling 1 thesis 1 tokenization 1 tool-use 1 training 1 uniad 1 unified-stack 1 vanishing-gradients 1 variational-autoencoders 1 video-generation 1 video-understanding 1 visual-traces 1 vit 1

Pages tagged autonomous-driving

Agent-Driver: A Language Agent for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2311.10813)** Agent-Driver reframes autonomous driving as a cognitive agent problem, positioning a large language model as the central reasoning and planning engine rather than…

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
source-summary

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Marco Pavone + 37 co-authors (NVIDIA), arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2511.00088)** Alpamayo-R1 is NVIDIA's production-grade Vision-Language-Action (…

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
source-summary

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang, arXiv, 2025. 📄 **[Read on arXiv](https://arxiv.org/abs/2503.07608)** AlphaDrive is the first application of GRPO (Group Relative Policy Optimization) reinforc…

AsyncDriver: Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.14556)** AsyncDriver addresses the practical deployment problem of LLM-enhanced driving planners: LLMs are too slow for frame-by-frame planning. The key insight is that hig…

AutoVLA: Vision-Language-Action Model for End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2506.13757)** AutoVLA presents a unified approach to autonomous driving that integrates vision, language understanding, and action generation within a single autoregressive mode…

BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
source-summary

**[Read on arXiv](https://arxiv.org/abs/2502.19694)** BEVDiffuser addresses a fundamental but under-explored problem in BEV-based perception: the inherent noise in BEV feature maps caused by sensor limitations and the l…

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2211.10439)** BEVFormer v2 addresses a critical bottleneck in camera-based 3D perception for autonomous driving: the inability to leverage powerful modern 2D image backbones (e.…

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2203.17270)** Li, Wang, Li, Xie, Sima, Lu, Yu, Dai (Shanghai AI Lab / Nanjing University / HKU), ECCV, 2022. BEVFormer generates a un…

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
paper

📄 [arXiv:2312.01696](https://arxiv.org/abs/2312.01696) BEVNeXt revives dense BEV (bird's-eye-view) frameworks for camera-based 3D object detection, demonstrating that with the right design choices, dense approaches can…

BridgeAD: Bridging Past and Future End-to-End Autonomous Driving with Historical Prediction
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.14182)** BridgeAD tackles a critical limitation in end-to-end autonomous driving: the ineffective utilization of historical temporal information. Current systems either agg…

CARLA: An Open Urban Driving Simulator
source-summary

Dosovitskiy, Ros, Codevilla, Lopez, Koltun (Intel Labs / Toyota Research Institute / CVC Barcelona), CoRL, 2017. 📄 **[Read on arXiv](https://arxiv.org/abs/1711.03938)** CARLA (Car Learning to Act) is an open-source simu…

CarPlanner: Consistent Auto-regressive RL Planner for Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2502.19908) CarPlanner (Zhejiang University + Cainiao Network, CVPR 2025) introduces a consistent autoregressive reinforcement learning planner that is the first RL-based planner to…

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1812.03079)** Bansal, Krizhevsky, Ogale (Waymo Research), RSS, 2019. ChauffeurNet is Waymo's mid-level imitation learning system that…

CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2411.15139) DiffusionDrive (HUST/Horizon Robotics, CVPR 2025 Highlight) proposes a truncated diffusion model for end-to-end autonomous driving that achieves real-time inference whil…

DiMA: Distilling Multi-Modal Large Language Models for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2501.09757)** DiMA addresses the core tension in autonomous driving between vision-based planners (efficient but fragile on rare scenarios) and LLM-based approaches (strong reas…

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2309.10228)** Drive as You Speak (DAYS) proposes a framework for enabling natural language interaction between human passengers and autonomous vehicles using large language mode…

Drive-OccWorld: Driving in the Occupancy World
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2408.14197)** Drive-OccWorld introduces a vision-centric 4D occupancy forecasting world model that directly integrates with end-to-end planning. The core premise is that current…

DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2308.00398)** DriveAdapter (Jia et al., ICCV 2023) identifies and addresses a fundamental structural problem in end-to-end autonomous driving: the tight coupling between percept…

DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2309.09777) DriveDreamer (ECCV 2024) is the first world model built entirely from real-world driving data, addressing fundamental limitations of prior approaches that relied on simu…

DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
source-summary

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao, IEEE Robotics and Automation Letters, 2024. 📄 **[Read on arXiv](https://arxiv.org/abs/2310.01412)** DriveGPT4 applie…

DriveGPT: Scaling Autoregressive Behavior Models for Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2412.14415) DriveGPT (Cruise, ICML 2025) is the first work to systematically study scaling laws for autoregressive behavior models in autonomous driving. Drawing inspiration from th…

DriveLM: Driving with Graph Visual Question Answering
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.14150)** DriveLM formalizes driving reasoning as Graph Visual Question Answering (GVQA), where QA pairs are connected via logical dependencies forming a reasoning graph tha…

DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09245)** DriveMLM proposes using a multimodal LLM as a plug-and-play behavioral planning module within existing autonomous driving stacks (Apollo, Autoware), rather than re…

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.16278)** DriveMoE introduces a dual-level Mixture-of-Experts (MoE) architecture to driving Vision-Language-Action models. The key innovation is applying expert specializati…

DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.07656)** DriveTransformer represents a fundamental departure from existing end-to-end autonomous driving approaches. Rather than following sequential perception-prediction-…

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2402.12289)** DriveVLM proposes a hierarchical approach to integrating Vision-Language Models into autonomous driving, emphasizing scene understanding and multi-level planning r…

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Driving Scenes
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07920)** DrivingGaussian addresses photorealistic 3D scene reconstruction for dynamic autonomous driving environments using Gaussian splatting. The core challenge is that d…

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2310.01957) Driving with LLMs (Wayve, ICRA 2024) is one of the first concrete demonstrations of using a large language model as the decision-making "brain" for autonomous driving. T…

DrivoR: Driving on Registers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2601.05083)** DrivoR is a full-transformer autonomous driving architecture that uses camera-aware register tokens to compress multi-camera Vision Transformer features into a com…

EMMA: End-to-End Multimodal Model for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…

End to End Learning for Self-Driving Cars
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1604.07316)** This paper from NVIDIA, commonly known as "DAVE-2" or the "NVIDIA end-to-end driving paper," demonstrates that a single convolutional neural network can learn to m…

End-to-end Driving via Conditional Imitation Learning
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1710.02410)** This paper introduces conditional imitation learning for end-to-end autonomous driving, where a neural network policy is conditioned on a discrete high-level comma…

FB-BEV: BEV Representation from Forward-Backward View Transformations
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2308.02236)** FB-BEV addresses a fundamental tension in camera-based BEV perception for autonomous driving: **forward projection** methods (like Lift-Splat-Shoot) generate BEV f…

FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12058)** Occupancy prediction has emerged as a powerful perception paradigm for autonomous driving, predicting per-voxel semantic labels in 3D space to handle arbitrary obj…

GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2407.14108)** Bird's-eye view (BEV) semantic segmentation from multi-camera images is a core perception task in autonomous driving, but existing image-to-BEV transformation meth…

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.17288)** GaussianFlowOcc (ICCV 2025) introduces a transformative approach to 3D semantic occupancy estimation for autonomous driving by replacing traditional…

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17429)** GaussianFormer introduces a fundamentally different scene representation for 3D semantic occupancy prediction: instead of dense voxel grids, scenes are modeled as…

GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
source-summary

**[Read on arXiv](https://arxiv.org/abs/2412.04384)** GaussianFormer-2 addresses 3D semantic occupancy prediction for vision-centric autonomous driving by rethinking how 3D Gaussians represent occupied space. The origin…

GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01957)** Bird's-Eye View (BEV) perception faces a fundamental trade-off between accuracy and computational efficiency. High-performing 3D projection methods like BEVFormer…

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11447)** GaussianOcc by Gan et al. (University of Tokyo / RIKEN / South China University of Technology / SIAT-CAS) is a systematic method that applies Gaussi…

GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2412.10373)** GaussianWorld introduces a world model paradigm for 3D occupancy prediction that explicitly models scene evolution over time, rather than treating frames as indepe…

GaussRender: Learning 3D Occupancy with Gaussian Rendering
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.05040)** GaussRender by Chambon et al. (Valeo AI / Sorbonne, ICCV 2025) introduces a plug-and-play training-time module that improves 3D occupancy prediction…

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
source-summary

**[Read on arXiv](https://arxiv.org/abs/2412.13193)** GaussTR is a Gaussian-based Transformer framework that achieves zero-shot semantic occupancy prediction without any 3D annotations. The key idea is to combine sparse…

GenAD: Generalized Predictive Model for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09630)** > **Note:** This is the CVPR 2024 Highlight paper on large-scale video prediction for driving, NOT the ECCV 2024 paper wiki/sources/papers/genad-generative-end-to-…

GenAD: Generative End-to-End Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2402.11502) GenAD (ECCV 2024) reframes end-to-end autonomous driving as a generative modeling problem, simultaneously generating future trajectories for all traffic participants rat…

GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
source-summary

[Read on arXiv](https://arxiv.org/abs/2503.05689) GoalFlow (Horizon Robotics / HKU, CVPR 2025) introduces a goal-driven flow matching framework for multimodal trajectory generation in autonomous driving. The method achi…

GPT-Driver: Learning to Drive with GPT
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2310.01415)** GPT-Driver reformulates autonomous driving motion planning as a language modeling problem. Scene context (object positions, velocities, lane geometry) and ego vehi…

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2501.14729)** HERMES tackles a fundamental limitation in autonomous driving: existing systems treat 3D scene understanding and future scene generation as separate problems. Driv…

Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.06978)** Hydra-MDP addresses a fundamental limitation of imitation learning for autonomous driving: standard behavior cloning learns only to mimic human demo…

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
source-summary

[Read on arXiv](https://arxiv.org/abs/2312.03031) This paper (CVPR 2024, NVIDIA / Nanjing University) delivers a "wake-up call" to the autonomous driving research community by demonstrating that simple baselines using o…

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2310.03026)** LanguageMPC addresses a fundamental limitation in autonomous driving: traditional planners (MPC, RL) struggle with complex scenarios that require high-level reason…

LAW: Enhancing End-to-End Autonomous Driving with Latent World Model
source-summary

[Read on arXiv](https://arxiv.org/abs/2406.08481) LAW (CASIA, ICLR 2025) introduces a self-supervised latent world model that enhances end-to-end autonomous driving by learning to predict future latent states of the dri…

Learning by Cheating
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1912.12294)** Learning by Cheating introduces a two-stage training paradigm for end-to-end autonomous driving that has become one of the most influential design patterns in the…

Learning Lane Graph Representations for Motion Forecasting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2007.13732)** LaneGCN introduces a graph neural network architecture for motion forecasting in autonomous driving that operates directly on the lane graph structure of HD maps.…

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2008.05711)** Lift, Splat, Shoot (LSS) introduced a differentiable pipeline for transforming multi-camera images into a unified bird's-eye view (BEV) representation without requ…

LMDrive: Closed-Loop End-to-End Driving with Large Language Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.07488)** LMDrive is the first system to demonstrate and benchmark LLM-based driving in closed-loop simulation, introducing the LangAuto benchmark with ~64K instruction-foll…

MomAD: Momentum-Aware Planning in End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.03125)** End-to-end autonomous driving systems suffer from a critical limitation: temporal inconsistency. Current systems operate in a "one-shot" manner, making trajectory…

NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.15349)** Autonomous vehicle evaluation has long been split between two unsatisfying extremes: open-loop metrics that replay logged trajectories and compare p…

nuScenes: A Multimodal Dataset for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1903.11027)** nuScenes is a large-scale multimodal dataset for autonomous driving that provides synchronized data from 6 cameras (360-degree coverage), 1 LiDAR, 5 radars, GPS, a…

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.05316)** Vision-based 3D semantic occupancy prediction aims to predict the semantic class and occupancy status of every voxel in a 3D volume surrounding the ego vehicle, us…

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.15014)** OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc,…

OccMamba: Semantic Occupancy Prediction with State Space Models
source-summary

**[Read on arXiv](https://arxiv.org/abs/2408.09859)** OccMamba is the first Mamba-based network for semantic occupancy prediction, replacing transformer architectures' quadratic complexity with Mamba's linear complexity…

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.16038)** OccWorld introduces a generative world model that operates in 3D semantic occupancy space, jointly forecasting future scene evolution and ego vehicle trajectories.…

OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision Language Action Model
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.23463)** OpenDriveVLA introduces a Vision-Language Action model specifically designed for end-to-end autonomous driving. Unlike previous approaches that use VLMs as supplem…

ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19755)** ORION bridges the reasoning-action gap in driving VLAs through a three-component architecture consisting of QT-Former (visual encoding), an LLM reasoning core, and…

PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving
source-summary

[Read on CVF Open Access](https://openaccess.thecvf.com/content/CVPR2024/html/Weng_PARA-Drive_Parallelized_Architecture_for_Real-time_Autonomous_Driving_CVPR_2024_paper.html) PARA-Drive (NVIDIA Research / USC / Stanford…

Planning-oriented Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2212.10156)** UniAD (Unified Autonomous Driving) is a planning-oriented end-to-end framework that unifies perception, prediction, and planning into a single differentiable netwo…

Pseudo-Simulation for Autonomous Driving (NAVSIM v2)
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2506.04218)** Pseudo-Simulation by Cao, Hallgarten et al. (Tubingen / Shanghai AI Lab / NVIDIA / Stanford, CoRL 2025) introduces a novel evaluation paradigm for a…

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2412.12725)** RaCFormer by Chu et al. (USTC, CVPR 2025) addresses a fundamental problem in radar-camera fusion for 3D object detection: the image-to-BEV transform…

Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2312.03661)** Reason2Drive provides the largest reasoning chain dataset for driving (>600K video-text pairs from nuScenes, Waymo, and ONCE) and introduces an aggregated evaluati…

S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.24139)** S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-vi…

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12754)** SelfOcc (Huang et al., Tsinghua University, CVPR 2024) introduces the first self-supervised framework for vision-based 3D occupancy prediction that works with mult…

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.22313)** Two dominant paradigms exist in autonomous driving: large vision-language models (LVLMs) with strong reasoning but poor trajectory precision, and end-to-end (E2E)…

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2503.09594)** Many driving VLM efforts improve language understanding (VQA, scene descriptions) but sacrifice actual driving performance. A model can correctly answer questions…

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
source-summary

[Read on arXiv](https://arxiv.org/abs/2505.16805) SOLVE proposes a synergistic framework that combines a Vision-Language Model (VLM) reasoning branch (SOLVE-VLM) with an end-to-end (E2E) driving network (SOLVE-E2E), con…

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2405.19620)** SparseDrive by Sun et al. (ICRA 2025) proposes a paradigm shift from dense BEV-based end-to-end driving to fully sparse scene representations. The c…

SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2603.29163)** SparseDriveV2 by Sun et al. (2026) pushes the performance boundary of scoring-based trajectory planning by demonstrating that "scoring is all you ne…

SparseOcc: Fully Sparse 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.17118)** 3D occupancy prediction has become a critical perception paradigm for autonomous driving, but existing methods process dense 3D volumes even though over 90% of vox…

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.09502)** Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- in…

SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2303.09551)** SurroundOcc addresses the problem of dense 3D semantic occupancy prediction from multi-camera images for autonomous driving. Unlike 3D object detection, which repr…

Talk2Car: Taking Control of Your Self-Driving Car
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1909.10838)** For autonomous vehicles to be truly useful as personal transportation, passengers should be able to issue natural-language commands like "park behind that blue car…

Talk2Drive: Towards Personalized Autonomous Driving with Large Language Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.09397)** Talk2Drive introduces an LLM-based framework for personalized autonomous driving through natural language interaction, demonstrated in real-world field experiments…

Textual Explanations for Self-Driving Vehicles
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1807.11546)** End-to-end driving models produce control signals without any rationale, making them opaque and untrustworthy for safety-critical deployment. This paper by Kim et…

Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2305.06242)** Think Twice (Jia et al., 2023) addresses a fundamental imbalance in end-to-end autonomous driving: while the community has invested heavily in sophisticated encode…

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2205.15997)** TransFuser (Chitta et al., 2022) is a foundational paper for transformer-based sensor fusion in end-to-end autonomous driving. The key problem it addresses is how…

VAD: Vectorized Scene Representation for Efficient Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2303.12077)** VAD (Vectorized Scene Representation for Efficient Autonomous Driving) by Jiang et al. (ICCV 2023) is a pivotal paper in the shift from dense rasterized scene repr…

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2402.13243)** VADv2 by Chen et al. (2024) is the successor to VAD, addressing a fundamental limitation of deterministic planners in autonomous driving: they output a single traj…

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2005.04259)** VectorNet (Gao et al., Waymo/Google, CVPR 2020) is a foundational paper that moved motion prediction and map encoding away from rasterized image-based representati…

Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17398)** Vista (NeurIPS 2024) is a generalizable driving world model that achieves high-fidelity video prediction at 10 Hz and 576x1024 resolution with versatile multi-moda…

VLP: Vision Language Planning for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2401.05577)** VLP (Vision Language Planning) by Pan et al. (CVPR 2024) represents a fundamentally different approach to using language in autonomous driving compared to instruct…

WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01941)** End-to-end driving models typically output a single trajectory and trust it entirely, with no mechanism to evaluate whether the predicted path is safe before execu…