End-to-End Architectures
"End-to-end" is one of the most overloaded terms in autonomous driving. This page defines a clear taxonomy, traces the evolution of E2E systems, and maps the current landscape.
Taxonomy
The literature uses "end-to-end" to mean at least four distinct things. This wiki adopts the following classification:
Type 1: Direct perception-to-control
A single network maps raw sensor input directly to steering and throttle commands. No intermediate representations are exposed. NVIDIA's End To End Learning For Self Driving Cars (2016) is the canonical example: a CNN maps front-camera images to a steering angle. The approach is simple and elegant, but brittle, uninterpretable, and difficult to debug.
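A minimal PyTorch sketch of the Type 1 pattern; the layer sizes, the 66x200 input, and the single steering output are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DirectControlNet(nn.Module):
    """Type 1 sketch: raw camera image in, steering angle out, no exposed intermediates."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64, 50), nn.ReLU(), nn.Linear(50, 1))

    def forward(self, image):                    # image: (B, 3, H, W)
        return self.head(self.features(image))   # (B, 1) steering angle

steering = DirectControlNet()(torch.randn(2, 3, 66, 200))  # input resolution is illustrative
```

The brittleness noted above follows from the design itself: the only supervised signal is the final command, so there is no intermediate output to inspect when the model fails.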
Type 2: Conditional imitation learning
The network maps sensor input to actions conditioned on a high-level command (turn left, go straight, follow lane). End To End Driving Via Conditional Imitation Learning (CIL, 2018) introduced this approach, showing that simple command conditioning dramatically improves navigation capability over unconditioned direct control. The command provides a minimal interface between route planning and low-level control.
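A minimal sketch of command-conditioned branching, assuming the common "one control branch per command" design; the encoder, the four-command set, and the two-dimensional action are illustrative, not CIL's exact configuration:

```python
import torch
import torch.nn as nn

NUM_COMMANDS = 4  # e.g. follow-lane, left, right, straight (illustrative ordering)

class BranchedCIL(nn.Module):
    """Type 2 sketch: one control branch per high-level command; the command
    selects which branch produces the action."""
    def __init__(self, feat_dim=512, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
             for _ in range(NUM_COMMANDS)]
        )

    def forward(self, image, command):        # command: (B,) long tensor in [0, NUM_COMMANDS)
        feat = self.encoder(image)
        outs = torch.stack([branch(feat) for branch in self.branches], dim=1)  # (B, C, action_dim)
        idx = command.view(-1, 1, 1).expand(-1, 1, outs.size(-1))
        return outs.gather(1, idx).squeeze(1)  # action from the active branch only

actions = BranchedCIL()(torch.randn(2, 3, 66, 200), torch.tensor([0, 2]))  # (2, 2), e.g. steer + throttle
```

During training only the branch matching the logged command receives gradient, which is what lets one network represent several command-dependent behaviors without the command signal being averaged away.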
Type 3: Jointly trained modular systems
The system preserves interpretable intermediate representations (3D detections, agent futures, BEV maps) but trains all modules jointly through a shared loss. Planning Oriented Autonomous Driving (UniAD, 2023) is the landmark example: it maintains explicit perception, prediction, and planning stages but optimizes them jointly with a planning-centric objective. Vad Vectorized Scene Representation For Efficient Autonomous Driving (VAD) follows the same philosophy, using vectorized rather than rasterized scene representations for efficiency.
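The defining mechanism of Type 3 is a single objective spanning interpretable heads. A schematic sketch follows; the task heads, loss forms, and weights are illustrative, and UniAD's actual per-module losses are considerably more involved:

```python
import torch
import torch.nn.functional as F

def joint_planning_centric_loss(outputs, targets, w_det=1.0, w_pred=1.0, w_plan=2.0):
    """Type 3 sketch: one scalar loss over explicit intermediate outputs, with the
    planning term weighted up so the whole stack is optimized for the final plan."""
    det_loss  = F.l1_loss(outputs["boxes"],    targets["boxes"])     # perception head
    pred_loss = F.l1_loss(outputs["futures"],  targets["futures"])   # agent-future head
    plan_loss = F.l1_loss(outputs["ego_traj"], targets["ego_traj"])  # ego planning head
    return w_det * det_loss + w_pred * pred_loss + w_plan * plan_loss
```

Because every head stays in the graph, failures can still be localized to a module, which is the interpretability advantage the trade-off table below refers to.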
Type 4: VLA / foundation model systems
A large pretrained model (typically a VLM) processes sensor input and produces driving actions, often as tokens. The boundary between perception, prediction, and planning is dissolved into the model's internal representations. Emma End To End Multimodal Model For Autonomous Driving (EMMA) is the purest example, treating all outputs including trajectories as language tokens.
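One way to see "actions as tokens" concretely: the planned trajectory is serialized into text the model can generate and a parser can read back. A minimal sketch, with a format that is illustrative rather than EMMA's actual prompt or output schema:

```python
def waypoints_to_text(waypoints):
    """Serialize future ego waypoints so a VLM can emit them as ordinary language tokens."""
    return "; ".join(f"({x:.2f}, {y:.2f})" for x, y in waypoints)

def text_to_waypoints(text):
    """Parse generated text back into (x, y) waypoints in the ego frame."""
    pairs = [p.strip(" ()") for p in text.split(";") if p.strip()]
    return [tuple(float(v) for v in p.split(",")) for p in pairs]

encoded = waypoints_to_text([(1.2, 0.0), (2.5, 0.1), (3.9, 0.3)])  # "(1.20, 0.00); (2.50, 0.10); ..."
assert text_to_waypoints(encoded) == [(1.2, 0.0), (2.5, 0.1), (3.9, 0.3)]
```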
Historical evolution
The imitation learning era (2016--2020)
End To End Learning For Self Driving Cars demonstrated that CNNs can learn to steer from camera images alone. End To End Driving Via Conditional Imitation Learning added command conditioning. Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst introduced data augmentation for robust imitation. Learning By Cheating established the privileged distillation paradigm that became standard on the Carla An Open Urban Driving Simulator benchmark.
A key lesson from this era: naive behavior cloning suffers from distributional shift. The learner drifts into states absent from the training distribution, and its errors compound. DAgger-style approaches, data augmentation (ChauffeurNet), and privileged distillation (Learning by Cheating) are all responses to this fundamental problem.
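For concreteness, a schematic DAgger-style loop (the env/expert/policy interfaces are placeholders, not any particular implementation): the learner drives, so it visits its own state distribution, but the labels come from the expert, which is what breaks the compounding-error cycle.

```python
def dagger(env, expert, policy, dataset, n_iters=5):
    """Schematic DAgger loop: roll out the current learner, relabel the visited
    states with expert actions, aggregate, and retrain on everything seen so far."""
    for _ in range(n_iters):
        state = env.reset()
        for _ in range(env.horizon):
            action = policy.act(state)                  # the learner's own action drives the rollout
            dataset.append((state, expert.act(state)))  # but supervision comes from the expert
            state = env.step(action)
        policy.fit(dataset)                             # retrain on the aggregated dataset
    return policy
```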
The joint training era (2022--2024)
Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving (TransFuser) showed that transformer-based fusion of camera and LiDAR features enables effective end-to-end driving in CARLA. Planning Oriented Autonomous Driving (UniAD) demonstrated that joint training with interpretable intermediate supervision outperforms both fully modular and fully black-box approaches. Vad Vectorized Scene Representation For Efficient Autonomous Driving (VAD) showed the same approach works efficiently with vectorized representations.
The VLA era (2024--present)
The current wave applies foundation models to driving. Key architectural variants:
- Unified token models: Emma End To End Multimodal Model For Autonomous Driving encodes everything (perception queries, trajectory waypoints, scene descriptions) as tokens in a single VLM.
- Decoupled reasoning + planning: Senna Bridging Large Vision Language Models And End To End Autonomous Driving separates VLM reasoning from a lightweight E2E planner, preserving interpretability (see the sketch after this list).
- Language-instructed action: Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation uses planning tokens to bridge VLM understanding and continuous action generation.
- Language as runtime interface: Lmdrive Closed Loop End To End Driving With Large Language Models accepts natural language navigation instructions at runtime. Simlingo Vision Only Closed Loop Autonomous Driving With Language Action Alignment aligns language and action representations for vision-only driving.
- Adaptive reasoning VLA: Autovla Vision Language Action Model For End To End Autonomous Driving dynamically switches between fast (direct action) and slow (chain-of-thought) reasoning based on scenario complexity, with RL fine-tuning.
- Parallel task transformer: Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving replaces sequential pipelines with parallel task processing through shared attention, achieving SOTA closed-loop performance with sparse queries.
- 3D-grounded VLA: Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model integrates hierarchical 3D scene queries (global, agent, map) into an LLM backbone, achieving SOTA at 0.5B scale.
- Distilled VLA: Dima Distilling Multi Modal Large Language Models For Autonomous Driving jointly trains a vision planner with an MLLM, then discards the LLM at inference -- achieving 80% collision reduction with zero inference overhead.
- Explanation-oriented: Drivegpt4 Interpretable End To End Autonomous Driving Via Large Language Model and Gpt Driver Learning To Drive With Gpt use LLMs primarily for generating interpretable driving explanations and plans.
- Structured reasoning: Drivelm Driving With Graph Visual Question Answering uses graph-structured QA to decompose driving into interpretable reasoning chains. Reason2Drive Towards Interpretable And Chain Based Reasoning For Autonomous Driving applies chain-of-thought reasoning to driving decisions.
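To make the decoupled reasoning + planning pattern from the Senna bullet above concrete, here is a schematic interface sketch; the meta-action set and the vlm/planner objects are illustrative placeholders, not Senna's actual interface. The VLM makes a low-frequency, human-readable decision, and a lightweight planner converts it into a trajectory.

```python
META_ACTIONS = ("keep_lane", "change_left", "change_right", "yield", "stop")  # illustrative set

def drive_step(vlm, planner, camera_images, route_hint):
    """Decoupled sketch: the VLM emits a discrete meta-action at low frequency;
    a lightweight E2E planner turns it into a continuous trajectory at control rate.
    Both interfaces are placeholders, not Senna's actual API."""
    meta_action = vlm.decide(camera_images, route_hint)    # e.g. "change_left"
    assert meta_action in META_ACTIONS
    trajectory = planner.plan(camera_images, meta_action)  # e.g. an array of (x, y) waypoints
    return meta_action, trajectory                         # the meta-action doubles as an explanation
```

The interpretability and latency arguments for decoupling both live at this boundary: the discrete decision can be logged and audited, and only the small planner has to run at control frequency.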
Design trade-offs
| Trade-off | Type 1-2 (Direct/CIL) | Type 3 (Joint modular) | Type 4 (VLA) |
|---|---|---|---|
| Interpretability | None / minimal | High (explicit intermediates) | Variable (depends on architecture) |
| Debugging | Difficult | Module-level | Difficult |
| Data efficiency | Low | Moderate | High (pretrained backbone) |
| Benchmark performance | Moderate | Strong | State-of-the-art |
| Real-time capable | Yes | Yes | Challenging (large models) |
| Safety certification | Very difficult | Tractable | Very difficult |
Present state and open problems
- Unified vs. decoupled: Whether fully unified systems (EMMA) or decoupled reasoning + planning (Senna) will dominate is the field's central architectural question. Unified is simpler; decoupled is more interpretable and potentially safer.
- Intermediate supervision: Type 3 systems use explicit intermediate supervision; Type 4 systems largely do not. Whether intermediate supervision is a necessary scaffold or an unnecessary constraint is debated.
- Closed-loop competence: Many E2E systems are evaluated only open-loop. The gap between open-loop metrics and closed-loop driving competence remains large and poorly understood.
- Latency: VLA models with billions of parameters struggle to meet real-time requirements. Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving demonstrates 99ms inference but required significant engineering. Dima Distilling Multi Modal Large Language Models For Autonomous Driving offers an alternative: distill and discard the LLM entirely.
- Temporal consistency: Momad Momentum Aware Planning In End To End Autonomous Driving shows that E2E planners suffer from temporal inconsistency, producing jittery trajectories. Momentum-aware planning addresses this with trajectory and perception momentum.
- Safety verification: E2E systems resist formal verification. Combining learned E2E planners with verifiable safety layers is an active area. Wote End To End Driving With Online Trajectory Evaluation Via Bev World Model offers one approach through world-model-based trajectory checking.
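As an illustration of the "learned planner plus verifiable safety layer" idea in the last bullet, a schematic trajectory-checking wrapper (placeholder interfaces, not WoTE's actual method): candidate trajectories are scored by a learned world model, and any candidate predicted to collide is rejected before execution.

```python
def select_safe_trajectory(candidates, world_model, scene, fallback):
    """Schematic safety layer: roll each candidate forward in a learned world model,
    drop candidates predicted to collide, and execute the best surviving one.
    If nothing survives, fall back to a conservative maneuver (e.g. comfort braking)."""
    scored = [(world_model.score(scene, traj), traj) for traj in candidates
              if not world_model.predicts_collision(scene, traj)]
    if not scored:
        return fallback
    return max(scored, key=lambda pair: pair[0])[1]
```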