End-to-End Architectures
"End-to-end" is one of the most overloaded terms in autonomous driving. This page defines a clear taxonomy, traces the evolution of E2E systems, and maps the current landscape.
Taxonomy
The literature uses "end-to-end" to mean at least four distinct things. This wiki adopts the following classification:
Type 1: Direct perception-to-control
A single network maps raw sensor input directly to steering and throttle commands. No intermediate representations are exposed. NVIDIA's End To End Learning For Self Driving Cars (2016) is the canonical example: a CNN maps front-camera images to a steering angle. The approach is simple and elegant, but brittle, uninterpretable, and difficult to debug.
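A minimal PyTorch sketch of the Type 1 pattern; the layer sizes, the 66x200 input, and the single steering output are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DirectControlNet(nn.Module):
    """Type 1 sketch: raw camera image in, steering angle out, no exposed intermediates."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64, 50), nn.ReLU(), nn.Linear(50, 1))

    def forward(self, image):                    # image: (B, 3, H, W)
        return self.head(self.features(image))   # (B, 1) steering angle

steering = DirectControlNet()(torch.randn(2, 3, 66, 200))  # input resolution is illustrative
```

The brittleness noted above follows from the design itself: the only supervised signal is the final command, so there is no intermediate output to inspect when the model fails.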
Type 2: Conditional imitation learning
The network maps sensor input to actions conditioned on a high-level command (turn left, go straight, follow lane). End To End Driving Via Conditional Imitation Learning (CIL, 2018) introduced this approach, showing that simple command conditioning dramatically improves navigation capability over unconditioned direct control. The command provides a minimal interface between route planning and low-level control.
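A minimal sketch of command-conditioned branching, assuming the common "one control branch per command" design; the encoder, the four-command set, and the two-dimensional action are illustrative, not CIL's exact configuration:

```python
import torch
import torch.nn as nn

NUM_COMMANDS = 4  # e.g. follow-lane, left, right, straight (illustrative ordering)

class BranchedCIL(nn.Module):
    """Type 2 sketch: one control branch per high-level command; the command
    selects which branch produces the action."""
    def __init__(self, feat_dim=512, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
             for _ in range(NUM_COMMANDS)]
        )

    def forward(self, image, command):        # command: (B,) long tensor in [0, NUM_COMMANDS)
        feat = self.encoder(image)
        outs = torch.stack([branch(feat) for branch in self.branches], dim=1)  # (B, C, action_dim)
        idx = command.view(-1, 1, 1).expand(-1, 1, outs.size(-1))
        return outs.gather(1, idx).squeeze(1)  # action from the active branch only

actions = BranchedCIL()(torch.randn(2, 3, 66, 200), torch.tensor([0, 2]))  # (2, 2), e.g. steer + throttle
```

During training only the branch matching the logged command receives gradient, which is what lets one network represent several command-dependent behaviors without the command signal being averaged away.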
Type 3: Jointly trained modular systems
The system preserves interpretable intermediate representations (3D detections, agent futures, BEV maps) but trains all modules jointly through a shared loss. Planning Oriented Autonomous Driving (UniAD, 2023) is the landmark example: it maintains explicit perception, prediction, and planning stages but optimizes them jointly with a planning-centric objective. Vad Vectorized Scene Representation For Efficient Autonomous Driving (VAD) follows the same philosophy, using vectorized rather than rasterized scene representations for efficiency.
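The defining mechanism of Type 3 is a single objective spanning interpretable heads. A schematic sketch follows; the task heads, loss forms, and weights are illustrative, and UniAD's actual per-module losses are considerably more involved:

```python
import torch
import torch.nn.functional as F

def joint_planning_centric_loss(outputs, targets, w_det=1.0, w_pred=1.0, w_plan=2.0):
    """Type 3 sketch: one scalar loss over explicit intermediate outputs, with the
    planning term weighted up so the whole stack is optimized for the final plan."""
    det_loss  = F.l1_loss(outputs["boxes"],    targets["boxes"])     # perception head
    pred_loss = F.l1_loss(outputs["futures"],  targets["futures"])   # agent-future head
    plan_loss = F.l1_loss(outputs["ego_traj"], targets["ego_traj"])  # ego planning head
    return w_det * det_loss + w_pred * pred_loss + w_plan * plan_loss
```

Because every head stays in the graph, failures can still be localized to a module, which is the interpretability advantage the trade-off table below refers to.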
Type 4: VLA / foundation model systems
A large pretrained model (typically a VLM) processes sensor input and produces driving actions, often as tokens. The boundary between perception, prediction, and planning is dissolved into the model's internal representations. Emma End To End Multimodal Model For Autonomous Driving (EMMA) is the purest example, treating all outputs including trajectories as language tokens.
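One way to see "actions as tokens" concretely: the planned trajectory is serialized into text the model can generate and a parser can read back. A minimal sketch, with a format that is illustrative rather than EMMA's actual prompt or output schema:

```python
def waypoints_to_text(waypoints):
    """Serialize future ego waypoints so a VLM can emit them as ordinary language tokens."""
    return "; ".join(f"({x:.2f}, {y:.2f})" for x, y in waypoints)

def text_to_waypoints(text):
    """Parse generated text back into (x, y) waypoints in the ego frame."""
    pairs = [p.strip(" ()") for p in text.split(";") if p.strip()]
    return [tuple(float(v) for v in p.split(",")) for p in pairs]

encoded = waypoints_to_text([(1.2, 0.0), (2.5, 0.1), (3.9, 0.3)])  # "(1.20, 0.00); (2.50, 0.10); ..."
assert text_to_waypoints(encoded) == [(1.2, 0.0), (2.5, 0.1), (3.9, 0.3)]
```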
Historical evolution
The imitation learning era (2016--2020)
End To End Learning For Self Driving Cars demonstrated that CNNs can learn to steer from camera images alone. End To End Driving Via Conditional Imitation Learning added command conditioning. Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst introduced data augmentation for robust imitation. Learning By Cheating established the privileged distillation paradigm that became standard on the Carla An Open Urban Driving Simulator benchmark.
A key lesson from this era: naive behavior cloning suffers from distributional shift. The learner drifts into states absent from the training distribution, and its errors compound. DAgger-style approaches, data augmentation (ChauffeurNet), and privileged distillation (Learning by Cheating) are all responses to this fundamental problem.
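For concreteness, a schematic DAgger-style loop (the env/expert/policy interfaces are placeholders, not any particular implementation): the learner drives, so it visits its own state distribution, but the labels come from the expert, which is what breaks the compounding-error cycle.

```python
def dagger(env, expert, policy, dataset, n_iters=5):
    """Schematic DAgger loop: roll out the current learner, relabel the visited
    states with expert actions, aggregate, and retrain on everything seen so far."""
    for _ in range(n_iters):
        state = env.reset()
        for _ in range(env.horizon):
            action = policy.act(state)                  # the learner's own action drives the rollout
            dataset.append((state, expert.act(state)))  # but supervision comes from the expert
            state = env.step(action)
        policy.fit(dataset)                             # retrain on the aggregated dataset
    return policy
```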
The joint training era (2022--2024)
Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving (TransFuser) showed that transformer-based fusion of camera and LiDAR features enables effective end-to-end driving in CARLA. Planning Oriented Autonomous Driving (UniAD) demonstrated that joint training with interpretable intermediate supervision outperforms both fully modular and fully black-box approaches. Vad Vectorized Scene Representation For Efficient Autonomous Driving (VAD) showed the same approach works efficiently with vectorized representations.
The VLA era (2024--present)
The current wave applies foundation models to driving. Key architectural variants:
- Unified token models: Emma End To End Multimodal Model For Autonomous Driving encodes everything (perception queries, trajectory waypoints, scene descriptions) as tokens in a single VLM.
- Decoupled reasoning + planning: Senna Bridging Large Vision Language Models And End To End Autonomous Driving separates VLM reasoning from a lightweight E2E planner, preserving interpretability (see the sketch after this list).
- Language-instructed action: Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation uses planning tokens to bridge VLM understanding and continuous action generation.
- Language as runtime interface: Lmdrive Closed Loop End To End Driving With Large Language Models accepts natural language navigation instructions at runtime. Simlingo Vision Only Closed Loop Autonomous Driving With Language Action Alignment aligns language and action representations for vision-only driving.
- Adaptive reasoning VLA: Autovla Vision Language Action Model For End To End Autonomous Driving dynamically switches between fast (direct action) and slow (chain-of-thought) reasoning based on scenario complexity, with RL fine-tuning.
- Parallel task transformer: Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving replaces sequential pipelines with parallel task processing through shared attention, achieving SOTA closed-loop performance with sparse queries.
- 3D-grounded VLA: Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model integrates hierarchical 3D scene queries (global, agent, map) into an LLM backbone, achieving SOTA at 0.5B scale.
- Distilled VLA: Dima Distilling Multi Modal Large Language Models For Autonomous Driving jointly trains a vision planner with an MLLM, then discards the LLM at inference -- achieving 80% collision reduction with zero inference overhead.
- Explanation-oriented: Drivegpt4 Interpretable End To End Autonomous Driving Via Large Language Model and Gpt Driver Learning To Drive With Gpt use LLMs primarily for generating interpretable driving explanations and plans.
- Structured reasoning: Drivelm Driving With Graph Visual Question Answering uses graph-structured QA to decompose driving into interpretable reasoning chains. Reason2Drive Towards Interpretable And Chain Based Reasoning For Autonomous Driving applies chain-of-thought reasoning to driving decisions.
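To make the decoupled reasoning + planning pattern from the Senna bullet above concrete, here is a schematic interface sketch; the meta-action set and the vlm/planner objects are illustrative placeholders, not Senna's actual interface. The VLM makes a low-frequency, human-readable decision, and a lightweight planner converts it into a trajectory.

```python
META_ACTIONS = ("keep_lane", "change_left", "change_right", "yield", "stop")  # illustrative set

def drive_step(vlm, planner, camera_images, route_hint):
    """Decoupled sketch: the VLM emits a discrete meta-action at low frequency;
    a lightweight E2E planner turns it into a continuous trajectory at control rate.
    Both interfaces are placeholders, not Senna's actual API."""
    meta_action = vlm.decide(camera_images, route_hint)    # e.g. "change_left"
    assert meta_action in META_ACTIONS
    trajectory = planner.plan(camera_images, meta_action)  # e.g. an array of (x, y) waypoints
    return meta_action, trajectory                         # the meta-action doubles as an explanation
```

The interpretability and latency arguments for decoupling both live at this boundary: the discrete decision can be logged and audited, and only the small planner has to run at control frequency.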
Design trade-offs
| Trade-off | Type 1-2 (Direct/CIL) | Type 3 (Joint modular) | Type 4 (VLA) |
|---|---|---|---|
| Interpretability | None / minimal | High (explicit intermediates) | Variable (depends on architecture) |
| Debugging | Difficult | Module-level | Difficult |
| Data efficiency | Low | Moderate | High (pretrained backbone) |
| Benchmark performance | Moderate | Strong | State-of-the-art |
| Real-time capable | Yes | Yes | Challenging (large models) |
| Safety certification | Very difficult | Tractable | Very difficult |
Present state and open problems
- Unified vs. decoupled: Whether fully unified systems (EMMA) or decoupled reasoning + planning (Senna) will dominate is the field's central architectural question. Unified is simpler; decoupled is more interpretable and potentially safer.
- Intermediate supervision: Type 3 systems use explicit intermediate supervision; Type 4 systems largely do not. Whether intermediate supervision is a necessary scaffold or an unnecessary constraint is debated.
- Closed-loop competence: Many E2E systems are evaluated only open-loop. The gap between open-loop metrics and closed-loop driving competence remains large and poorly understood.
- Latency: VLA models with billions of parameters struggle to meet real-time requirements. Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving demonstrates 99ms inference but required significant engineering. Dima Distilling Multi Modal Large Language Models For Autonomous Driving offers an alternative: distill and discard the LLM entirely.
- Temporal consistency: Momad Momentum Aware Planning In End To End Autonomous Driving shows that E2E planners suffer from temporal inconsistency, producing jittery trajectories. Momentum-aware planning addresses this with trajectory and perception momentum.
- Safety verification: E2E systems resist formal verification. Combining learned E2E planners with verifiable safety layers is an active area. Wote End To End Driving With Online Trajectory Evaluation Via Bev World Model offers one approach through world-model-based trajectory checking.
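As an illustration of the "learned planner plus verifiable safety layer" idea in the last bullet, a schematic trajectory-checking wrapper (placeholder interfaces, not WoTE's actual method): candidate trajectories are scored by a learned world model, and any candidate predicted to collide is rejected before execution.

```python
def select_safe_trajectory(candidates, world_model, scene, fallback):
    """Schematic safety layer: roll each candidate forward in a learned world model,
    drop candidates predicted to collide, and execute the best surviving one.
    If nothing survives, fall back to a conservative maneuver (e.g. comfort braking)."""
    scored = [(world_model.score(scene, traj), traj) for traj in candidates
              if not world_model.predicts_collision(scene, traj)]
    if not scored:
        return fallback
    return max(scored, key=lambda pair: pair[0])[1]
```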