
End-to-End Architectures

"End-to-end" is one of the most overloaded terms in autonomous driving. This page defines a clear taxonomy, traces the evolution of E2E systems, and maps the current landscape.

Taxonomy

The literature uses "end-to-end" to mean at least four distinct things. This wiki adopts the following classification:

Type 1: Direct perception-to-control

A single network maps raw sensor input directly to steering and throttle commands. No intermediate representations are exposed. The original NVIDIA self-driving paper End To End Learning For Self Driving Cars (2016) is the canonical example: a CNN maps front-camera images to steering angle. Simple and elegant but brittle, uninterpretable, and difficult to debug.
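
A minimal sketch of this pattern in PyTorch; the layer sizes below are illustrative, not the exact PilotNet configuration:

```python
# Type 1 sketch: raw front-camera image in, scalar steering command out.
# Layer sizes are illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class DirectSteeringCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 50), nn.ReLU(),
            nn.Linear(50, 1),                    # single output: steering angle
        )

    def forward(self, image):                    # image: (B, 3, H, W)
        return self.head(self.features(image))
```

Training is plain behavior cloning: an L1 or L2 loss between the predicted and logged human steering angle.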

Type 2: Conditional imitation learning

The network maps sensor input to actions conditioned on a high-level command (turn left, go straight, follow lane). End To End Driving Via Conditional Imitation Learning (CIL, 2018) introduced this approach, showing that simple command conditioning dramatically improves navigation capability over unconditioned direct control. The command provides a minimal interface between route planning and low-level control.
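
A sketch of command-conditioned control with one action head ("branch") per high-level command, where the command index selects which branch's output is used; branch count, feature size, and action dimension are illustrative assumptions:

```python
# CIL-style branching sketch: the high-level command routes the shared
# image feature to one of several command-specific action heads.
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    def __init__(self, feat_dim=512, n_commands=4, action_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in image encoder
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim), nn.ReLU(),
        )
        # One branch per command, e.g. turn-left, turn-right, straight, follow-lane.
        self.branches = nn.ModuleList(
            [nn.Linear(feat_dim, action_dim) for _ in range(n_commands)]
        )

    def forward(self, image, command):           # command: (B,) int64 command ids
        feat = self.backbone(image)
        all_actions = torch.stack([b(feat) for b in self.branches], dim=1)  # (B, C, A)
        idx = command.view(-1, 1, 1).expand(-1, 1, all_actions.size(-1))
        return all_actions.gather(1, idx).squeeze(1)  # action of the commanded branch
```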

Type 3: Jointly trained modular systems

The system preserves interpretable intermediate representations (3D detections, agent futures, BEV maps) but trains all modules jointly through a shared loss. Planning Oriented Autonomous Driving (UniAD, 2023) is the landmark example: it maintains explicit perception, prediction, and planning stages but optimizes them jointly with a planning-centric objective. Vad Vectorized Scene Representation For Efficient Autonomous Driving (VAD) follows the same philosophy with vectorized representations.
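
The defining trait is that one loss touches every module. A simplified sketch of such a shared objective follows; the per-module losses and weights are stand-ins, not the actual UniAD configuration:

```python
# Type 3 sketch: intermediate outputs stay explicit, but a single weighted
# loss is backpropagated through detection, prediction, and planning at once.
import torch
import torch.nn.functional as F

def joint_loss(outputs, targets, w_det=1.0, w_pred=1.0, w_plan=2.0):
    # Stand-in regression losses; real systems use matching-based detection
    # losses and multi-modal prediction losses.
    l_det = F.l1_loss(outputs["boxes"], targets["boxes"])              # 3D detections
    l_pred = F.l1_loss(outputs["agent_futures"], targets["agent_futures"])
    l_plan = F.l1_loss(outputs["ego_traj"], targets["ego_traj"])       # ego waypoints
    # Weighting the planning term highest keeps optimization planning-oriented
    # while the intermediate terms act as interpretable supervision.
    return w_det * l_det + w_pred * l_pred + w_plan * l_plan
```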

Type 4: VLA / foundation model systems

A large pretrained model (typically a VLM) processes sensor input and produces driving actions, often as tokens. The boundary between perception, prediction, and planning is dissolved into the model's internal representations. Emma End To End Multimodal Model For Autonomous Driving (EMMA) is the purest example, treating all outputs including trajectories as language tokens.
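
One way to make a trajectory token-compatible is to quantize waypoint coordinates into a fixed vocabulary, sketched below; EMMA itself emits outputs as text, so the bin count and coordinate range here are illustrative assumptions rather than its actual representation:

```python
# Sketch: quantize metric waypoints into discrete token ids so a language-model
# head can emit a trajectory token by token. N_BINS and XY_RANGE are assumed.
import numpy as np

N_BINS = 256        # vocabulary entries per coordinate axis
XY_RANGE = 50.0     # metres covered in each direction around the ego vehicle

def waypoint_to_tokens(xy):
    """Map one (x, y) waypoint in metres to two integer token ids."""
    bins = np.clip((xy + XY_RANGE) / (2 * XY_RANGE) * N_BINS, 0, N_BINS - 1)
    return bins.astype(int)

def tokens_to_waypoint(tokens):
    """Inverse mapping back to metric coordinates (bin centres)."""
    return (tokens + 0.5) / N_BINS * (2 * XY_RANGE) - XY_RANGE

traj = np.array([[1.2, 0.1], [2.5, 0.3], [3.9, 0.6]])               # future waypoints (m)
token_seq = np.concatenate([waypoint_to_tokens(p) for p in traj])   # flat token sequence
```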

Historical evolution

The imitation learning era (2016--2020)

End To End Learning For Self Driving Cars demonstrated that CNNs can learn to steer from camera images alone. End To End Driving Via Conditional Imitation Learning added command conditioning. Chauffeurnet Learning To Drive By Imitating The Best And Synthesizing The Worst introduced data augmentation for robust imitation. Learning By Cheating established the privileged distillation paradigm that became standard on the Carla An Open Urban Driving Simulator benchmark.

A key lesson from this era: naive behavior cloning suffers from distributional drift. The agent encounters states not in the training distribution and compounds errors. DAgger-style approaches, data augmentation (ChauffeurNet), and privileged distillation (Learning by Cheating) are all responses to this fundamental problem.
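
A DAgger-style loop in outline, assuming placeholder interfaces for the simulator (`env`), a privileged expert (`expert`), and a supervised training routine (`train`); none of these names come from a specific codebase:

```python
# DAgger sketch: the learner drives, the expert relabels every visited state,
# and behavior cloning is re-run on the aggregated dataset each round, so the
# training distribution tracks the learner's own rollouts instead of drifting.
def dagger(policy, expert, env, rounds=5, horizon=1000):
    dataset = []
    for _ in range(rounds):
        obs = env.reset()
        for _ in range(horizon):
            dataset.append((obs, expert.act(obs)))   # expert label for learner-visited state
            obs, done = env.step(policy.act(obs))    # but the *learner* keeps driving
            if done:
                break
        policy = train(policy, dataset)              # supervised fit on all data so far
    return policy
```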

The joint training era (2022--2024)

Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving (TransFuser) showed that transformer-based fusion of camera and LiDAR features enables effective end-to-end driving in CARLA. Planning Oriented Autonomous Driving (UniAD) demonstrated that joint training with interpretable intermediate supervision outperforms both fully modular and fully black-box approaches. Vad Vectorized Scene Representation For Efficient Autonomous Driving (VAD) showed the same approach works efficiently with vectorized representations.

The VLA era (2024--present)

The current wave applies foundation models to driving. Key architectural variants include fully unified models that emit all outputs as tokens (EMMA), decoupled designs that pair VLM reasoning with a conventional E2E planner (Senna), and distillation approaches that transfer MLLM knowledge into a compact planner so the LLM can be dropped at inference time (DiMA).

Design trade-offs

Trade-off | Type 1-2 (Direct/CIL) | Type 3 (Joint modular) | Type 4 (VLA)
Interpretability | None / minimal | High (explicit intermediates) | Variable (depends on architecture)
Debugging | Difficult | Module-level | Difficult
Data efficiency | Low | Moderate | High (pretrained backbone)
Benchmark performance | Moderate | Strong | State-of-the-art
Real-time capable | Yes | Yes | Challenging (large models)
Safety certification | Very difficult | Tractable | Very difficult

Present state and open problems

  • Unified vs. decoupled: Whether fully unified systems (EMMA) or decoupled reasoning + planning (Senna) will dominate is the field's central architectural question. Unified is simpler; decoupled is more interpretable and potentially safer.
  • Intermediate supervision: Type 3 systems use explicit intermediate supervision; Type 4 systems largely do not. Whether intermediate supervision is a necessary scaffold or an unnecessary constraint is debated.
  • Closed-loop competence: Many E2E systems are evaluated only open-loop. The gap between open-loop metrics and closed-loop driving competence remains large and poorly understood.
  • Latency: VLA models with billions of parameters struggle to meet real-time requirements. Alpamayo R1 Bridging Reasoning And Action Prediction For Autonomous Driving demonstrates 99 ms inference, though this required significant engineering. Dima Distilling Multi Modal Large Language Models For Autonomous Driving offers an alternative: distill and then discard the LLM entirely.
  • Temporal consistency: Momad Momentum Aware Planning In End To End Autonomous Driving shows that E2E planners suffer from temporal inconsistency, producing jittery trajectories. Momentum-aware planning addresses this with trajectory and perception momentum.
  • Safety verification: E2E systems resist formal verification. Combining learned E2E planners with verifiable safety layers is an active area. Wote End To End Driving With Online Trajectory Evaluation Via Bev World Model offers one approach through world-model-based trajectory checking.

Key papers

Paper | Contribution
End To End Learning For Self Driving Cars | Direct perception-to-steering CNN
End To End Driving Via Conditional Imitation Learning | Command-conditioned imitation learning
Learning By Cheating | Privileged distillation paradigm
Planning Oriented Autonomous Driving | UniAD: jointly trained modular E2E
Vad Vectorized Scene Representation For Efficient Autonomous Driving | Vectorized joint E2E
Vadv2 End To End Vectorized Autonomous Driving Via Probabilistic Planning | Probabilistic vectorized E2E with action vocabulary
Transfuser Imitation With Transformer Based Sensor Fusion For Autonomous Driving | Transformer sensor fusion for E2E
Emma End To End Multimodal Model For Autonomous Driving | Everything-as-tokens VLA
Senna Bridging Large Vision Language Models And End To End Autonomous Driving | Decoupled VLM reasoning + E2E planning
Orion Holistic End To End Autonomous Driving By Vision Language Instructed Action Generation | Vision-language-instructed action generation
Autovala Vision Language Action Model For End To End Autonomous Driving | Adaptive dual-process VLA with RL
Drivetransformer Unified Transformer For Scalable End To End Autonomous Driving | Parallel-task sparse transformer E2E
S4 Driver Scalable Self Supervised Driving Mllm With Spatio Temporal Visual Representation | Self-supervised MLLM E2E without annotations
Bridgead Bridging Past And Future End To End Autonomous Driving With Historical Prediction | History-enhanced jointly trained E2E (Type 3)
Drive Occworld Driving In The Occupancy World | World-model-augmented E2E with occupancy planning
Opendrivevla Towards End To End Autonomous Driving With Large Vision Language Action Model | 3D-grounded open-source VLA
Dima Distilling Multi Modal Large Language Models For Autonomous Driving | MLLM distillation for efficient E2E
Momad Momentum Aware Planning In End To End Autonomous Driving | Momentum-aware temporal consistency
Sparsedrive End To End Autonomous Driving Via Sparse Scene Representation | Fully sparse E2E with parallel prediction-planning
Sparsedrivev2 End To End Autonomous Driving Via Sparse Scene Representation | Factorized trajectory vocabulary scoring, 92.0 PDMS
Navsim V2 Pseudo Simulation For Autonomous Driving | Pseudo-simulation evaluation benchmark (CoRL 2025)
Think Twice Before Driving Towards Scalable Decoders For End To End Autonomous Driving | Scalable cascaded decoder for E2E, decoder depth as scaling axis
Driveadapter Breaking The Coupling Barrier Of Perception And Planning In End To End Autonomous Driving | Decoupled perception-planning via adapter, plug-and-play modularity
Hydra Mdp End To End Multimodal Planning With Multi Target Hydra Distillation | Multi-target distillation with vocabulary-based planning, NAVSIM winner