End-to-end Driving via Conditional Imitation Learning
Overview
This paper introduces conditional imitation learning for end-to-end autonomous driving, where a neural network policy is conditioned on a discrete high-level command (turn left, turn right, go straight, follow lane) to resolve the fundamental multimodal action ambiguity at intersections. The same visual scene at an intersection can correspond to three or more correct actions depending on navigation intent, and without conditioning on intent, a learned policy will average these modes and produce erratic or incorrect behavior.
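The averaging failure can be made concrete with a toy calculation. The sketch below assumes a squared-error imitation loss and invented steering labels (-0.5 for a left turn, +0.5 for a right turn) recorded for one and the same observation; the numbers and helper names are illustrative and do not come from the paper.

```python
# Toy illustration with hypothetical numbers: one intersection image, two valid expert actions.
from statistics import mean

# Steering labels recorded for the *same* observation under different intents.
demos = [
    ("left", -0.5),
    ("right", 0.5),
    ("left", -0.5),
    ("right", 0.5),
]

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return mean((prediction - t) ** 2 for t in targets)

# An unconditional policy sees identical input, so it can only emit one value.
all_targets = [steer for _, steer in demos]
unconditional = mean(all_targets)  # the constant that minimizes squared error is the mean

# A command-conditioned policy may emit one value per command.
conditional = {
    cmd: mean(s for c, s in demos if c == cmd) for cmd in ("left", "right")
}

print(unconditional)                    # 0.0: neither a left nor a right turn
print(mse(unconditional, all_targets))  # 0.25: the best achievable loss is still large
print(conditional)                      # {'left': -0.5, 'right': 0.5}
```

The unconditional optimum sits between the two modes and matches no demonstration, whereas the conditioned predictor fits each mode exactly.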
The key insight is that perception alone is insufficient for determining the correct action at decision points. The model takes three inputs: a front-facing camera image, low-dimensional measurements (primarily vehicle speed), and a categorical high-level command. It outputs a continuous 2D control vector: steering angle and acceleration (combined throttle/brake). By adding command conditioning, the policy becomes controllable and responsive to navigation intent. The paper introduces a branched architecture with separate fully-connected heads per command, gated by command selection, which outperforms the naive approach of concatenating the command with visual features. This conditional policy formulation -- F(i, m, c) = A_c(J(i, m)) -- became a foundational design pattern for the field.
This paper is foundational for the entire vision-language-action (VLA) driving research trajectory. The branched command-conditioned architecture became a durable design pattern that persists in modern VLA models, with the discrete four-command vocabulary simply replaced by natural language instructions. The progression from the discrete commands of conditional imitation learning (CIL) to LMDrive's natural language to EMMA's full prompt-driven interface is a direct evolutionary line, making this paper the proto-VLA for autonomous driving.
Key Contributions
- Branched command-conditioned architecture: Separate FC heads per high-level command (left/right/straight/follow), gated by command selection, outperforms naive command-input concatenation
- Conditional policy formulation: F(i, m, c) = A_c(J(i, m)), where the CNN encoder and measurement MLP jointly produce features J(i, m), and the command-specific branch A_c produces a 2D control output (steering + acceleration) -- the first clean formalization of intent-conditioned driving with multi-modal inputs
- Data augmentation with noise injection to handle the distribution shift problem inherent to imitation learning, where the policy encounters states not seen during expert demonstration
- Dual validation in simulation and real world: Evaluated in CARLA simulator and on a physical 1/5-scale truck in residential environments
- Identification of the multimodal output problem: Clearly demonstrated that identical visual inputs at intersections cause oscillation and wrong turns without intent conditioning
Architecture / Method
┌─────────────────────────────────────────────────────────────────┐
│             Branched Conditional Imitation Learning             │
│                                                                 │
│        ┌───────────────┐               ┌───────────────┐        │
│        │ Front Camera  │               │ Measurements  │        │
│        │ Image (i)     │               │ m (speed...)  │        │
│        └───────┬───────┘               └───────┬───────┘        │
│                ▼                               ▼                │
│        ┌───────────────┐               ┌───────────────┐        │
│        │ 8-layer CNN   │               │ MLP module    │        │
│        │ I(i)          │               │ M(m)          │        │
│        └───────┬───────┘               └───────┬───────┘        │
│                └───────────────┬───────────────┘                │
│                                ▼                                │
│                      ┌───────────────────┐                      │
│                      │ Joint repr J(i,m) │                      │
│                      └─────────┬─────────┘                      │
│                                │                                │
│        ┌───────────────┬───────┴───────┬───────────────┐        │
│        ▼               ▼               ▼               ▼        │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ A_left      │ │ A_right     │ │ A_straight  │ │ A_follow    │ │
│ │ Turn Left   │ │ Turn Right  │ │ Go Straight │ │ Follow Lane │ │
│ │ FC Branch   │ │ FC Branch   │ │ FC Branch   │ │ FC Branch   │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│        │               │               │               │        │
│        └───────────────┴────── c ──────┴───────────────┘        │
│                                ▼                                │
│                    (command selects branch)                     │
│                                ▼                                │
│                  ┌───────────────────────────┐                  │
│                  │ Output: steering + accel  │                  │
│                  │ (2D continuous action)    │                  │
│                  └───────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────┘


The architecture consists of two main components: a perception/measurement module and a set of command-conditional control modules. The perception module is an 8-layer convolutional network that takes a single front-facing camera image and extracts visual features. In addition to the image, the network also ingests low-dimensional measurements -- most importantly vehicle speed -- through a separate MLP. The visual features and measurement features are jointly combined into a shared representation J(i, m) before being passed to the command-conditional branches.
The control module uses a branched design. Rather than a single output head, there are separate fully-connected network branches for each high-level command: turn left, turn right, go straight, and follow lane. At inference time, the navigation system provides the current command c, and only the corresponding branch A_c is activated to produce the control output: a 2D action vector consisting of steering angle and a combined acceleration value (throttle/brake). The full policy is F(i, m, c) = A_c(J(i, m)). This branching allows each command-specific head to specialize in the action distribution for that maneuver type, avoiding the averaging problem that occurs when all maneuvers share parameters.
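A minimal NumPy sketch of the branched forward pass F(i, m, c) = A_c(J(i, m)) may help make the gating concrete. Everything below is illustrative rather than the paper's implementation: the layer sizes, the random weights, the single-matrix stand-ins for the CNN and the measurement MLP, the use of concatenation to form J(i, m), and the names `joint_features` and `policy` are all assumptions.

```python
# Minimal NumPy sketch of F(i, m, c) = A_c(J(i, m)); all sizes and weights are illustrative.
import numpy as np

COMMANDS = ("left", "right", "straight", "follow")
rng = np.random.default_rng(0)

IMG_FEAT, MEAS_FEAT, HIDDEN = 8, 4, 16  # hypothetical feature widths

# Single-matrix stand-ins for the perception CNN I(i) and the measurement MLP M(m).
W_img = rng.normal(size=(IMG_FEAT, 12))   # flattened 2x2x3 toy "image" -> features
W_meas = rng.normal(size=(MEAS_FEAT, 1))  # scalar speed -> features

# One small fully-connected head per command, each ending in 2 outputs.
branches = {
    c: (rng.normal(size=(HIDDEN, IMG_FEAT + MEAS_FEAT)), rng.normal(size=(2, HIDDEN)))
    for c in COMMANDS
}

def relu(x):
    return np.maximum(x, 0.0)

def joint_features(image, speed):
    """J(i, m): here assumed to be a concatenation of image and measurement features."""
    img_feat = relu(W_img @ image.ravel())
    meas_feat = relu(W_meas @ np.array([speed]))
    return np.concatenate([img_feat, meas_feat])

def policy(image, speed, command):
    """F(i, m, c): evaluate only the branch A_c selected by the command."""
    w1, w2 = branches[command]
    j = joint_features(image, speed)
    steer, accel = w2 @ relu(w1 @ j)
    return float(steer), float(accel)

image = rng.random((2, 2, 3))
for c in COMMANDS:
    # Same image and speed, different command -> different (steering, acceleration).
    print(c, policy(image, speed=5.0, command=c))
```

The shared features are computed once regardless of the command; only the dictionary lookup in `policy` changes which head produces the action, which is the gating shown in the diagram.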
The paper compares this branched architecture against two alternatives: (1) an unconditional model with no command input, which must handle all maneuvers with a single output head, and (2) a command-input model that concatenates a one-hot command encoding with the visual features before a shared output head. The branched design outperforms both.
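For contrast, the command-input alternative can be sketched as a single shared head that receives the one-hot command alongside the features. This is a deliberately simplified toy with a purely linear head; the feature width, the weights, and the names `one_hot` and `command_input_policy` are assumptions for illustration only.

```python
# Sketch of the "command input" baseline: one shared head, command fed in as a one-hot vector.
import numpy as np

COMMANDS = ("left", "right", "straight", "follow")

def one_hot(command):
    """Encode a command as a length-4 indicator vector."""
    vec = np.zeros(len(COMMANDS))
    vec[COMMANDS.index(command)] = 1.0
    return vec

def command_input_policy(features, command, w_shared):
    """Single shared linear head applied to [features ; one_hot(command)]."""
    x = np.concatenate([features, one_hot(command)])
    return w_shared @ x  # -> (steering, acceleration)

rng = np.random.default_rng(0)
features = rng.random(12)                 # stand-in for J(i, m); the width is hypothetical
w_shared = rng.normal(size=(2, 12 + 4))   # one weight matrix shared by all commands

left = command_input_policy(features, "left", w_shared)
right = command_input_policy(features, "right", w_shared)

# In this linear toy the command only adds an offset that ignores the image features:
print(left - right)
print(w_shared[:, 12 + 0] - w_shared[:, 12 + 1])  # same vector, up to rounding
```

A deeper shared head is more expressive than this linear toy, but every maneuver still shares one set of head parameters, which is the contrast with the branched design.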
Training uses standard behavioral cloning on expert demonstrations from CARLA, with two key augmentation strategies. First, viewpoint perturbation: cameras are placed at offset positions during data collection, with labels adjusted to steer back toward the center, creating recovery examples. Second, noise injection: small random perturbations are applied to the expert's controls during data collection, creating slightly off-policy states that the model learns to correct from.
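The two recovery-data strategies can be sketched roughly as follows. The correction magnitude, the noise scale, the sign convention, and the function names are placeholders rather than values from the paper, and the sketch assumes that the recorded label under noise injection is the expert's own action, not the perturbed one.

```python
# Sketch of the two recovery-data strategies; every constant here is a made-up placeholder.
import random

STEER_CORRECTION = 0.2  # hypothetical label shift for an offset camera

def clip(steer):
    """Keep steering inside the assumed normalized range [-1, 1]."""
    return max(-1.0, min(1.0, steer))

def viewpoint_samples(center_img, left_img, right_img, expert_steer):
    """Turn one time step into three training pairs with adjusted steering labels.

    Assumed sign convention: negative = steer left, positive = steer right.
    An image from the left-offset camera looks as if the car drifted left,
    so its label is pushed right (back toward the center), and vice versa.
    """
    return [
        (center_img, expert_steer),
        (left_img, clip(expert_steer + STEER_CORRECTION)),
        (right_img, clip(expert_steer - STEER_CORRECTION)),
    ]

def noisy_step(expert_steer, rng, noise_scale=0.1):
    """Perturb the control sent to the vehicle but keep the expert's action as the label."""
    executed = expert_steer + rng.gauss(0.0, noise_scale)
    label = expert_steer  # assumption: the injected noise is never used as a target
    return executed, label

rng = random.Random(0)
print(viewpoint_samples("img_c", "img_l", "img_r", expert_steer=0.0))
print(noisy_step(0.0, rng))
```

In both cases the dataset gains observations from slightly wrong poses paired with actions that lead back to the demonstrated trajectory, which is what plain behavioral cloning on clean demonstrations lacks.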
Results

CARLA Simulator Performance
| Method | Town 1 Success | Town 2 Success |
|---|---|---|
| Branched conditional model | 88% | 64% |
| Command input architecture | 78% | 52% |
| Non-conditional baseline | 20% | 26% |

Physical Robot Tests
| Configuration | Missed Turns | Human Interventions/Run |
|---|---|---|
| Branched model | 0% | 0.67 |
| Without noise injection | 24.4% | 8.67 |
| Without data augmentation | 73% | 39 |
- Conditioning on the command is decisive in CARLA: the branched model reaches 88% (Town 1) and 64% (Town 2) success versus 20% and 26% for the non-conditional baseline, resolving the oscillation and wrong-turn failure modes that plague unconditional policies at intersections
- Branched architecture outperforms command-input architecture (88% vs. 78% in Town 1, 64% vs. 52% in Town 2): separate heads per command allow each branch to specialize in one maneuver type, yielding cleaner action predictions
- Real-world transfer demonstrated: successfully trained a 1/5-scale truck to drive in residential environments following high-level commands, showing the approach is not limited to simulation
- Noise injection mitigates distribution shift: perturbing the expert's controls during data collection teaches the model to recover from small deviations; on the physical truck, removing it raises missed turns from 0% to 24.4% and human interventions per run from 0.67 to 8.67
- Ablation studies confirm that both the branched architecture and noise injection are independently important, with their combination providing the best performance
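The gaps behind these bullets can be read directly off the CARLA table. The short script below only does arithmetic on the reported success rates; the dictionary layout and the `gap` helper are illustrative.

```python
# Success rates (percent) copied from the CARLA table above.
carla = {
    "branched": {"town1": 88, "town2": 64},
    "command_input": {"town1": 78, "town2": 52},
    "non_conditional": {"town1": 20, "town2": 26},
}

def gap(a, b, town):
    """Absolute difference in success rate, in percentage points."""
    return carla[a][town] - carla[b][town]

for town in ("town1", "town2"):
    print(
        town,
        "branched vs command input:", gap("branched", "command_input", town),
        "| branched vs non-conditional:", gap("branched", "non_conditional", town),
    )

# Change in success rate when moving from Town 1 to Town 2, per method.
for name, rates in carla.items():
    print(name, rates["town1"] - rates["town2"])
```

The branched model leads the command-input model by 10 points in Town 1 and 12 points in Town 2, and both conditional models lose roughly a quarter of their success rate between the two towns (24 and 26 points).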
Limitations & Open Questions
- The "language" interface is a predefined vocabulary of four commands, not free-form natural language -- the gap from discrete commands to natural language instructions remained open for years
- Inherits fundamental imitation learning distribution shift and generalization issues despite noise injection -- compounding errors in long-horizon driving remain problematic
- No reasoning or explanation capability -- the policy is a black box that maps (image, measurements, command) to controls, providing no interpretability