Textual Explanations for Self-Driving Vehicles
Overview
End-to-end driving models produce control signals without any rationale, making them opaque and untrustworthy for safety-critical deployment. This paper by Kim et al. (ECCV 2018) addresses this by creating BDD-X -- the first large-scale benchmark for explainable autonomous driving -- and proposing an attention-alignment mechanism that grounds natural-language driving explanations in the same visual regions that influenced the control decision.
The key philosophical contribution is the distinction between introspective explanations and post-hoc rationalizations. An introspective explanation is causally related to the decision process (the model explains what it actually attended to), while a rationalization merely generates plausible text about the scene that may have no connection to what drove the decision. The attention alignment loss explicitly encourages the former: the explanation generator's attention map is regularized to match the controller's spatial attention map, tying what the model talks about in its explanation to what it actually looked at for control.
BDD-X became the standard benchmark for explainable driving research, directly reused by DriveGPT4, DriveLM, and many subsequent works. The attention-alignment paradigm established here remains influential, though modern VLA systems have shifted from separate controller+explainer architectures toward unified models where language and action share the same representation, making the faithfulness question even more central.
Key Contributions
- BDD-X dataset: Built on BDD100K with 6,984 video clips (over 77 hours, over 26K annotations) paired with human-written action descriptions ("car is slowing down") and justifications ("because a pedestrian is crossing"), establishing the standard testbed for explainable driving
- Dual-model architecture: Visual attention-based controller (images to steering/acceleration with spatial attention maps) coupled with a video-to-text explanation generator (encoder-decoder with temporal and spatial attention)
- Two attention alignment strategies: Strongly Aligned Attention (SAA) directly reuses the controller's attention map in the explanation generator; Weakly Aligned Attention (WAA) trains a separate attention mechanism in the explainer but constrains it via a KL-divergence loss toward the controller's attention map. WAA achieves the best explanation quality in both automatic metrics and human evaluations.
- Introspective vs rationalization distinction: Formally separates causally-grounded explanations from post-hoc plausible-but-not-causal text generation, a distinction that remains philosophically important in 2025 VLA research
- First language-output driving system with explicit visual grounding: Prior work generated explanations without any mechanism to ensure they were related to the decision process
Architecture / Method
BDD-X: Attention-Aligned Controller + Explainer
```
        Video Frames [f1, f2, ..., fT]
                       │
          ┌────────────┴─────────────┐
          ▼                          ▼
┌─────────────────────┐   ┌───────────────────────┐
│ Vehicle Controller  │   │ Explanation Generator │
│                     │   │                       │
│   CNN (5-layer)     │   │   CNN Encoder         │
│        │            │   │        │              │
│        ▼            │   │        ▼              │
│  Spatial Attention  │   │   LSTM Decoder        │
│       A_ctrl        │   │   + Attention A_expl  │
│        │            │   │        │              │
│        ▼            │   │        ▼              │
│  LSTM ──► Control   │   │  Word-by-word text    │
│   (steer, speed)    │   │  "stopping because    │
└─────────┬───────────┘   │   of pedestrian"      │
          │               └───────────┬───────────┘
          │               ┌───────────┘
          │               │
          │   ┌───────────▼─────────────┐
          └──►│ Alignment (two modes):  │
              │  SAA: A_expl = A_ctrl   │
              │  WAA: L_align =         │
              │    KL(A_ctrl || A_expl) │
              └─────────────────────────┘

Total Loss: L = L_control + λ_text * L_text + λ_align * L_align
```
The system has two components trained jointly. The vehicle controller takes a sequence of frames and produces steering angle and speed predictions. It uses a five-layer CNN (without max-pooling, similar to Bojarski et al.) with spatial attention to produce attention-weighted visual features, which are processed by an LSTM to generate control signals. The spatial attention map A_ctrl indicates which image regions the controller relies on for its decisions.
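A minimal PyTorch sketch of this controller path, with illustrative layer widths and hidden sizes (the exact CNN configuration and attention parameterization are assumptions, not the paper's published setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionController(nn.Module):
    """Sketch of the visual-attention controller: a strided CNN yields a
    feature grid, a softmax map A_ctrl weights the grid, and an LSTM turns
    the attended feature into control outputs."""

    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Five conv layers with stride instead of max-pooling,
        # in the spirit of Bojarski et al. (sizes are illustrative).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ELU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ELU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ELU(),
            nn.Conv2d(48, 64, 3), nn.ELU(),
            nn.Conv2d(64, feat_dim, 3), nn.ELU(),
        )
        self.score = nn.Linear(feat_dim + hidden_dim, 1)  # attention scores per cell
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 2)              # (steering, speed)

    def forward(self, frames):
        # frames: (T, 3, H, W) -- one clip, processed frame by frame
        h = frames.new_zeros(1, self.hidden_dim)
        c = frames.new_zeros(1, self.hidden_dim)
        controls, attn_maps = [], []
        for t in range(frames.size(0)):
            grid = self.cnn(frames[t:t + 1])            # (1, D, h', w')
            feats = grid.flatten(2).transpose(1, 2)     # (1, N, D), N = h' * w'
            h_rep = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            a_ctrl = F.softmax(self.score(torch.cat([feats, h_rep], -1)), dim=1)
            context = (a_ctrl * feats).sum(dim=1)       # attention-weighted feature
            h, c = self.lstm(context, (h, c))
            controls.append(self.head(h))               # control output for frame t
            attn_maps.append(a_ctrl.squeeze(-1))        # keep A_ctrl for alignment
        return torch.stack(controls), torch.stack(attn_maps)
```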
The explanation generator is an encoder-decoder model. The encoder processes the same video frames through a CNN, and the decoder generates natural-language explanations word by word using an LSTM with attention over the visual features. The decoder's attention map A_expl indicates which image regions the explanation references.
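A matching sketch of the decoder side, under the same caveats (the attention form, dimensions, and single-step interface are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplanationDecoder(nn.Module):
    """Sketch of the explanation generator's decoder: an LSTM emits one
    word at a time, attending over the visual features with its own
    spatial map A_expl."""

    def __init__(self, vocab_size, feat_dim=64, hidden_dim=128, emb_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.score = nn.Linear(feat_dim + hidden_dim, 1)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_id, feats, h, c):
        # word_id: (1,) previous word; feats: (1, N, D) visual features
        h_rep = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        a_expl = F.softmax(self.score(torch.cat([feats, h_rep], -1)), dim=1)
        context = (a_expl * feats).sum(dim=1)            # (1, D)
        h, c = self.lstm(torch.cat([self.embed(word_id), context], -1), (h, c))
        return self.out(h), a_expl.squeeze(-1), h, c     # logits, A_expl, state
```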
The paper introduces two attention alignment strategies. In Strongly Aligned Attention (SAA), the explanation generator directly reuses the controller's attention map, guaranteeing the two components look at identical regions. In Weakly Aligned Attention (WAA), the explanation generator has its own separate attention network but is guided by a KL-divergence loss L_align = KL(A_ctrl || A_expl), which encourages alignment while preserving flexibility. WAA outperforms SAA in both automatic NLG metrics and human evaluations. The total loss is L = L_control + λ_text * L_text + λ_align * L_align, where L_control is the driving loss (MSE on steering/speed), L_text is the language generation loss (cross-entropy), and L_align is the attention alignment regularizer (KL divergence in the WAA variant).
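A sketch of the alignment term and the combined objective; the KL expression mirrors the formula above, while the reduction, the stand-in control/text losses, and the λ values are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def alignment_loss(a_ctrl, a_expl, mode="WAA", eps=1e-8):
    # a_ctrl, a_expl: (N,) attention distributions over spatial cells.
    if mode == "SAA":
        # Strongly aligned: the explainer reuses A_ctrl directly,
        # so there is no separate map to penalize.
        return a_ctrl.new_zeros(())
    # Weakly aligned: KL(A_ctrl || A_expl) pulls the explainer's
    # attention toward the controller's while leaving it free to differ.
    return (a_ctrl * ((a_ctrl + eps).log() - (a_expl + eps).log())).sum()

# Toy usage with random maps over a 10x20 feature grid:
a_ctrl = torch.softmax(torch.randn(200), dim=0)
a_expl = torch.softmax(torch.randn(200), dim=0)

# L = L_control + lambda_text * L_text + lambda_align * L_align,
# with placeholder control/text losses standing in for real batches:
lambda_text, lambda_align = 1.0, 1.0   # weights; values are assumptions
L_control = F.mse_loss(torch.zeros(2), torch.zeros(2))               # MSE stand-in
L_text = F.cross_entropy(torch.randn(1, 100), torch.tensor([3]))     # CE stand-in
L = L_control + lambda_text * L_text + lambda_align * alignment_loss(a_ctrl, a_expl)
```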
The intended effect is that when the explanation says "stopping because of pedestrian on right," the explanation generator is actually attending to the pedestrian on the right, and that same region is what the controller used for its braking decision.
Results
- WAA achieves best explanation quality: in human evaluation, the Weakly Aligned Attention model achieved 66.0% correctness for explanations and 93.5% for descriptions, outperforming both SAA and the rationalization baseline
- BDD-X provides a reusable benchmark that became the de facto testbed for explainable driving research, directly used by DriveGPT4, Reason2Drive, and others
- Explanations are grounded in controller attention regions, providing evidence that the language output is connected to the actual decision process rather than arbitrary scene narration
- Standard NLG metrics (BLEU, METEOR, CIDEr) show improvements with attention alignment, though the human evaluation results are more meaningful for assessing faithfulness
- Qualitative examples demonstrate that aligned explanations correctly identify the causal factors (pedestrians, traffic lights, lead vehicles) while unaligned explanations sometimes reference irrelevant scene elements
Limitations & Open Questions
- Aligned attention does not definitively prove causal correctness -- the faithfulness problem means aligned language can still be plausible yet not truly decision-causing, since the alignment is a soft regularizer rather than a hard guarantee
- Offline evaluation only with no closed-loop safety validation of the driving component; the control model is simple and not competitive with modern driving systems
- Language quality is limited by template-like descriptions in BDD-X, which constrains the diversity and naturalness of generated explanations
- Language is output only, not input -- no instruction following capability, limiting the system to explanation rather than interaction
Connections
- Autonomous Driving
- Vision Language Action
- DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
- DriveLM: Driving with Graph Visual Question Answering
- Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving
- Talk2Car: Taking Control of Your Self-Driving Car