BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

Overview

BEVDiffuser addresses a fundamental but under-explored problem in BEV-based perception: the inherent noise in BEV feature maps caused by sensor limitations and the learning process itself. Rather than designing better BEV encoders, BEVDiffuser takes the novel approach of denoising existing BEV features using a conditional diffusion model guided by ground-truth object layouts.

The critical design insight is that BEVDiffuser operates only during training: it provides denoised BEV features as supervision targets for the base model, then is completely removed at inference. This means zero additional computational overhead at deployment while the base model permanently benefits from training against cleaner targets. The approach is architecture-agnostic -- it works as a plug-and-play module for any BEV detector.

Applied to BEVFormer-tiny, BEVDiffuser yields 12.3% mAP improvement and 10.1% NDS gain on nuScenes 3D object detection, with particularly dramatic improvements on long-tail objects (24-30% mAP gains on construction vehicles and buses) and challenging conditions (20-29% mAP improvement in night scenarios).

Key Contributions

Training-only diffusion denoising: A diffusion model that enhances BEV features during training but is removed at inference, achieving zero-overhead improvement
Ground-truth layout guidance: Uses clean object layouts (category IDs + 3D bounding boxes) as conditioning instead of noisy learned features, providing an oracle-quality denoising target
Universal plug-and-play: Works across different BEV architectures (BEVFormer, BEVFormerV2, BEVFusion) without requiring any architectural modifications
Long-tail and adverse condition robustness: Dramatically improves detection of rare objects and performance in night/weather scenarios

Architecture / Method

BEV feature map comparison showing noise reduction

BEVDiffuser architecture

Plug-and-play training scheme

┌──────────────────── TRAINING ONLY ────────────────────────────┐
│                                                               │
│  Multi-Camera        GT Layout (cat ID + 3D bbox)             │
│  Images                    │                                  │
│    │                       ▼                                  │
│    ▼              ┌────────────────┐                           │
│  ┌──────────┐     │ Layout Encoder │                           │
│  │ BEV      │     │ (global scene  │                           │
│  │ Encoder  │     │  + per-object) │                           │
│  │ (any)    │     └───────┬────────┘                           │
│  └────┬─────┘             │                                   │
│       │                   ▼                                   │
│       │         ┌───────────────────┐                          │
│  Noisy BEV ───► │    BEVDiffuser    │                          │
│  features x_t0  │  (Conditional     │                          │
│       │         │   Diffusion U-Net │                          │
│       │         │   + Cross-Attn)   │                          │
│       │         └────────┬──────────┘                          │
│       │                  │                                    │
│       │           Denoised BEV                                │
│       │                  │                                    │
│       ▼                  ▼                                    │
│  ┌──────────────────────────────┐                              │
│  │  L_total = L_diffusion(MSE)  │                              │
│  │          + λ·L_task          │                              │
│  │  L_BEV   = L_task            │                              │
│  │          + α·L(denoised,orig)│                              │
│  └──────────────────────────────┘                              │
│                                                               │
├──────────────────── INFERENCE ────────────────────────────────┤
│                                                               │
│  Multi-Camera ──► BEV Encoder ──► Detection Head              │
│  (BEVDiffuser completely removed, zero overhead)              │
└───────────────────────────────────────────────────────────────┘

Core Components: - Conditional diffusion model: U-Net with transformer-based layout fusion - Ground-truth layout representation: Objects encoded as sets of (category ID, normalized 3D bounding box coordinates) - Dual-level conditioning: (1) Global scene context via a "virtual unit cube" object representing the full scene, (2) Local object-aware information through cross-attention between U-Net features and individual object encodings

Training Strategy: 1. BEVDiffuser targets the initial encoder BEV features (x_t0) as the denoising objective 2. Joint loss: L_total = L_diffusion(MSE) + lambda * L_task 3. Base model supervised via denoised outputs: L_BEV = L_task + alpha * L_BEV(denoised vs original)

Inference: BEVDiffuser is completely removed. The enhanced base model operates with its original architecture and computational cost.

Generative capability: BEVDiffuser can also generate BEV features from pure noise conditioned on layouts, achieving 41.1% NDS and 36.7% mAP, suggesting potential for data augmentation and world model applications.

Results

Model + BEVDiffuser	mAP Gain	NDS Gain
BEVFormer-tiny	+12.3%	+10.1%
BEVFormerV2	+13.5%	+8.8%
BEVFusion	Consistent gains	Consistent gains

Long-tail object detection (BEVFormer-tiny):

Category	mAP Improvement
Construction vehicle	+24.1%
Bus	+29.5%

Challenging conditions:

Condition	mAP Gain (tiny)	mAP Gain (V2)
Night	+20.0%	+28.9%
Weather	Improvements across all conditions	-

Generative performance: 41.1% NDS and 36.7% mAP generating BEV features from pure noise with layout conditioning.

Computational cost: Zero inference-time overhead; maintains baseline FPS.

Limitations

Requires ground-truth 3D bounding boxes during training, which are expensive to annotate; cannot be used with unlabeled data
The "denoised" BEV features may not represent true ground-truth BEV; the diffusion model learns a proxy that may introduce its own biases
Evaluated only on nuScenes; generalization to other datasets and sensor configurations is unvalidated
The generative capability (BEV from noise) is preliminary; whether generated BEV features are diverse and realistic enough for actual data augmentation is unexplored
Training time increases due to the additional diffusion model, even though inference is unchanged

Connections

Bevformer Learning Birds Eye View Representation From Multi Camera Images Via Spatiotemporal Transformers -- BEVFormer is the primary base model enhanced by BEVDiffuser
Lift Splat Shoot Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D -- LSS established the BEV generation paradigm that BEVDiffuser denoises
Gaussianlss Toward Real World Bev Perception With Depth Uncertainty Via Gaussian Splatting -- Both address BEV quality; GaussianLSS improves generation, BEVDiffuser improves post-hoc via denoising
Denoising Diffusion Probabilistic Models -- DDPM provides the diffusion foundation that BEVDiffuser adapts to BEV feature denoising
Perception -- Addresses BEV feature quality, a fundamental perception bottleneck
Autonomous Driving -- Improves detection robustness for safety-critical driving