
Open Questions: BEV Perception & 3D Occupancy

Stream-specific open questions for the BEV perception and 3D occupancy pillar. See Open Questions for the full tree across all streams.

Representation paradigm

  1. Dense vs. sparse vs. Gaussian: BEVNeXt revived dense BEV to 64.2 NDS SOTA. SparseOcc introduced fully sparse prediction. GaussianFormer represents scenes as sparse 3D Gaussians (5-6x memory reduction). GaussianFormer-2 reduces overlap by 64% with probabilistic superposition. Which paradigm will win at production scale: the accuracy of dense, the efficiency of sparse, or the flexibility of Gaussians?
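The Gaussian paradigm above can be sketched in a few lines: a scene is a small set of 3D Gaussians, and occupancy at any query point is read off the mixture. This is a toy sketch only; the isotropic covariances and the noisy-OR combination are illustrative assumptions, not GaussianFormer-2's exact probabilistic superposition.

```python
import numpy as np

def occupancy_from_gaussians(query_pts, means, scales, opacities):
    """Occupancy probability at query points from sparse isotropic 3D
    Gaussians (simplified; real methods use full covariances and carry
    semantic logits per Gaussian)."""
    # Squared Mahalanobis distance for isotropic Gaussians, shape (Q, N)
    d2 = ((query_pts[:, None, :] - means[None, :, :]) ** 2).sum(-1) \
         / scales[None, :] ** 2
    g = opacities[None, :] * np.exp(-0.5 * d2)   # per-Gaussian contribution
    # Treat each contribution as an independent "occupied" probability and
    # combine with a noisy-OR (one way to superpose overlapping Gaussians)
    return 1.0 - np.prod(1.0 - np.clip(g, 0.0, 1.0), axis=1)

# Tiny scene: two Gaussians; query at one center and far from both
means = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
scales = np.array([0.5, 0.5])
opacities = np.array([0.9, 0.9])
q = np.array([[0.0, 0.0, 0.0], [20.0, 20.0, 20.0]])
occ = occupancy_from_gaussians(q, means, scales, opacities)
# occ[0] is high (at a Gaussian center), occ[1] is near zero
```

The memory argument is visible here: a few thousand Gaussians can stand in for millions of dense voxels, because free space costs nothing.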

  2. Voxels vs. world models: OccWorld and Drive-OccWorld predict future occupancy states for planning. GaussianWorld models scene evolution. Is occupancy prediction converging toward world models (predict future states) rather than single-frame reconstruction?

  3. Self-supervised occupancy: SelfOcc eliminated the 3D annotation bottleneck. GaussianOcc achieves fully self-supervised estimation without even ground-truth poses. GaussRender uses foundation model alignment for zero-shot semantics. Will self-supervised methods close the gap to supervised, making dense 3D annotation obsolete?

Efficiency and deployment

  1. Linear-complexity architectures: OccMamba replaces transformer quadratic attention with linear state-space models (+5.1% IoU, 65% faster). Is Mamba-style linear modeling the path to real-time 3D occupancy, or are efficient transformer variants (FlashAttention, sparse attention) sufficient?
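The complexity claim rests on a linear recurrence: a state-space layer makes one pass over the sequence, O(T), where self-attention pays O(T²). A minimal diagonal-SSM scan (toy; Mamba adds input-dependent, selective parameters, which this sketch omits):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence: h_t = A*h_{t-1} + B*x_t,
    y_t = C . h_t.  One sweep over the sequence => linear in its length."""
    T = x.shape[0]
    h = np.zeros_like(A)          # hidden state, one value per channel
    y = np.empty(T)
    for t in range(T):
        h = A * h + B * x[t]      # linear recurrence (diagonal A)
        y[t] = (C * h).sum()      # readout
    return y

x = np.ones(8)                    # constant input sequence
A = np.full(4, 0.5)               # per-channel decay
B = np.ones(4)
C = np.full(4, 0.25)
y = ssm_scan(x, A, B, C)          # monotonically approaches 2.0
```

The same structure applies whether the "sequence" is tokens, voxel columns, or a serialized 3D grid, which is what makes it attractive for large occupancy volumes.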

  2. Training-only augmentation: BEVDiffuser uses a diffusion model for BEV denoising that is removed at inference (zero overhead, +12.3% mAP). FlashOcc's Channel-to-Height plugin replaces 3D convolutions with 2D processing. How much deployment-free training enrichment can close the gap between efficient and accurate architectures?
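The Channel-to-Height idea is essentially a reshape: a purely 2D BEV pipeline produces a feature map whose channel axis is reinterpreted as (features x height bins), yielding a voxel grid with no 3D convolution. A minimal sketch of that reinterpretation (names and shapes are illustrative, not FlashOcc's exact layout):

```python
import numpy as np

def channel_to_height(bev_feat, num_z):
    """Reinterpret a 2D BEV feature map's channel axis as height bins,
    turning (C, X, Y) into (C // num_z, Z, X, Y) with zero extra compute."""
    c, x, y = bev_feat.shape
    assert c % num_z == 0, "channels must factor into height bins"
    return bev_feat.reshape(c // num_z, num_z, x, y)

bev = np.random.randn(64, 200, 200)   # channels, BEV grid X, Y
vox = channel_to_height(bev, num_z=16)
# vox.shape == (4, 16, 200, 200): per-voxel features from a 2D-only pipeline
```

Because the reshape is free, all of the "3D" cost is paid by the 2D backbone, which is exactly the deployment appeal.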

  3. Real-time occupancy budget: For production driving at 10+ Hz, what is the acceptable quality/latency trade-off for 3D occupancy? Is 20 mIoU at 30 FPS more valuable than 40 mIoU at 5 FPS?

View transformation and depth

  1. Forward vs. backward view transforms: LSS (forward, geometric) vs. BEVFormer (backward, learned attention) vs. FB-BEV (both fused). Has FB-BEV settled this debate, or does the optimal approach depend on deployment constraints (compute, latency, accuracy requirements)?
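The forward (LSS-style) transform can be sketched compactly. This toy version flattens everything to one camera and a 2D ground-plane BEV; the ray parameterization and grid indexing are illustrative assumptions, not LSS's actual frustum geometry:

```python
import numpy as np

def lift_splat(feats, depth_logits, rays, depth_bins, grid, res):
    """Each pixel predicts a categorical depth distribution; its feature is
    weighted by that distribution along the ray ("lift"), and the weighted
    points are sum-pooled into BEV cells ("splat")."""
    n_pix, c = feats.shape
    depth_p = np.exp(depth_logits)
    depth_p /= depth_p.sum(axis=1, keepdims=True)   # softmax over depth bins
    bev = np.zeros((grid, grid, c))
    for i in range(n_pix):
        for d, p in zip(depth_bins, depth_p[i]):
            pt = rays[i] * d                        # point along the ray
            gx = int(pt[0] / res) + grid // 2       # ego-centered lateral index
            gy = int(pt[1] / res)                   # forward index
            if 0 <= gx < grid and 0 <= gy < grid:
                bev[gx, gy] += p * feats[i]         # weighted splat
    return bev

rays = np.array([[0.0, 1.0], [0.6, 0.8]])           # unit ray directions
feats = np.ones((2, 8))                             # per-pixel features
logits = np.zeros((2, 4))                           # uniform depth belief
bev = lift_splat(feats, logits, rays, np.array([2.0, 4.0, 6.0, 8.0]),
                 grid=10, res=1.0)
```

A backward transform inverts the loop: each BEV cell queries image features via attention instead of pixels pushing features outward, which is why the two approaches have such different compute and memory profiles.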

  2. Depth estimation bottleneck: BEVNeXt's CRF-modulated depth and GaussianLSS's explicit depth uncertainty both improve depth quality. BEVFormer v2's perspective supervision adapts any backbone to 3D. Is monocular depth estimation still the primary bottleneck for camera-only 3D perception?

Occupancy in E2E systems

  1. Occupancy role in planning: Drive-OccWorld shows a 33% L2 error reduction when planning against predicted occupancy. OccGen applies diffusion-based generation to occupancy prediction (+9.5-13.3% over discriminative baselines). But E2E VLA systems (EMMA, DriveTransformer) bypass explicit occupancy. Is occupancy a necessary intermediate representation, or will it be absorbed into learned E2E representations?


  2. Evaluation metrics: SparseOcc's RayIoU became a community standard. But does mIoU or RayIoU actually correlate with downstream driving quality? No paper has rigorously tested this. The perception→planning alignment gap applies to occupancy metrics as much as detection metrics.
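To make the metric question concrete, here is a heavily simplified RayIoU-style score. Each query ray is reduced to a row of occupancy bins along depth; the real RayIoU also checks the semantic class and thresholds metric distance rather than bin index, so treat this as a sketch of the idea only:

```python
import numpy as np

def first_hit(ray):
    """Index of the first occupied bin along a ray, or -1 if the ray is free."""
    idx = int(np.argmax(ray))    # argmax returns the first True for booleans
    return idx if ray[idx] else -1

def ray_iou(pred, gt, tol=1):
    """A ray is a true positive when prediction and ground truth both hit and
    the first-hit depth error is within `tol` bins; the score is
    TP / (pred hits + gt hits - TP), an IoU over rays rather than voxels."""
    tp = p_hits = g_hits = 0
    for p_ray, g_ray in zip(pred, gt):
        ph, gh = first_hit(p_ray), first_hit(g_ray)
        p_hits += ph >= 0
        g_hits += gh >= 0
        if ph >= 0 and gh >= 0 and abs(ph - gh) <= tol:
            tp += 1
    return tp / max(p_hits + g_hits - tp, 1)

pred = np.array([[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]], dtype=bool)
gt   = np.array([[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0]], dtype=bool)
score = ray_iou(pred, gt)        # two near-miss TPs, one missed gt ray
```

Ray-based scoring rewards getting visible surfaces right and ignores unobservable interior voxels, which is the motivation behind it; whether it tracks planning quality remains the open question above.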

Partially answered

  • Q1 (Dense vs. sparse vs. Gaussian): GaussianFormer-2's probabilistic superposition and OccMamba's linear SSMs suggest the field is moving away from dense voxels. But BEVNeXt's dense SOTA shows that dense methods aren't dead.
  • Q3 (Self-supervised): GaussianOcc and GaussRender demonstrate fully self-supervised is viable. The gap to supervised is closing but remains significant for fine-grained semantic classes.
  • Q7 (View transforms): FB-BEV showed both are complementary. The practical answer is likely deployment-dependent.

Key papers for this stream

  • Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D (LSS: foundational BEV paradigm)
  • BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers (learned BEV via deformable attention)
  • GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction (Gaussian scene representation)
  • GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction (probabilistic Gaussian superposition)
  • OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving (occupancy world model)
  • OccMamba: Semantic Occupancy Prediction with State Space Models (linear-complexity occupancy)
  • BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection (dense BEV revival)
  • SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving (foundational occupancy prediction)
  • FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin (efficient 2D-only occupancy)
  • SparseOcc: Fully Sparse 3D Occupancy Prediction (sparse occupancy + RayIoU metric)