Open Questions: BEV Perception & 3D Occupancy
Stream-specific open questions for the BEV perception and 3D occupancy pillar. See Open Questions for the full tree across all streams.
Representation paradigm
- Dense vs. sparse vs. Gaussian: BEVNeXt revived dense BEV to a 64.2 NDS SOTA. SparseOcc introduced fully sparse prediction. GaussianFormer represents scenes as sparse 3D Gaussians (5-6x memory reduction), and GaussianFormer-2 reduces Gaussian overlap by 64% with probabilistic superposition. Which paradigm will win at production scale: the accuracy of dense, the efficiency of sparse, or the flexibility of Gaussians?
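The memory argument behind the Gaussian paradigm can be sketched with back-of-envelope arithmetic. The grid size, class count, and Gaussian parameterization below are illustrative assumptions, not numbers taken from the cited papers:

```python
# Back-of-envelope memory comparison: dense semantic voxel grid vs. a
# sparse set of 3D Gaussians. All sizes are illustrative assumptions.

def dense_voxel_bytes(x, y, z, channels, bytes_per_val=2):
    """Dense grid: every cell stores a per-class logit vector (fp16)."""
    return x * y * z * channels * bytes_per_val

def gaussian_bytes(n_gaussians, params_per_gaussian=28, bytes_per_val=2):
    """Each Gaussian: mean (3) + scale (3) + rotation (4) + opacity (1)
    + semantic logits (e.g. 17) ~= 28 scalars in fp16 (assumed layout)."""
    return n_gaussians * params_per_gaussian * bytes_per_val

dense = dense_voxel_bytes(200, 200, 16, channels=17)  # nuScenes-style grid
sparse = gaussian_bytes(25_600)                       # e.g. 25.6k Gaussians
print(f"dense:  {dense / 1e6:.2f} MB")
print(f"sparse: {sparse / 1e6:.2f} MB")
```

The dense cost scales with scene volume regardless of how empty it is; the Gaussian cost scales with scene content, which is where the reported reductions come from.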
- Voxels vs. world models: OccWorld and Drive-OccWorld predict future occupancy states for planning. GaussianWorld models scene evolution. Is occupancy prediction converging toward world models (predict future states) rather than single-frame reconstruction?
- Self-supervised occupancy: SelfOcc eliminated the 3D annotation bottleneck. GaussianOcc achieves fully self-supervised estimation without even ground-truth poses. GaussRender uses foundation model alignment for zero-shot semantics. Will self-supervised methods close the gap to supervised, making dense 3D annotation obsolete?
Efficiency and deployment
- Linear-complexity architectures: OccMamba replaces the transformer's quadratic attention with linear state-space models (+5.1% IoU, 65% faster). Is Mamba-style linear sequence modeling the path to real-time 3D occupancy, or are efficient transformer variants (FlashAttention, sparse attention) sufficient?
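The complexity gap at stake here can be illustrated with a toy comparison: a diagonal state-space recurrence touches each token once (O(N) time, constant state), while vanilla self-attention materializes an N x N score matrix. The scalar recurrence parameters are toy stand-ins for Mamba's input-dependent, learned parameters:

```python
import numpy as np

def linear_ssm_scan(x, a=0.9, b=0.1):
    """Toy diagonal state-space recurrence: h_t = a*h_{t-1} + b*x_t.
    One pass over the sequence -> O(N) in sequence length, O(1) state."""
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:              # N steps, constant work per step
        h = a * h + b * x_t
        out.append(h)
    return np.stack(out)

def attention(x):
    """Vanilla self-attention over the same tokens -> O(N^2) score matrix."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # (N, N): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

x = np.random.default_rng(0).standard_normal((4096, 32))
y_ssm = linear_ssm_scan(x)   # state size fixed regardless of N
y_attn = attention(x)        # materializes a 4096 x 4096 matrix
print(y_ssm.shape, y_attn.shape)
```

For 3D occupancy the token count N is the number of voxels or queries, which is exactly where the quadratic term becomes prohibitive at real-time rates.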
- Training-only augmentation: BEVDiffuser uses a diffusion model for BEV denoising that is removed at inference (zero overhead, +12.3% mAP). FlashOcc's Channel-to-Height plugin replaces 3D convolutions with purely 2D processing. How far can such deployment-free training enrichment go toward closing the gap between efficient and accurate architectures?
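The Channel-to-Height idea is essentially a reshape: a 2D network predicts a BEV feature map whose channel axis is then folded into (classes, height bins), yielding a 3D voxel volume without any 3D convolutions. A minimal sketch of that step, with illustrative shapes:

```python
import numpy as np

def channel_to_height(bev, z_bins):
    """FlashOcc-style Channel-to-Height: fold part of the channel axis of a
    2D BEV feature map into a vertical (height) axis, so a purely 2D
    network can still emit a 3D voxel volume. Shapes are illustrative."""
    b, c, h, w = bev.shape
    assert c % z_bins == 0, "channels must factor into (classes, z_bins)"
    # (B, C, H, W) -> (B, C // Z, Z, H, W): no 3D convolutions anywhere.
    return bev.reshape(b, c // z_bins, z_bins, h, w)

bev = np.zeros((1, 17 * 16, 200, 200), dtype=np.float32)  # 17 classes x 16 z-bins
vox = channel_to_height(bev, z_bins=16)
print(vox.shape)  # (1, 17, 16, 200, 200)
```

All the learned capacity stays in 2D operators, which deploy cheaply on existing accelerators; the 3D structure is recovered for free at the output.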
- Real-time occupancy budget: For production driving at 10+ Hz, what is the acceptable quality/latency trade-off for 3D occupancy? Is 20 mIoU at 30 FPS more valuable than 40 mIoU at 5 FPS?
View transformation and depth
- Forward vs. backward view transforms: LSS (forward, geometric) vs. BEVFormer (backward, learned attention) vs. FB-BEV (both fused). Has FB-BEV settled this debate, or does the optimal approach depend on deployment constraints (compute, latency, accuracy requirements)?
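The forward (LSS-style) direction can be sketched as three steps: softmax a per-pixel depth distribution, take its outer product with the pixel's image features ("lift"), then scatter-add the lifted features into BEV cells ("splat"). In a real pipeline the cell indices come from camera intrinsics/extrinsics; this toy sketch fakes them with random indices:

```python
import numpy as np

def lift_splat(feats, depth_logits, cell_ids, n_cells):
    """Toy LSS-style forward view transform for one camera.
    feats:        (P, C)  per-pixel image features
    depth_logits: (P, D)  per-pixel depth-bin logits
    cell_ids:     (P, D)  BEV cell index per (pixel, depth bin), normally
                          precomputed from camera geometry (faked here).
    """
    depth = np.exp(depth_logits - depth_logits.max(axis=1, keepdims=True))
    depth /= depth.sum(axis=1, keepdims=True)          # (P, D) softmax "lift"
    lifted = depth[:, :, None] * feats[:, None, :]     # (P, D, C) outer product
    bev = np.zeros((n_cells, feats.shape[1]))
    # "Splat": unbuffered scatter-add of lifted features into BEV cells.
    np.add.at(bev, cell_ids.reshape(-1), lifted.reshape(-1, feats.shape[1]))
    return bev

rng = np.random.default_rng(0)
P, D, C, n_cells = 64, 8, 16, 100
bev = lift_splat(rng.standard_normal((P, C)),
                 rng.standard_normal((P, D)),
                 rng.integers(0, n_cells, size=(P, D)), n_cells)
print(bev.shape)  # (100, 16)
```

The backward direction inverts the data flow: BEV queries attend back into image features, so depth is learned implicitly rather than predicted explicitly, which is exactly the design axis FB-BEV fuses.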
- Depth estimation bottleneck: BEVNeXt's CRF-modulated depth and GaussianLSS's explicit depth uncertainty both improve depth quality. BEVFormer v2's perspective supervision adapts any backbone to 3D. Is monocular depth estimation still the primary bottleneck for camera-only 3D perception?
Occupancy in E2E systems
- Occupancy role in planning: Drive-OccWorld shows a 33% L2 error reduction when planning against predicted occupancy. OccGen applies diffusion to occupancy (+9.5-13.3% over discriminative). But E2E VLA systems (EMMA, DriveTransformer) bypass explicit occupancy. Is occupancy a necessary intermediate representation, or will it be absorbed into learned E2E representations?
- Evaluation metrics: SparseOcc's RayIoU became a community standard. But does mIoU or RayIoU actually correlate with downstream driving quality? No paper has rigorously tested this. The perception→planning alignment gap applies to occupancy metrics as much as detection metrics.
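To make the metric concrete, here is a deliberately simplified, binary-occupancy variant of the ray-casting idea: a ray is a true positive when prediction and ground truth are both hit and their first-hit depths agree within a tolerance. The actual RayIoU of SparseOcc additionally checks semantic class and uses metric depth thresholds; the counting rules below are an illustrative simplification:

```python
import numpy as np

def first_hit(occ_along_ray):
    """Index of the first occupied sample along a ray, or -1 if free."""
    hits = np.flatnonzero(occ_along_ray)
    return hits[0] if hits.size else -1

def ray_iou(pred_rays, gt_rays, depth_tol=2):
    """Simplified RayIoU over binary occupancy, ignoring semantics.
    Each row is one ray's occupancy samples, ordered near to far."""
    tp = fp = fn = 0
    for p, g in zip(pred_rays, gt_rays):
        dp, dg = first_hit(p), first_hit(g)
        if dp >= 0 and dg >= 0 and abs(dp - dg) <= depth_tol:
            tp += 1                     # hit at (roughly) the right depth
        elif dp >= 0 and dg < 0:
            fp += 1                     # predicted a hit where GT is free
        elif dg >= 0:
            fn += 1                     # missed a GT hit (or wrong depth)
            if dp >= 0:
                fp += 1                 # wrong-depth hit penalized both ways
    return tp / max(tp + fp + fn, 1)

pred = np.array([[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]])
gt   = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]])
print(ray_iou(pred, gt, depth_tol=0))  # 1 TP, 1 FP, 1 FN -> 1/3
```

Ray casting rewards getting the first visible surface right, which sidesteps voxel-IoU's bias toward thick, over-inflated predictions; whether either number tracks planning quality remains the open question above.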
Partially answered
- Q1 (Dense vs. sparse vs. Gaussian): GaussianFormer-2's probabilistic superposition and OccMamba's linear SSMs suggest the field is moving away from dense voxels. But BEVNeXt's dense SOTA shows that dense methods aren't dead.
- Q3 (Self-supervised): GaussianOcc and GaussRender demonstrate that fully self-supervised training is viable. The gap to supervised methods is closing but remains significant for fine-grained semantic classes.
- Q7 (View transforms): FB-BEV showed both are complementary. The practical answer is likely deployment-dependent.