Tags

239 tags across the wiki

paper 114 autonomous-driving 92 foundation-model 55 transformer 53 vla 49 planning 42 robotics 41 computer-vision 36 perception 32 ilya-30 29 multimodal 29 nlp 26 end-to-end 24 language-modeling 24 llm 17 reasoning 17 imitation-learning 16 3d-occupancy 15 vlm 15 bev 14 diffusion 13 e2e 12 reinforcement-learning 12 world-model 12 chain-of-thought 10 benchmark 9 scaling 9 cross-embodiment 7 driving 6 gaussian-splatting 6 generative-models 6 image-classification 6 information-theory 6 questions 6 self-supervised 6 sources 6 alignment 5 attention 5 cnn 5 foundation 5 knowledge-distillation 5 language-model 5 prediction 5 simulation 5 evaluation 4 image-generation 4 instruction-tuning 4 mixture-of-experts 4 rnn 4 sequence-to-sequence 4 sparse-representation 4 video-prediction 4 explainability 3 flow-matching 3 lstm 3 map 3 occupancy 3 open-source 3 semantic-segmentation 3 sequence-modeling 3 trajectory-prediction 3 vectorized-representation 3 3d-detection 2 3d-perception 2 3d-reconstruction 2 action-representation 2 autonomy 2 autoregressive 2 bimanual 2 closed-loop 2 complexity-theory 2 dataset 2 deployment 2 distributed-training 2 efficient-inference 2 embodied 2 fine-tuning 2 foundation-models 2 foundational 2 gaussian-representation 2 generation 2 generative 2 human-interaction 2 humanoid 2 manipulation 2 memory-augmented-networks 2 ml 2 multi-camera 2 multilingual 2 object-detection 2 parameter-efficient-fine-tuning 2 prompting 2 real-time 2 regularization 2 relational-reasoning 2 residual-networks 2 rlhf 2 scaling-laws 2 segmentation 2 self-improvement 2 self-supervised-learning 2 state-space 2 systems 2 thermodynamics 2 vision-language-model 2 vision-transformer 2 visual-question-answering 2 zero-shot 2 3d 1 3d-scene 1 3d-semantic-occupancy 1 agenda 1 agentic 1 agi 1 algorithmic-information-theory 1 algorithmic-randomness 1 asynchronous 1 attention-mechanism 1 batch 1 bayesian-inference 1 behavior-forecasting 1 camera-fusion 1 classifier-guidance 1 
combinatorial-optimization 1 comparison 1 compression 1 computability 1 concept 1 contrastive-learning 1 control 1 convolutional-neural-networks 1 corpus 1 course 1 data-collection 1 decoupled 1 deep-learning 1 denoising 1 depth-estimation 1 dexterous-manipulation 1 differentiable-programming 1 diffusion-policy 1 diffusion-transformer 1 dilated-convolutions 1 dropout 1 efficient 1 embodied-ai 1 embodiment 1 emergent-abilities 1 end-to-end-learning 1 evaluation-metric 1 few-shot 1 few-shot-learning 1 foundations 1 frontend 1 gaussian 1 gaussian-rendering 1 generalist-agent 1 generalization 1 gpu-training 1 graph-neural-networks 1 grounding 1 grpo 1 hierarchical 1 high-frequency-control 1 hosting 1 ilya 1 image-captioning 1 image-text-retrieval 1 in-context-learning 1 inductive-bias 1 intelligence-measurement 1 interactive-annotation 1 interactive-segmentation 1 knowledge-preservation 1 kolmogorov-complexity 1 lanegcn 1 locomotion 1 machine-translation 1 mamba 1 mdl 1 message-passing 1 minimum-description-length 1 model-parallelism 1 model-predictive-control 1 model-selection 1 modular 1 molecular-property-prediction 1 multi-embodiment 1 multi-task 1 natural-language 1 neural-radiance-fields 1 neuro-symbolic 1 obsidian 1 open-world 1 optimization 1 orchestration 1 parallel-architecture 1 parameter-efficient 1 permutation-invariance 1 personalization 1 physical-ai 1 pipeline-parallelism 1 pointer-mechanism 1 privileged-supervision 1 probabilistic-planning 1 proprioception 1 quantization 1 queue 1 radar 1 recurrent-neural-networks 1 representation-learning 1 scene-understanding 1 search 1 seminal 1 sensor-fusion 1 set-modeling 1 siamese-networks 1 simulator 1 source 1 sparse-models 1 spatial-reasoning 1 speech-recognition 1 survey 1 synthesis 1 taxonomy 1 temporal 1 temporal-modeling 1 thesis 1 tokenization 1 tool-use 1 training 1 uniad 1 unified-stack 1 vanishing-gradients 1 variational-autoencoders 1 video-generation 1 video-understanding 1 visual-traces 1 vit 1

Pages tagged perception

BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.19694)** BEVDiffuser addresses a fundamental but under-explored problem in BEV-based perception: the inherent noise in BEV feature maps caused by sensor limitations and the l…

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2211.10439)** BEVFormer v2 addresses a critical bottleneck in camera-based 3D perception for autonomous driving: the inability to leverage powerful modern 2D image backbones (e.…

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2203.17270)** Li, Wang, Li, Xie, Sima, Lu, Yu, Dai (Shanghai AI Lab / Nanjing University / HKU), ECCV, 2022. - [Paper](https://arxiv.org/abs/2203.17270) BEVFormer generates a un…

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
paper

📄 [arXiv:2312.01696](https://arxiv.org/abs/2312.01696) BEVNeXt revives dense BEV (bird's-eye-view) frameworks for camera-based 3D object detection, demonstrating that with the right design choices, dense approaches can…

DrivoR: Driving on Registers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2601.05083)** DrivoR is a full-transformer autonomous driving architecture that uses camera-aware register tokens to compress multi-camera Vision Transformer features into a com…

EMMA: End-to-End Multimodal Model for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…

FB-BEV: BEV Representation from Forward-Backward View Transformations
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2308.02236)** FB-BEV addresses a fundamental tension in camera-based BEV perception for autonomous driving: **forward projection** methods (like Lift-Splat-Shoot) generate BEV f…

FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12058)** Occupancy prediction has emerged as a powerful perception paradigm for autonomous driving, predicting per-voxel semantic labels in 3D space to handle arbitrary obj…

GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2407.14108)** Bird's-eye view (BEV) semantic segmentation from multi-camera images is a core perception task in autonomous driving, but existing image-to-BEV transformation meth…

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.17288)** GaussianFlowOcc (ICCV 2025) introduces a transformative approach to 3D semantic occupancy estimation for autonomous driving by replacing traditional…

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17429)** GaussianFormer introduces a fundamentally different scene representation for 3D semantic occupancy prediction: instead of dense voxel grids, scenes are modeled as…

GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2412.04384)** GaussianFormer-2 addresses 3D semantic occupancy prediction for vision-centric autonomous driving by rethinking how 3D Gaussians represent occupied space. The origin…

GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2504.01957)** Bird's-Eye View (BEV) perception faces a fundamental trade-off between accuracy and computational efficiency. High-performing 3D projection methods like BEVFormer…

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11447)** GaussianOcc by Gan et al. (University of Tokyo / RIKEN / South China University of Technology / SIAT-CAS) is a systematic method that applies Gaussi…

GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2412.10373)** GaussianWorld introduces a world model paradigm for 3D occupancy prediction that explicitly models scene evolution over time, rather than treating frames as indepe…

GaussRender: Learning 3D Occupancy with Gaussian Rendering
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2502.05040)** GaussRender by Chambon et al. (Valeo AI / Sorbonne, ICCV 2025) introduces a plug-and-play training-time module that improves 3D occupancy prediction…

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2412.13193)** GaussTR is a Gaussian-based Transformer framework that achieves zero-shot semantic occupancy prediction without any 3D annotations. The key idea is to combine sparse…

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2501.14729)** HERMES tackles a fundamental limitation in autonomous driving: existing systems treat 3D scene understanding and future scene generation as separate problems. Driv…

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2008.05711)** Lift, Splat, Shoot (LSS) introduced a differentiable pipeline for transforming multi-camera images into a unified bird's-eye view (BEV) representation without requ…

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.05316)** Vision-based 3D semantic occupancy prediction aims to predict the semantic class and occupancy status of every voxel in a 3D volume surrounding the ego vehicle, us…

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.15014)** OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc,…

OccMamba: Semantic Occupancy Prediction with State Space Models
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2408.09859)** OccMamba is the first Mamba-based network for semantic occupancy prediction, replacing transformer architectures' quadratic complexity with Mamba's linear complexity…

Open Questions: BEV Perception & 3D Occupancy
query

Stream-specific open questions for the BEV perception and 3D occupancy pillar. See wiki/queries/open-questions for the full tree across all streams. 1. **Dense vs. sparse vs. Gaussian:** BEVNeXt revived dense BEV to 64.…

Perception
concept

Perception converts raw sensor data into structured scene representations for downstream prediction and planning. In autonomous driving, perception encompasses detection, tracking, segmentation, occupancy estimation, la…

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2412.12725)** RaCFormer by Chu et al. (USTC, CVPR 2025) addresses a fundamental problem in radar-camera fusion for 3D object detection: the image-to-BEV transform…

S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2505.24139)** S4-Driver is a self-supervised framework that adapts Multimodal Large Language Models (MLLMs) for autonomous vehicle motion planning. The system processes multi-vi…

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12754)** SelfOcc (Huang et al., Tsinghua University, CVPR 2024) introduces the first self-supervised framework for vision-based 3D occupancy prediction that works with mult…

SparseOcc: Fully Sparse 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.17118)** 3D occupancy prediction has become a critical perception paradigm for autonomous driving, but existing methods process dense 3D volumes even though over 90% of vox…

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.09502)** Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- in…

SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2303.09551)** SurroundOcc addresses the problem of dense 3D semantic occupancy prediction from multi-camera images for autonomous driving. Unlike 3D object detection, which repr…

Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2305.06242)** Think Twice (Jia et al., 2023) addresses a fundamental imbalance in end-to-end autonomous driving: while the community has invested heavily in sophisticated encode…

YOLOv10: Real-Time End-to-End Object Detection
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.14458)** Real-time object detection is critical infrastructure for autonomous driving, robotics, and augmented reality, yet the dominant YOLO family has long relied on non-…