Tags

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR, 2021. - [Paper](https://arxiv.org/abs/2010.11929) The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2211.10439)** BEVFormer v2 addresses a critical bottleneck in camera-based 3D perception for autonomous driving: the inability to leverage powerful modern 2D image backbones (e.…

Bevnext Reviving Dense Bev Frameworks For 3D Object Detection

paper

📄 [arXiv:2312.01696](https://arxiv.org/abs/2312.01696) BEVNeXt revives dense BEV (bird's-eye-view) frameworks for camera-based 3D object detection, demonstrating that with the right design choices, dense approaches can…

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…

Covla Comprehensive Vision Language Action Dataset For Autonomous Driving

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.10845)** Autonomous driving systems face the "long tail" problem -- handling countless rare and complex driving scenarios beyond common situations. While traditional rule-b…

CS231n: Deep Learning for Computer Vision

source-summary

📄 **[Course Website](https://cs231n.stanford.edu/)** Li, Karpathy, and Johnson, Stanford University, 2015 (ongoing). - [Course](https://cs231n.stanford.edu/) CS231n is a widely used Stanford deep learning for computer v…

Deep Residual Learning for Image Recognition

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1512.03385)** He, Zhang, Ren, Sun (Microsoft Research), CVPR, 2016. - [Paper](https://arxiv.org/abs/1512.03385) Deep Residual Learning introduces skip connections that add the i…

Diffusion Models Beat GANs on Image Synthesis

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2105.05233)** This paper by Dhariwal and Nichol (OpenAI, 2021) demonstrates that diffusion models can surpass GANs on image synthesis for the first time, achieving state-of-the-…

Emerging Properties in Self-Supervised Vision Transformers (DINO)

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2104.14294)** DINO (self-DIstillation with NO labels) demonstrates that self-supervised learning with Vision Transformers produces features with remarkable emergent properties t…

Exploring Simple Siamese Representation Learning

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2011.10566)** SimSiam (Simple Siamese) demonstrates that self-supervised visual representation learning can be dramatically simplified while maintaining competitive performance.…

Fb Bev Bev Representation From Forward Backward View Transformations

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2308.02236)** FB-BEV addresses a fundamental tension in camera-based BEV perception for autonomous driving: **forward projection** methods (like Lift-Splat-Shoot) generate BEV f…

Flamingo: a Visual Language Model for Few-Shot Learning

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…

FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12058)** Occupancy prediction has emerged as a powerful perception paradigm for autonomous driving, predicting per-voxel semantic labels in 3D space to handle arbitrary obj…

GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2407.14108)** Bird's-eye view (BEV) semantic segmentation from multi-camera images is a core perception task in autonomous driving, but existing image-to-BEV transformation meth…

Genad Generalized Predictive Model For Autonomous Driving

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09630)** > **Note:** This is the CVPR 2024 Highlight paper on large-scale video prediction for driving, NOT the ECCV 2024 paper wiki/sources/papers/genad-generative-end-to-…

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.06125)** DALL-E 2 (internally called unCLIP) introduces a hierarchical approach to text-conditional image generation that leverages CLIP's joint text-image embedding space…

High-Resolution Image Synthesis with Latent Diffusion Models

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2112.10752)** Latent Diffusion Models (LDMs), the architecture behind Stable Diffusion, address the prohibitive computational cost of applying diffusion models directly in pixel…

Identity Mappings in Deep Residual Networks

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1603.05027)** This paper, a follow-up to the original ResNet work, provides both theoretical analysis and empirical evidence that the arrangement of operations within residual b…

ImageNet Classification with Deep Convolutional Neural Networks

source-summary

📄 **[Read Paper](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)** AlexNet, as this paper's architecture came to be known, is a deep convolutional neural network trained on GPUs th…

Learning Transferable Visual Models From Natural Language Supervision

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…

Multi Scale Context Aggregation By Dilated Convolutions

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1511.07122)** This paper introduced dilated (atrous) convolutions as a principled alternative to the downsample-then-upsample paradigm for dense prediction tasks. By inserting g…

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.05316)** Vision-based 3D semantic occupancy prediction aims to predict the semantic class and occupancy status of every voxel in a 3D volume surrounding the ego vehicle, us…

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.15014)** OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc,…

On The Opportunities And Risks Of Foundation Models

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

SAM 2: Segment Anything in Images and Videos

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.00714)** SAM 2 extends the Segment Anything Model (SAM) from static image segmentation to unified promptable visual segmentation across both images and videos. Published by…

Segment Anything

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.02643)** Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmen…

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.12754)** SelfOcc (Huang et al., Tsinghua University, CVPR 2024) introduces the first self-supervised framework for vision-based 3D occupancy prediction that works with mult…

SparseOcc: Fully Sparse 3D Occupancy Prediction

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.17118)** 3D occupancy prediction has become a critical perception paradigm for autonomous driving, but existing methods process dense 3D volumes even though over 90% of vox…

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.09502)** Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- in…

SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2303.09551)** SurroundOcc addresses the problem of dense 3D semantic occupancy prediction from multi-camera images for autonomous driving. Unlike 3D object detection, which repr…

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2103.14030)** Vision Transformers (ViT) demonstrated that pure transformer architectures could match or exceed CNNs on image classification, but ViT's design introduced two fund…

Unisim Learning Interactive Real World Simulators

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

Vista A Generalizable Driving World Model With High Fidelity And Versatile Controllability

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17398)** Vista (NeurIPS 2024) is a generalizable driving world model that achieves high-fidelity video prediction at 10 Hz and 576x1024 resolution with versatile multi-moda…

Visual Instruction Tuning (LLaVA)

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…

YOLOv10: Real-Time End-to-End Object Detection

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.14458)** Real-time object detection is critical infrastructure for autonomous driving, robotics, and augmented reality, yet the dominant YOLO family has long relied on non-…

Pages tagged computer-vision