Tags

239 tags across the wiki

Pages tagged diffusion

3D-VLA: A 3D Vision-Language-Action Generative World Model

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…

BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

source-summary

**[Read on arXiv](https://arxiv.org/abs/2502.19694)** BEVDiffuser addresses a fundamental but under-explored problem in BEV-based perception: the inherent noise in BEV feature maps caused by sensor limitations and the l…

Denoising Diffusion Probabilistic Models

source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2006.11239)** Ho, Jain, and Abbeel, NeurIPS 2020. Denoising Diffusion Probabilistic Models (DDPM) demonstrates that high-quality ima…

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

source-summary

[Read on arXiv](https://arxiv.org/abs/2502.05855) DexVLA introduces a paradigm shift in VLA architecture by scaling the action generation component to 1 billion parameters using a diffusion-based expert, rather than foc…

Diffusion Models Beat GANs on Image Synthesis

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2105.05233)** This paper by Dhariwal and Nichol (OpenAI, 2021) demonstrates for the first time that diffusion models can surpass GANs on image synthesis, achieving state-of-the-…

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

source-summary

[Read on arXiv](https://arxiv.org/abs/2411.15139) DiffusionDrive (HUST/Horizon Robotics, CVPR 2025 Highlight) proposes a truncated diffusion model for end-to-end autonomous driving that achieves real-time inference whil…

GenAD: Generalized Predictive Model for Autonomous Driving

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.09630)** **Note:** This is the CVPR 2024 Highlight paper on large-scale video prediction for driving, NOT the ECCV 2024 paper wiki/sources/papers/genad-generative-end-to-…

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.06125)** DALL-E 2 (internally called unCLIP) introduces a hierarchical approach to text-conditional image generation that leverages CLIP's joint text-image embedding space…

High-Resolution Image Synthesis with Latent Diffusion Models

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2112.10752)** Latent Diffusion Models (LDMs), the architecture behind Stable Diffusion, address the prohibitive computational cost of applying diffusion models directly in pixel…

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.15014)** OccGen reframes 3D semantic occupancy prediction as a conditional generative problem rather than a purely discriminative one. Prior occupancy methods (SurroundOcc,…

RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation

source-summary

[Read on arXiv](https://arxiv.org/abs/2410.07864) RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation --…

UniSim: Learning Interactive Real-World Simulators

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…

Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.17398)** Vista (NeurIPS 2024) is a generalizable driving world model that achieves high-fidelity video prediction at 10 Hz and 576x1024 resolution with versatile multi-moda…