Tags
239 tags across the wiki
Pages tagged foundation-model
📄 **[Read on arXiv](https://arxiv.org/abs/2403.09631)** 3D-VLA addresses a fundamental limitation of existing vision-language-action models: their reliance on 2D visual representations, which lack the spatial depth unde…
📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR 2021. The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…
📄 **[Read on arXiv](https://arxiv.org/abs/2401.12963)** AutoRT addresses the critical data scarcity problem in robotics by using foundation models not as end-effectors but as intelligent orchestrators of large-scale rob…
📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…
📄 **[Read on arXiv](https://arxiv.org/abs/2501.03575)** The Cosmos World Foundation Model Platform addresses Physical AI's critical challenge: the scarcity of safe, high-quality training data. By providing high-fidelity…
📄 **[Read on arXiv](https://arxiv.org/abs/2501.12948)** DeepSeek-R1 demonstrates that sophisticated reasoning capabilities -- including self-verification, reflection, and extended chain-of-thought -- can emerge in large…
📄 **[Read on arXiv](https://arxiv.org/abs/2305.18290)** Aligning large language models (LLMs) with human preferences has traditionally required reinforcement learning from human feedback (RLHF), a complex multi-stage pi…
📄 **[Read on arXiv](https://arxiv.org/abs/2412.14415)** DriveGPT (Cruise, ICML 2025) is the first work to systematically study scaling laws for autoregressive behavior models in autonomous driving. Drawing inspiration from th…
📄 **[Read on arXiv](https://arxiv.org/abs/2104.14294)** DINO (self-DIstillation with NO labels) demonstrates that self-supervised learning with Vision Transformers produces features with remarkable emergent properties t…
📄 **[Read on arXiv](https://arxiv.org/abs/2410.23262)** EMMA is Waymo's industry-scale demonstration of the "everything as language tokens" paradigm for autonomous driving. A single large multimodal foundation model uni…
📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…
📄 **[Read on arXiv](https://arxiv.org/abs/2507.06261)** Gemini 2.5 is Google's frontier multimodal model family, built on a sparse Mixture-of-Experts (MoE) Transformer architecture. It represents a major advance in reas…
📄 **[Read on arXiv](https://arxiv.org/abs/2503.20020)** Gemini Robotics introduces a family of AI models built on Gemini 2.0 designed to extend advanced multimodal capabilities into physical robotics. The work addresses…
📄 **[Read on arXiv](https://arxiv.org/abs/2503.19786)** Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by…
📄 **[Read on arXiv](https://arxiv.org/abs/2403.09630)** > **Note:** This is the CVPR 2024 Highlight paper on large-scale video prediction for driving, NOT the ECCV 2024 paper wiki/sources/papers/genad-generative-end-to-…
📄 **[Read on arXiv](https://arxiv.org/abs/2303.08774)** GPT-4 is a large-scale multimodal Transformer model developed by OpenAI that accepts both image and text inputs and produces text outputs. It represents a major st…
📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…
📄 **[Read on arXiv](https://arxiv.org/abs/2503.14734)** GR00T N1 addresses the challenge of creating general-purpose humanoid robots through an innovative "data pyramid" approach. Rather than relying solely on expensive…
📄 **[Read on arXiv](https://arxiv.org/abs/2204.06125)** DALL-E 2 (internally called unCLIP) introduces a hierarchical approach to text-conditional image generation that leverages CLIP's joint text-image embedding space…
📄 **[Read on arXiv](https://arxiv.org/abs/2112.10752)** Latent Diffusion Models (LDMs), the architecture behind Stable Diffusion, address the prohibitive computational cost of applying diffusion models directly in pixel…
📄 **[Read on arXiv](https://arxiv.org/abs/2409.20537)** HPT tackles the fundamental challenge of building generalist robot representations that work across heterogeneous embodiments with different sensor configurations,…
📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…
📄 **[Read on arXiv](https://arxiv.org/abs/2307.09288)** Llama 2 (Touvron et al., Meta AI, 2023) addresses the gap between open-source pretrained language models and polished, closed-source "product" LLMs like ChatGPT. W…
📄 **[Read on arXiv](https://arxiv.org/abs/2402.01817)** This paper by Subbarao Kambhampati and colleagues at Arizona State University addresses one of the most important questions in modern AI: can large language models…
📄 **[Read on arXiv](https://arxiv.org/abs/2106.09685)** As pretrained language models grow to hundreds of billions of parameters, full fine-tuning -- updating every weight for each downstream task -- becomes prohibitive…
📄 **[Read on arXiv](https://arxiv.org/abs/2312.00752)** Transformers have dominated sequence modeling since 2017, but their quadratic-complexity self-attention mechanism creates a fundamental bottleneck for long sequenc…
📄 **[Read on arXiv](https://arxiv.org/abs/2310.06825)** Mistral 7B (Jiang et al., Mistral AI, 2023) challenged the prevailing assumption that larger language models are always better by demonstrating that a carefully de…
📄 **[Read on arXiv](https://arxiv.org/abs/2401.04088)** Mixtral 8x7B, developed by Mistral AI, introduces a Sparse Mixture-of-Experts (SMoE) language model that achieves the quality of much larger dense models at a frac…
📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…
📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…
📄 **[Read on arXiv](https://arxiv.org/abs/2204.02311)** PaLM (Pathways Language Model) is a 540-billion parameter dense decoder-only Transformer language model trained by Google using the Pathways distributed training s…
📄 **[Read on arXiv](https://arxiv.org/abs/2504.16054)** pi0.5 is the successor to pi0, developed by Physical Intelligence, and represents the first VLA model capable of performing 10-15 minute long-horizon tasks in previously…
📄 **[Read on arXiv](https://arxiv.org/abs/2410.24164)** pi0 is a vision-language-action flow model developed by Physical Intelligence that represents a foundational step toward general-purpose robot control. The key innovatio…
📄 **[Read on arXiv](https://arxiv.org/abs/2305.14314)** Full fine-tuning of large language models requires enormous GPU memory -- a 65B-parameter model in 16-bit precision needs over 780 GB of GPU memory for parameters…
📄 **[Read on arXiv](https://arxiv.org/abs/2505.09388)** Qwen3, developed by the Qwen team at Alibaba, represents a major step forward in open-weight language models by offering a comprehensive family spanning both dense…
📄 **[Read on arXiv](https://arxiv.org/abs/2410.07864)** RDT-1B (Tsinghua University, ICLR 2025) presents the largest diffusion transformer for bimanual robot manipulation, scaling to 1.2B parameters. Bimanual manipulation --…
📄 **[Read on arXiv](https://arxiv.org/abs/2210.03629)** Large language models had demonstrated two powerful capabilities in isolation: chain-of-thought reasoning for multi-step problem solving, and action generation for…
📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…
📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…
📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…
📄 **[Read on arXiv](https://arxiv.org/abs/2408.00714)** SAM 2 extends the Segment Anything Model (SAM) from static image segmentation to unified promptable visual segmentation across both images and videos. Published by…
📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…
📄 **[Read on arXiv](https://arxiv.org/abs/2210.11416)** Large language models exhibit strong few-shot capabilities, but their ability to follow instructions and generalize to unseen tasks remains limited without targete…
📄 **[Read on arXiv](https://arxiv.org/abs/2304.02643)** Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmen…
📄 **[Read on arXiv](https://arxiv.org/abs/2509.15155)** This Google DeepMind paper addresses a fundamental limitation of Embodied Foundation Models (EFMs): while they demonstrate impressive semantic generalization (unde…
📄 **[Read on arXiv](https://arxiv.org/abs/2103.14030)** Vision Transformers (ViT) demonstrated that pure transformer architectures could match or exceed CNNs on image classification, but ViT's design introduced two fund…
📄 **[Read on arXiv](https://arxiv.org/abs/2302.04761)** Large language models exhibit remarkable in-context learning abilities but paradoxically struggle with tasks that are trivial for simple external tools -- arithmet…
📄 **[Read on arXiv](https://arxiv.org/abs/2203.15556)** The Chinchilla paper (Hoffmann et al., DeepMind, 2022) is one of the most consequential papers in the LLM era because it corrected the field's scaling intuition. K…
📄 **[Read on arXiv](https://arxiv.org/abs/2203.02155)** Large language models like GPT-3 are trained on vast internet corpora to predict the next token, but this objective is fundamentally misaligned with the goal of fo…
📄 **[Read on arXiv](https://arxiv.org/abs/2305.10601)** Language models are typically used in a left-to-right token-generation mode, which limits their ability to explore alternative reasoning paths or backtrack from mi…
📄 **[Read on arXiv](https://arxiv.org/abs/2501.10105)** UniAct addresses a critical challenge in embodied AI: robot action data suffers from severe heterogeneity across platforms, control interfaces, and physical embodime…
📄 **[Read on arXiv](https://arxiv.org/abs/2310.06114)** UniSim addresses a fundamental bottleneck in embodied AI: the lack of high-fidelity, interactive simulators that generalize across domains. Rather than building se…
📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…
📄 **[Read on arXiv](https://arxiv.org/abs/2412.14803)** Video Prediction Policy (VPP) by Hu, Guo et al. (ICML 2025 Spotlight) proposes that video diffusion models (VDMs) are not just generators of future…
📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…