
Tags

239 tags across the wiki

paper (114), autonomous-driving (92), foundation-model (55), transformer (53), vla (49), planning (42), robotics (41), computer-vision (36), perception (32), ilya-30 (29), multimodal (29), nlp (26), end-to-end (24), language-modeling (24), llm (17), reasoning (17), imitation-learning (16), 3d-occupancy (15), vlm (15), bev (14), diffusion (13), e2e (12), reinforcement-learning (12), world-model (12), chain-of-thought (10), benchmark (9), scaling (9), cross-embodiment (7), driving (6), gaussian-splatting (6), generative-models (6), image-classification (6), information-theory (6), questions (6), self-supervised (6), sources (6), alignment (5), attention (5), cnn (5), foundation (5), knowledge-distillation (5), language-model (5), prediction (5), simulation (5), evaluation (4), image-generation (4), instruction-tuning (4), mixture-of-experts (4), rnn (4), sequence-to-sequence (4), sparse-representation (4), video-prediction (4), explainability (3), flow-matching (3), lstm (3), map (3), occupancy (3), open-source (3), semantic-segmentation (3), sequence-modeling (3), trajectory-prediction (3), vectorized-representation (3), 3d-detection (2), 3d-perception (2), 3d-reconstruction (2), action-representation (2), autonomy (2), autoregressive (2), bimanual (2), closed-loop (2), complexity-theory (2), dataset (2), deployment (2), distributed-training (2), efficient-inference (2), embodied (2), fine-tuning (2), foundation-models (2), foundational (2), gaussian-representation (2), generation (2), generative (2), human-interaction (2), humanoid (2), manipulation (2), memory-augmented-networks (2), ml (2), multi-camera (2), multilingual (2), object-detection (2), parameter-efficient-fine-tuning (2), prompting (2), real-time (2), regularization (2), relational-reasoning (2), residual-networks (2), rlhf (2), scaling-laws (2), segmentation (2), self-improvement (2), self-supervised-learning (2), state-space (2), systems (2), thermodynamics (2), vision-language-model (2), vision-transformer (2), visual-question-answering (2), zero-shot (2), 3d (1), 3d-scene (1), 3d-semantic-occupancy (1), agenda (1), agentic (1), agi (1), algorithmic-information-theory (1), algorithmic-randomness (1), asynchronous (1), attention-mechanism (1), batch (1), bayesian-inference (1), behavior-forecasting (1), camera-fusion (1), classifier-guidance (1), combinatorial-optimization (1), comparison (1), compression (1), computability (1), concept (1), contrastive-learning (1), control (1), convolutional-neural-networks (1), corpus (1), course (1), data-collection (1), decoupled (1), deep-learning (1), denoising (1), depth-estimation (1), dexterous-manipulation (1), differentiable-programming (1), diffusion-policy (1), diffusion-transformer (1), dilated-convolutions (1), dropout (1), efficient (1), embodied-ai (1), embodiment (1), emergent-abilities (1), end-to-end-learning (1), evaluation-metric (1), few-shot (1), few-shot-learning (1), foundations (1), frontend (1), gaussian (1), gaussian-rendering (1), generalist-agent (1), generalization (1), gpu-training (1), graph-neural-networks (1), grounding (1), grpo (1), hierarchical (1), high-frequency-control (1), hosting (1), ilya (1), image-captioning (1), image-text-retrieval (1), in-context-learning (1), inductive-bias (1), intelligence-measurement (1), interactive-annotation (1), interactive-segmentation (1), knowledge-preservation (1), kolmogorov-complexity (1), lanegcn (1), locomotion (1), machine-translation (1), mamba (1), mdl (1), message-passing (1), minimum-description-length (1), model-parallelism (1), model-predictive-control (1), model-selection (1), modular (1), molecular-property-prediction (1), multi-embodiment (1), multi-task (1), natural-language (1), neural-radiance-fields (1), neuro-symbolic (1), obsidian (1), open-world (1), optimization (1), orchestration (1), parallel-architecture (1), parameter-efficient (1), permutation-invariance (1), personalization (1), physical-ai (1), pipeline-parallelism (1), pointer-mechanism (1), privileged-supervision (1), probabilistic-planning (1), proprioception (1), quantization (1), queue (1), radar (1), recurrent-neural-networks (1), representation-learning (1), scene-understanding (1), search (1), seminal (1), sensor-fusion (1), set-modeling (1), siamese-networks (1), simulator (1), source (1), sparse-models (1), spatial-reasoning (1), speech-recognition (1), survey (1), synthesis (1), taxonomy (1), temporal (1), temporal-modeling (1), thesis (1), tokenization (1), tool-use (1), training (1), uniad (1), unified-stack (1), vanishing-gradients (1), variational-autoencoders (1), video-generation (1), video-understanding (1), visual-traces (1), vit (1)

Pages tagged transformer

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR, 2021. - [Paper](https://arxiv.org/abs/2010.11929) The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…
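
The core move is to cut the image into fixed-size patches and linearly embed each one as a token; a minimal PyTorch sketch (sizes are assumed ViT-Base defaults, not code from the paper):

```python
import torch
import torch.nn as nn

# Minimal ViT-style patch embedding (assumed ViT-Base sizes; not the official code).
# A 224x224 RGB image cut into 16x16 patches yields 14*14 = 196 tokens.
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        n = (img_size // patch) ** 2
        # A strided conv implements "split into patches + linear projection" in one op.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                                # x: (B, 3, 224, 224)
        t = self.proj(x).flatten(2).transpose(1, 2)      # (B, 196, dim)
        t = torch.cat([self.cls.expand(x.size(0), -1, -1), t], dim=1)
        return t + self.pos                              # (B, 197, dim) -> encoder

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```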

Attention Is All You Need
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1706.03762)** Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS, 2017. - [Paper](https://arxiv.org/abs/1706.03762) - [The Annotated Transformer](htt…
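
The paper's central equation, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, fits in a few lines; the shapes below are illustrative, not from the paper:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in the paper."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes (assumed): batch 2, 8 heads, 10 positions, d_k = 64.
q = k = v = torch.randn(2, 8, 10, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 8, 10, 64])
```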

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/1810.04805)** Devlin, Chang, Lee, Toutanova (Google AI Language), NAACL, 2019. - [Paper](https://aclanthology.org/N19-1423/) - [arXiv](https://arxiv.org/abs/1810.04805) BERT (Bi…

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2211.10439)** BEVFormer v2 addresses a critical bottleneck in camera-based 3D perception for autonomous driving: the inability to leverage powerful modern 2D image backbones (e.…

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2203.17270)** Li, Wang, Li, Xie, Sima, Lu, Yu, Dai (Shanghai AI Lab / Nanjing University / HKU), ECCV, 2022. - [Paper](https://arxiv.org/abs/2203.17270) BEVFormer generates a un…

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
paper

📄 [arXiv:2312.01696](https://arxiv.org/abs/2312.01696) BEVNeXt revives dense BEV (bird's-eye-view) frameworks for camera-based 3D object detection, demonstrating that with the right design choices, dense approaches can…

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2501.12948)** DeepSeek-R1 demonstrates that sophisticated reasoning capabilities -- including self-verification, reflection, and extended chain-of-thought -- can emerge in large…

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2305.18290)** Aligning large language models (LLMs) with human preferences has traditionally required reinforcement learning from human feedback (RLHF), a complex multi-stage pi…
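
DPO replaces that pipeline with a single classification-style loss on preference pairs. A minimal sketch of the paper's loss (the per-sequence log-probabilities below are toy values):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from the paper: -log sigmoid(beta * (delta_policy - delta_reference)),
    where each delta is the chosen-minus-rejected sequence log-probability."""
    logits = beta * ((pi_logp_w - pi_logp_l) - (ref_logp_w - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Toy log-probs (assumed values, one per response in the batch).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # scalar; shrinks as the policy prefers the chosen response more than the reference does
```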

DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2308.00398)** DriveAdapter (Jia et al., ICCV 2023) identifies and addresses a fundamental structural problem in end-to-end autonomous driving: the tight coupling between percept…

DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.07656)** DriveTransformer represents a fundamental departure from existing end-to-end autonomous driving approaches. Rather than following sequential perception-prediction-…

DrivoR: Driving on Registers
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2601.05083)** DrivoR is a full-transformer autonomous driving architecture that uses camera-aware register tokens to compress multi-camera Vision Transformer features into a com…

Emerging Properties in Self-Supervised Vision Transformers (DINO)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2104.14294)** DINO (self-DIstillation with NO labels) demonstrates that self-supervised learning with Vision Transformers produces features with remarkable emergent properties t…

FB-BEV: BEV Representation from Forward-Backward View Transformations
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2308.02236)** FB-BEV addresses a fundamental tension in camera-based BEV perception for autonomous driving: **forward projection** methods (like Lift-Splat-Shoot) generate BEV f…

Flamingo: a Visual Language Model for Few-Shot Learning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…
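
The architectural trick is tanh-gated cross-attention from the frozen LM's text tokens into visual features, with the gate initialized at zero so training starts from the unmodified language model; a sketch with assumed dimensions (not DeepMind's code):

```python
import torch
import torch.nn as nn

# Flamingo-style tanh-gated cross-attention sketch (assumed dims).
class GatedXAttn(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init

    def forward(self, text, visual):
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return text + torch.tanh(self.gate) * attended

out = GatedXAttn()(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```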

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2507.06261)** Gemini 2.5 is Google's frontier multimodal model family, built on a sparse Mixture-of-Experts (MoE) Transformer architecture. It represents a major advance in reas…

Gemma 3 Technical Report
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19786)** Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by…

GPT-4 Technical Report
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2303.08774)** GPT-4 is a large-scale multimodal Transformer model developed by OpenAI that accepts both image and text inputs and produces text outputs. It represents a major st…

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

High-Resolution Image Synthesis with Latent Diffusion Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2112.10752)** Latent Diffusion Models (LDMs), the architecture behind Stable Diffusion, address the prohibitive computational cost of applying diffusion models directly in pixel…

Learning Transferable Visual Models From Natural Language Supervision
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…
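
The joint training objective is a symmetric contrastive loss over the in-batch image-text similarity matrix; a minimal sketch (CLIP learns the temperature, which is fixed here for brevity):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over an (image, text) batch, as in CLIP.
    Matching pairs sit on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(logits.size(0))            # diagonal = correct pairing
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```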

Llama 2: Open Foundation and Fine-Tuned Chat Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2307.09288)** Llama 2 (Touvron et al., Meta AI, 2023) addresses the gap between open-source pretrained language models and polished, closed-source "product" LLMs like ChatGPT. W…

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

LoRA: Low-Rank Adaptation of Large Language Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2106.09685)** As pretrained language models grow to hundreds of billions of parameters, full fine-tuning -- updating every weight for each downstream task -- becomes prohibitive…
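
LoRA's answer, sketched below with assumed dimensions: freeze the pretrained weight and train only a low-rank update scaled by alpha/r, so a rank-8 adapter on a 768x768 layer trains ~12K parameters instead of ~590K:

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch (assumed dims): frozen base weight W plus a trainable
# low-rank update (alpha / r) * B @ A, following the paper's init scheme.
class LoRALinear(nn.Module):
    def __init__(self, d_in=768, d_out=768, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                    # pretrained layer stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

print(LoRALinear()(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```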

Mistral 7B
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06825)** Mistral 7B (Jiang et al., Mistral AI, 2023) challenged the prevailing assumption that larger language models are always better by demonstrating that a carefully de…

Mixtral of Experts
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2401.04088)** Mixtral 8x7B, developed by Mistral AI, introduces a Sparse Mixture-of-Experts (SMoE) language model that achieves the quality of much larger dense models at a frac…
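
The sparsity comes from per-token routing: a router scores 8 expert MLPs per token, and only the top 2 run, their outputs mixed by softmaxed router weights. A toy-sized sketch (loop-based for clarity; real implementations batch by expert):

```python
import torch
import torch.nn as nn

# Mixtral-style top-2 expert routing sketch (toy sizes, not the released model).
class Top2MoE(nn.Module):
    def __init__(self, dim=64, hidden=128, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, dim)
        weights, idx = self.router(x).topk(2, dim=-1)       # per-token top-2 experts
        weights = torch.softmax(weights, dim=-1)            # mix over the chosen two
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

print(Top2MoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```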

OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.05316)** Vision-based 3D semantic occupancy prediction aims to predict the semantic class and occupancy status of every voxel in a 3D volume surrounding the ego vehicle, us…

Octo: An Open-Source Generalist Robot Policy
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…

On the Opportunities and Risks of Foundation Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

PaLM: Scaling Language Modeling with Pathways
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2204.02311)** PaLM (Pathways Language Model) is a 540-billion parameter dense decoder-only Transformer language model trained by Google using the Pathways distributed training s…

Prefix-Tuning: Optimizing Continuous Prompts for Generation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2101.00190)** Large pretrained language models like GPT-2 and BART achieve strong performance on generation tasks, but full fine-tuning requires storing a separate copy of all m…
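
Prefix-tuning's alternative: keep the LM frozen and learn a short sequence of per-layer key/value vectors prepended to attention at every layer. A sketch with assumed sizes (the paper additionally reparameterizes the prefix through an MLP during training, omitted here):

```python
import torch
import torch.nn as nn

# Prefix-tuning sketch (assumed sizes): only the prefix tensor trains;
# the language model's own weights stay frozen.
n_layers, n_heads, prefix_len, head_dim = 12, 12, 10, 64

# One trainable (key, value) prefix per layer.
prefix_kv = nn.Parameter(torch.randn(n_layers, 2, n_heads, prefix_len, head_dim) * 0.02)

def with_prefix(layer_idx, k, v):
    """Prepend the learned prefix to this layer's keys/values. k, v: (B, H, T, D)."""
    pk, pv = prefix_kv[layer_idx]                        # (H, P, D) each
    pk = pk.unsqueeze(0).expand(k.size(0), -1, -1, -1)
    pv = pv.unsqueeze(0).expand(v.size(0), -1, -1, -1)
    return torch.cat([pk, k], dim=2), torch.cat([pv, v], dim=2)

k = v = torch.randn(2, n_heads, 20, head_dim)
k2, v2 = with_prefix(0, k, v)
print(k2.shape)  # torch.Size([2, 12, 30, 64]) -- 10 prefix + 20 real positions
```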

QLoRA: Efficient Finetuning of Quantized LLMs
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2305.14314)** Full fine-tuning of large language models requires enormous GPU memory -- a 65B-parameter model in 16-bit precision needs over 780 GB of GPU memory for parameters…
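
The 780 GB figure is consistent with standard 16-bit Adam finetuning at roughly 12 bytes per parameter; the arithmetic below is my accounting of where QLoRA's savings come from, not a breakdown from the paper:

```python
# Full 16-bit Adam finetuning: bf16 weights + bf16 grads + fp32 Adam moments (m, v).
params = 65e9
bytes_per_param = 2 + 2 + 8
print(params * bytes_per_param / 1e9)   # 780.0 -> ~780 GB

# QLoRA's move: freeze the base model in 4-bit NF4 (~0.5 byte/param) and train
# only small LoRA adapters, so the dominant term collapses to the quantized weights.
print(params * 0.5 / 1e9)               # 32.5 -> ~33 GB for the 4-bit base model
```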

Qwen3 Technical Report
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2505.09388)** Qwen3, developed by the Qwen team at Alibaba, represents a major step forward in open-weight language models by offering a comprehensive family spanning both dense…

RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…

RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

RoboVLMs: What Matters in Building Vision-Language-Action Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

RT-1: Robotics Transformer for Real-World Control at Scale
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

RT-H: Action Hierarchies Using Language
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

SAM 2: Segment Anything in Images and Videos
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.00714)** SAM 2 extends the Segment Anything Model (SAM) from static image segmentation to unified promptable visual segmentation across both images and videos. Published by…

Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion, and Aviation
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…

Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2210.11416)** Large language models exhibit strong few-shot capabilities, but their ability to follow instructions and generalize to unseen tasks remains limited without targete…

Segment Anything
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.02643)** Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmen…

SparseOcc: Fully Sparse 3D Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.17118)** 3D occupancy prediction has become a critical perception paradigm for autonomous driving, but existing methods process dense 3D volumes even though over 90% of vox…

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2404.09502)** Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- in…

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2103.14030)** Vision Transformers (ViT) demonstrated that pure transformer architectures could match or exceed CNNs on image classification, but ViT's design introduced two fund…

Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2305.06242)** Think Twice (Jia et al., 2023) addresses a fundamental imbalance in end-to-end autonomous driving: while the community has invested heavily in sophisticated encode…

Toolformer: Language Models Can Teach Themselves to Use Tools
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2302.04761)** Large language models exhibit remarkable in-context learning abilities but paradoxically struggle with tasks that are trivial for simple external tools -- arithmet…

Training Compute-Optimal Large Language Models
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2203.15556)** The Chinchilla paper (Hoffmann et al., DeepMind, 2022) is one of the most consequential papers in the LLM era because it corrected the field's scaling intuition. K…
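
The corrected intuition in one rule of thumb: for a fixed compute budget, parameters and tokens should scale together, at roughly 20 training tokens per parameter. Chinchilla itself instantiates this (using the standard C ≈ 6ND FLOPs approximation):

```python
# Chinchilla's compute-optimal rule of thumb: ~20 training tokens per parameter.
N = 70e9                 # Chinchilla's parameter count
D = 20 * N               # ~1.4e12 tokens, matching the paper's training set
print(f"tokens: {D:.2e}, compute: {6 * N * D:.2e} FLOPs")
# tokens: 1.40e+12, compute: 5.88e+23 FLOPs
```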

Training Language Models to Follow Instructions with Human Feedback
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2203.02155)** Large language models like GPT-3 are trained on vast internet corpora to predict the next token, but this objective is fundamentally misaligned with the goal of fo…

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
source-summary

📄 **[Read on arXiv](https://arxiv.org/abs/2205.15997)** TransFuser (Chitta et al., 2022) is a foundational paper for transformer-based sensor fusion in end-to-end autonomous driving. The key problem it addresses is how…

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2402.13243)** VADv2 by Chen et al. (2024) is the successor to VAD, addressing a fundamental limitation of deterministic planners in autonomous driving: they output a single traj…

Visual Instruction Tuning (LLaVA)
paper

📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…