Papers

53 paper summaries tagged transformer

DrivoR: Driving on Registers
2026 arXiv 3

📄 **[Read on arXiv](https://arxiv.org/abs/2601.05083)** DrivoR is a full-transformer autonomous driving architecture that uses camera-aware register tokens to compress multi-camera Vision Transformer features into a com…

paper autonomous-driving e2e perception +3
Qwen3 Technical Report
2025 arXiv 3706

📄 **[Read on arXiv](https://arxiv.org/abs/2505.09388)** Qwen3, developed by the Qwen team at Alibaba, represents a major step forward in open-weight language models by offering a comprehensive family spanning both dense…

nlp language-modeling transformer mixture-of-experts +4
Gemma 3 Technical Report
2025 arXiv 1120

📄 **[Read on arXiv](https://arxiv.org/abs/2503.19786)** Gemma 3 is a family of open-weight language models from Google DeepMind spanning 1B, 4B, 12B, and 27B parameters. It represents a significant leap over Gemma 2 by…

transformer language-modeling multimodal foundation-model +4
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
2025 arXiv 1943

📄 **[Read on arXiv](https://arxiv.org/abs/2507.06261)** Gemini 2.5 is Google's frontier multimodal model family, built on a sparse Mixture-of-Experts (MoE) Transformer architecture. It represents a major advance in reas…

nlp multimodal foundation-model transformer +5
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
2025 ICLR 2025 91

📄 **[Read on arXiv](https://arxiv.org/abs/2503.07656)** DriveTransformer represents a fundamental departure from existing end-to-end autonomous driving approaches. Rather than following sequential perception-prediction-…

autonomous-driving transformer end-to-end planning
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
2025 arXiv 1920

📄 **[Read on arXiv](https://arxiv.org/abs/2501.12948)** DeepSeek-R1 demonstrates that sophisticated reasoning capabilities -- including self-verification, reflection, and extended chain-of-thought -- can emerge in large…

nlp reinforcement-learning language-modeling reasoning +4
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
2024 arXiv 140

📄 **[Read on arXiv](https://arxiv.org/abs/2402.13243)** VADv2 by Chen et al. (2024) is the successor to VAD, addressing a fundamental limitation of deterministic planners in autonomous driving: they output a single traj…

autonomous-driving end-to-end planning vectorized-representation +2
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
2024 ICLR 2024 150

📄 **[Read on arXiv](https://arxiv.org/abs/2312.13139)** GR-1 addresses a fundamental bottleneck in robot learning: the scarcity of diverse, high-quality robot demonstration data. The key insight is that robot trajectori…

robotics transformer imitation-learning multimodal +3
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
2024 CVPR 50

📄 **[Read on arXiv](https://arxiv.org/abs/2404.09502)** Dense 3D occupancy prediction from multi-view cameras has become a key perception task for autonomous driving, but most methods process the full voxel volume -- in…

autonomous-driving perception 3d-occupancy computer-vision +2
SparseOcc: Fully Sparse 3D Occupancy Prediction
2024 ECCV 80

📄 **[Read on arXiv](https://arxiv.org/abs/2312.17118)** 3D occupancy prediction has become a critical perception paradigm for autonomous driving, but existing methods process dense 3D volumes even though over 90% of vox…

autonomous-driving perception 3d-occupancy sparse-representation +3
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion, and Aviation
2024 CoRL 2024 Oral 100

📄 **[Read on arXiv](https://arxiv.org/abs/2408.11812)** CrossFormer addresses a fundamental limitation in robot learning: the requirement for specialized policies for each robotic platform. Traditional approaches train…

robotics transformer cross-embodiment imitation-learning +2
SAM 2: Segment Anything in Images and Videos
2024 arXiv (ECCV 2024 submission) 3925

📄 **[Read on arXiv](https://arxiv.org/abs/2408.00714)** SAM 2 extends the Segment Anything Model (SAM) from static image segmentation to unified promptable visual segmentation across both images and videos. Published by…

computer-vision segmentation foundation-model transformer +2
RT-H: Action Hierarchies Using Language
2024 RSS 2024

📄 **[Read on arXiv](https://arxiv.org/abs/2403.01823)** RT-H (Robot Transformer with Action Hierarchies) introduces a hierarchical approach to multi-task robot control that uses natural language as an intermediate repre…

robotics vla transformer imitation-learning +2
RoboVLMs: What Matters in Building Vision-Language-Action Models
2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2412.14058)** RoboVLMs is a large-scale empirical study from Tsinghua University, ByteDance Research, and collaborators that systematically investigates the design principles fo…

robotics vla transformer multimodal +2
RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
2024 ICLR 2024 100

📄 **[Read on arXiv](https://arxiv.org/abs/2311.01378)** RoboFlamingo addresses the question of whether publicly available vision-language models (VLMs) can serve as effective backbones for robot imitation learning, with…

robotics vla imitation-learning multimodal +2
Octo: An Open-Source Generalist Robot Policy
2024 RSS 400

📄 **[Read on arXiv](https://arxiv.org/abs/2405.12213)** Octo is a transformer-based generalist robot policy trained on 800,000 robot trajectories from the Open X-Embodiment dataset, spanning 25 diverse datasets and mult…

robotics transformer foundation-model open-source +3
Mixtral of Experts
2024 arXiv 3089

📄 **[Read on arXiv](https://arxiv.org/abs/2401.04088)** Mixtral 8x7B, developed by Mistral AI, introduces a Sparse Mixture-of-Experts (SMoE) language model that achieves the quality of much larger dense models at a frac…

nlp language-modeling transformer mixture-of-experts +2
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
2024 CoRL 2024

📄 **[Read on arXiv](https://arxiv.org/abs/2406.11815)** LLARVA addresses the "embodiment gap" between large multimodal models (LMMs) and robotic control. While VLMs trained on internet-scale data excel at visual underst…

robotics vla multimodal imitation-learning +3
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
2024 arXiv 50

📄 **[Read on arXiv](https://arxiv.org/abs/2410.06158)** GR-2 is a generalist robot manipulation agent from ByteDance Research that leverages large-scale video-language pretraining to build a world model for robotic cont…

robotics vla transformer foundation-model +4
BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
2024 CVPR 2024 80

📄 **[Read on arXiv](https://arxiv.org/abs/2312.01696)** BEVNeXt revives dense BEV (bird's-eye-view) frameworks for camera-based 3D object detection, demonstrating that with the right design choices, dense approaches can…

autonomous-driving perception bev transformer +2
Visual Instruction Tuning (LLaVA)
2023 NeurIPS 2023 13533

📄 **[Read on arXiv](https://arxiv.org/abs/2304.08485)** Large language models transformed NLP through instruction tuning -- training on diverse instruction-response pairs so models follow human intent across tasks. Visu…

multimodal vision-language-model instruction-tuning transformer +3
Toolformer: Language Models Can Teach Themselves to Use Tools
2023 NeurIPS 2023 3994

📄 **[Read on arXiv](https://arxiv.org/abs/2302.04761)** Large language models exhibit remarkable in-context learning abilities but paradoxically struggle with tasks that are trivial for simple external tools -- arithmet…

nlp language-modeling tool-use transformer +2
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
2023 CVPR 2023 180

📄 **[Read on arXiv](https://arxiv.org/abs/2305.06242)** Think Twice (Jia et al., 2023) addresses a fundamental imbalance in end-to-end autonomous driving: while the community has invested heavily in sophisticated encode…

autonomous-driving end-to-end planning imitation-learning +2
Segment Anything
2023 ICCV 2023 19692

📄 **[Read on arXiv](https://arxiv.org/abs/2304.02643)** Segment Anything introduces a foundation model for image segmentation -- the Segment Anything Model (SAM) -- together with a new task definition (promptable segmen…

computer-vision foundation-model segmentation transformer +2
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
2023 TMLR 2023

📄 **[Read on arXiv](https://arxiv.org/abs/2306.11706)** RoboCat, developed by Google DeepMind, is a multi-embodiment, multi-task generalist agent for robotic manipulation built on a transformer-based architecture. The p…

robotics transformer imitation-learning multimodal +2
QLoRA: Efficient Finetuning of Quantized LLMs
2023 NeurIPS 2023 5975

📄 **[Read on arXiv](https://arxiv.org/abs/2305.14314)** Full fine-tuning of large language models requires enormous GPU memory -- a 65B-parameter model in 16-bit precision needs over 780 GB of GPU memory for parameters…

nlp transformer language-modeling foundation-model +2
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
2023 ICCV 280

📄 **[Read on arXiv](https://arxiv.org/abs/2304.05316)** Vision-based 3D semantic occupancy prediction aims to predict the semantic class and occupancy status of every voxel in a 3D volume surrounding the ego vehicle, us…

autonomous-driving perception transformer computer-vision +3
Mistral 7B
2023 arXiv 4052

📄 **[Read on arXiv](https://arxiv.org/abs/2310.06825)** Mistral 7B (Jiang et al., Mistral AI, 2023) challenged the prevailing assumption that larger language models are always better by demonstrating that a carefully de…

nlp language-modeling transformer foundation-model +1
Llama 2: Open Foundation and Fine-Tuned Chat Models
2023 arXiv 22411

📄 **[Read on arXiv](https://arxiv.org/abs/2307.09288)** Llama 2 (Touvron et al., Meta AI, 2023) addresses the gap between open-source pretrained language models and polished, closed-source "product" LLMs like ChatGPT. W…

llm transformer foundation-model language-modeling +2
GPT-4 Technical Report
2023 arXiv 26297

📄 **[Read on arXiv](https://arxiv.org/abs/2303.08774)** GPT-4 is a large-scale multimodal Transformer model developed by OpenAI that accepts both image and text inputs and produces text outputs. It represents a major st…

nlp language-modeling transformer foundation-model +3
FB-BEV: BEV Representation from Forward-Backward View Transformations
2023 ICCV 150

📄 **[Read on arXiv](https://arxiv.org/abs/2308.02236)** FB-BEV addresses a fundamental tension in camera-based BEV perception for autonomous driving: **forward projection** methods (like Lift-Splat-Shoot) generate BEV f…

autonomous-driving perception bev transformer +1
DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
2023 ICCV 2023

📄 **[Read on arXiv](https://arxiv.org/abs/2308.00398)** DriveAdapter (Jia et al., ICCV 2023) identifies and addresses a fundamental structural problem in end-to-end autonomous driving: the tight coupling between percept…

autonomous-driving end-to-end planning imitation-learning +2
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
2023 NeurIPS 2023 8520

📄 **[Read on arXiv](https://arxiv.org/abs/2305.18290)** Aligning large language models (LLMs) with human preferences has traditionally required reinforcement learning from human feedback (RLHF), a complex multi-stage pi…

nlp reinforcement-learning language-modeling alignment +2
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
2023 CVPR 2023

📄 **[Read on arXiv](https://arxiv.org/abs/2211.10439)** BEVFormer v2 addresses a critical bottleneck in camera-based 3D perception for autonomous driving: the inability to leverage powerful modern 2D image backbones (e.…

autonomous-driving perception bev transformer +2
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
2022 IEEE TPAMI 2023 600

📄 **[Read on arXiv](https://arxiv.org/abs/2205.15997)** TransFuser (Chitta et al., 2022) is a foundational paper for transformer-based sensor fusion in end-to-end autonomous driving. The key problem it addresses is how…

paper autonomous-driving e2e transformer +1
Training Language Models to Follow Instructions with Human Feedback
2022 NeurIPS 2022 24355

📄 **[Read on arXiv](https://arxiv.org/abs/2203.02155)** Large language models like GPT-3 are trained on vast internet corpora to predict the next token, but this objective is fundamentally misaligned with the goal of fo…

nlp reinforcement-learning language-modeling alignment +2
Training Compute-Optimal Large Language Models
2022 arXiv 4116

📄 **[Read on arXiv](https://arxiv.org/abs/2203.15556)** The Chinchilla paper (Hoffmann et al., DeepMind, 2022) is one of the most consequential papers in the LLM era because it corrected the field's scaling intuition. K…

nlp language-modeling transformer foundation-model +1
Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)
2022 JMLR 2024 3987

📄 **[Read on arXiv](https://arxiv.org/abs/2210.11416)** Large language models exhibit strong few-shot capabilities, but their ability to follow instructions and generalize to unseen tasks remains limited without targete…

nlp transformer instruction-tuning chain-of-thought +4
RT-1: Robotics Transformer for Real-World Control at Scale
2022 arXiv 2019

📄 **[Read on arXiv](https://arxiv.org/abs/2212.06817)** RT-1 is a landmark paper from Google/Everyday Robots demonstrating that a 35M-parameter Transformer model, trained on a large and diverse dataset of real-robot dem…

paper robotics vla transformer
PaLM: Scaling Language Modeling with Pathways
2022 JMLR 9058

📄 **[Read on arXiv](https://arxiv.org/abs/2204.02311)** PaLM (Pathways Language Model) is a 540-billion parameter dense decoder-only Transformer language model trained by Google using the Pathways distributed training s…

transformer language-modeling scaling foundation-model +3
LoRA: Low-Rank Adaptation of Large Language Models
2022 ICLR 2022 29175

📄 **[Read on arXiv](https://arxiv.org/abs/2106.09685)** As pretrained language models grow to hundreds of billions of parameters, full fine-tuning -- updating every weight for each downstream task -- becomes prohibitive…

nlp transformer language-modeling foundation-model +1
High-Resolution Image Synthesis with Latent Diffusion Models
2022 CVPR 2022 31987

📄 **[Read on arXiv](https://arxiv.org/abs/2112.10752)** Latent Diffusion Models (LDMs), the architecture behind Stable Diffusion, address the prohibitive computational cost of applying diffusion models directly in pixel…

diffusion generative-models computer-vision image-generation +2
Flamingo: a Visual Language Model for Few-Shot Learning
2022 NeurIPS 2022 7824

📄 **[Read on arXiv](https://arxiv.org/abs/2204.14198)** Flamingo, developed by DeepMind, is a family of visual language models that extend the in-context few-shot learning ability of large language models to multimodal…

multimodal foundation-model computer-vision nlp +3
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
2022 ICML 2022 8650

📄 **[Read on arXiv](https://arxiv.org/abs/2201.12086)** Vision-language pre-training (VLP) methods before BLIP suffered from two fundamental limitations: (1) model architectures were typically optimized for either under…

multimodal foundation-model computer-vision nlp +4
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
2022 ECCV 1826

📄 **[Read on arXiv](https://arxiv.org/abs/2203.17270)** Li, Wang, Li, Xie, Sima, Lu, Yu, Dai (Shanghai AI Lab / Nanjing University / HKU), ECCV, 2022. BEVFormer generates a un…

paper autonomous-driving perception bev +1
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
2021 ICCV 2021 44596

📄 **[Read on arXiv](https://arxiv.org/abs/2103.14030)** Vision Transformers (ViT) demonstrated that pure transformer architectures could match or exceed CNNs on image classification, but ViT's design introduced two fund…

computer-vision transformer image-classification object-detection +3
Prefix-Tuning: Optimizing Continuous Prompts for Generation
2021 ACL 2021 6753

📄 **[Read on arXiv](https://arxiv.org/abs/2101.00190)** Large pretrained language models like GPT-2 and BART achieve strong performance on generation tasks, but full fine-tuning requires storing a separate copy of all m…

nlp transformer parameter-efficient language-modeling +1
On the Opportunities and Risks of Foundation Models
2021 arXiv (Stanford HAI) 6057

📄 **[Read on arXiv](https://arxiv.org/abs/2108.07258)** "On the Opportunities and Risks of Foundation Models" is a comprehensive 200+ page report from over 100 researchers at Stanford's Center for Research on Foundation…

foundation-model nlp computer-vision robotics +3
Learning Transferable Visual Models From Natural Language Supervision
2021 ICML 2021 57987

📄 **[Read on arXiv](https://arxiv.org/abs/2103.00020)** CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision by training an image encoder and a text encoder join…

computer-vision multimodal foundation-model transformer +3
Emerging Properties in Self-Supervised Vision Transformers (DINO)
2021 ICCV 2021 10798

📄 **[Read on arXiv](https://arxiv.org/abs/2104.14294)** DINO (self-DIstillation with NO labels) demonstrates that self-supervised learning with Vision Transformers produces features with remarkable emergent properties t…

computer-vision self-supervised-learning transformer vision-transformer +3
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2021 ICLR 2021 91128

📄 **[Read on arXiv](https://arxiv.org/abs/2010.11929)** Dosovitskiy et al., ICLR, 2021. The Vision Transformer (ViT) demonstrates that a pure Transformer applied to sequences…

ilya-30 vision-transformer computer-vision transformer +2
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019 NAACL 112487

📄 **[Read on arXiv](https://arxiv.org/abs/1810.04805)** Devlin, Chang, Lee, Toutanova (Google AI Language), NAACL, 2019. [Paper (ACL Anthology)](https://aclanthology.org/N19-1423/). BERT (Bi…

paper llm transformer foundation
Attention Is All You Need
2017 NeurIPS 171783

📄 **[Read on arXiv](https://arxiv.org/abs/1706.03762)** Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS, 2017. [The Annotated Transformer](htt…

paper ilya-30 llm transformer +3