LLM Seminal Papers

This page tracks the canonical LLM and adjacent foundation-model papers that matter for the autonomy side of the wiki.

Foundational surveys and frameworks

On The Opportunities And Risks Of Foundation Models -- Stanford HAI report (2021) that coined "foundation model" and formalized emergence + homogenization as the defining phenomena of the paradigm

Core architecture and scaling

Attention Is All You Need
Language Models are Unsupervised Multitask Learners
BERT
Scaling Laws for Neural Language Models
Training Compute-Optimal Large Language Models
GPT-3
Mixtral of Experts (Sparse MoE)
LLaMA
Qwen3 (dense + MoE family with thinking mode, 36T tokens, 119 languages)

Parameter-efficient fine-tuning

Prefix-Tuning: Optimizing Continuous Prompts for Generation
LoRA: Low-Rank Adaptation of Large Language Models
Adapter methods

Instruction tuning and alignment

Scaling Instruction Finetuned Language Models -- Flan-PaLM/Flan-T5: scaled instruction finetuning to 1,836 tasks + CoT data across T5 and PaLM architectures, 0.2% of pre-training compute for +9.4% held-out improvement, established instruction tuning as standard post-training recipe
InstructGPT
Direct Preference Optimization Your Language Model Is Secretly A Reward Model -- DPO: eliminates RL from preference alignment via closed-form reward reparameterization
RLHF preference-optimization papers
Constitutional AI

Reasoning and search

Chain-of-Thought Prompting
Tree Of Thoughts Deliberate Problem Solving With Large Language Models -- ToT: generalizes CoT into tree-structured search with LM-based state evaluation, 74% on Game of 24 vs. 4% for CoT, NeurIPS 2023
Deepseek R1 Incentivizing Reasoning Capability In Llms Via Reinforcement Learning -- DeepSeek-R1: emergent reasoning from pure RL with rule-based rewards (GRPO), multi-stage pipeline, effective distillation to 1.5B-70B models

Tool use and retrieval

ReAct
Toolformer
Retrieval-Augmented Generation papers when directly useful

Open-weight multimodal model families

Gemma 3 (Google DeepMind, 2025) -- 1B-27B open-weight family with native vision via SigLIP, 128K context via 5:1 local/global attention, knowledge distillation enabling 4B to match prior 27B. Gemma 3 Technical Report

Multimodal bridge

CLIP
DINO (self-supervised ViT features)
BLIP
Flamingo
LLaVA (Visual Instruction Tuning) -- established the dominant open-source recipe for multimodal instruction-following: CLIP encoder + linear projection + LLM, with GPT-assisted instruction data generation
Kosmos-style multimodal models
GPT-4V era papers and technical reports
Gemini 2.5 (sparse MoE multimodal with inference-time reasoning)

What to extract during ingest

architectural contribution,
scaling insight,
training recipe,
alignment method,
relevance to autonomy or VLA systems.

Already seeded in batch 01

Ingested individually

Training Language Models To Follow Instructions With Human Feedback -- InstructGPT: RLHF alignment pipeline (SFT -> RM -> PPO), foundational for instruction-tuned LLMs
Flamingo A Visual Language Model For Few Shot Learning -- Flamingo: few-shot multimodal learning via frozen LM + Perceiver Resampler + gated cross-attention, template for VLM architecture
Blip Bootstrapping Language Image Pre Training For Unified Vision Language Understanding And Generation -- BLIP: unified encoder-decoder for vision-language understanding + generation, CapFilt data bootstrapping, ICML 2022
Llama 2 Open Foundation And Fine Tuned Chat Models -- Llama 2: open-source RLHF-aligned LLM family (7B-70B), detailed alignment pipeline with dual reward models, backbone for driving VLA systems (e.g., AsyncDriver)
Visual Instruction Tuning -- LLaVA: visual instruction tuning via CLIP + linear projection + Vicuna, GPT-assisted data generation, blueprint for open-source multimodal models
Mixtral Of Experts -- Mixtral 8x7B: sparse MoE LLM, top-2 of 8 experts per token, 13B active of 47B total, matches Llama 2 70B quality
Palm Scaling Language Modeling With Pathways -- PaLM: 540B dense Transformer trained via Pathways, SOTA few-shot on 28/29 benchmarks, emergent discontinuous scaling on BIG-bench
Scaling Instruction Finetuned Language Models -- Flan-PaLM/Flan-T5: instruction finetuning at scale (1,836 tasks + CoT), 75.2% MMLU, +9.4% on held-out tasks, architecture-agnostic across T5 and PaLM
Gemini 25 Pushing The Frontier With Advanced Reasoning Multimodality Long Context And Next Generation Agentic Capabilities -- Gemini 2.5: sparse MoE multimodal Transformer with "Thinking" inference-time reasoning, 1M+ token context, AIME 2025 88.0%, agentic capabilities
Deepseek R1 Incentivizing Reasoning Capability In Llms Via Reinforcement Learning -- DeepSeek-R1: reasoning via RL (GRPO) with rule-based rewards, competitive with o1 on math/code, distillation to small models
Qwen3 Technical Report -- Qwen3: dense (0.6B-32B) + MoE (30B-A3B, 235B-A22B) family with unified thinking mode, 36T tokens across 119 languages, four-stage post-training with reasoning RL, Apache 2.0
Tree Of Thoughts Deliberate Problem Solving With Large Language Models -- ToT: tree-structured search over LM-generated thoughts with LM-based evaluation, generalizes CoT, NeurIPS 2023
Gemma 3 Technical Report -- Gemma 3: 1B-27B open-weight multimodal family, 5:1 local/global attention for 128K context, SigLIP vision encoder, knowledge distillation enabling 4B to match prior 27B