LLM Seminal Papers

This page tracks the canonical LLM and adjacent foundation-model papers that matter for the autonomy side of the wiki.

Foundational surveys and frameworks

Core architecture and scaling

  • Attention Is All You Need
  • Language Models are Unsupervised Multitask Learners
  • BERT
  • Scaling Laws for Neural Language Models
  • Training Compute-Optimal Large Language Models
  • GPT-3
  • Mixtral of Experts (Sparse MoE)
  • LLaMA
  • Qwen3 (dense + MoE family with thinking mode, 36T tokens, 119 languages)
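The common thread through the architecture papers above is the attention mechanism from "Attention Is All You Need". A minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V (single head, no masking or batching, for illustration only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- one attention head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (n_q, d_v) output
```

Real implementations add multi-head projections, causal masking, and batching on top of this core operation.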

Parameter-efficient fine-tuning

  • Prefix-Tuning: Optimizing Continuous Prompts for Generation
  • LoRA: Low-Rank Adaptation of Large Language Models
  • Adapter methods
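The LoRA paper's core idea fits in a few lines: keep the pretrained weight W frozen and add a trainable low-rank update B·A scaled by α/r. A minimal NumPy sketch of the forward pass (function and argument names are illustrative, not from any library):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W^T + (alpha/r) * x A^T B^T, where W is frozen and only
    A (r x d_in) and B (d_out x r) are trained. r is the LoRA rank."""
    r = A.shape[0]
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)
```

In the paper, B is initialized to zero, so training starts exactly at the frozen model's behavior; at inference the update can be merged into W with no added latency.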

Instruction tuning and alignment

Tool use and retrieval

  • ReAct
  • Toolformer
  • Retrieval-Augmented Generation papers when directly useful
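The retrieval step underlying RAG-style systems reduces to nearest-neighbor search over embeddings plus prompt assembly. A toy NumPy sketch, assuming embeddings are given (real systems use a learned dual encoder and an approximate-nearest-neighbor index; the function names here are illustrative):

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, k=2):
    """Cosine-similarity retrieval: indices of the k most similar docs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                                   # cosine similarity per doc
    return np.argsort(-sims)[:k]                   # best-first indices

def build_rag_prompt(question, docs, indices):
    """Prepend retrieved passages as context for the generator."""
    context = "\n".join(docs[i] for i in indices)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

ReAct and Toolformer generalize this pattern: the model interleaves generated text with tool calls (retrieval being one such tool) and conditions on the observations returned.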

Open-weight multimodal model families

  • Gemma 3 (Google DeepMind, 2025) -- 1B-27B open-weight family with native vision via SigLIP, 128K context via 5:1 local/global attention, knowledge distillation enabling 4B to match prior 27B. Gemma 3 Technical Report

Multimodal bridge

  • CLIP
  • DINO (self-supervised ViT features)
  • BLIP
  • Flamingo
  • LLaVA (Visual Instruction Tuning) -- established the dominant open-source recipe for multimodal instruction-following: CLIP encoder + linear projection + LLM, with GPT-assisted instruction data generation
  • Kosmos-style multimodal models
  • GPT-4V era papers and technical reports
  • Gemini 2.5 (sparse MoE multimodal with inference-time reasoning)

What to extract during ingest

  • architectural contribution
  • scaling insight
  • training recipe
  • alignment method
  • relevance to autonomy or VLA systems

Already seeded in batch 01

Ingested individually