ESC

Overview

This wiki maps the convergence of machine learning, robotics, and foundation models into real autonomy systems — 190 papers from 2012 to 2026, spanning foundational architectures (Transformer, ViT, ResNet), the LLM revolution (GPT-4, Llama 2, LoRA, Chain-of-Thought), vision-language breakthroughs (CLIP, LLaVA, SAM), and the cutting edge of embodied AI and autonomous driving.

The field has undergone three major shifts:

  1. Modular → End-to-End: From perception-prediction-planning pipelines (UniAD CVPR 2023 Best Paper, VAD) to unified architectures where DriveTransformer parallelizes all tasks in a single transformer and DiffusionDrive achieves real-time diffusion planning at 45 FPS.

  2. Imitation → RL: From pure behavioral cloning to RL-enhanced planning where CarPlanner is the first RL planner to beat IL+rules on nuPlan, and π₀.₆ doubles robot task throughput via offline RL self-improvement.

  3. Task-specific → Generalist VLA: From narrow models to Vision-Language-Action agents that generalize across embodiments — π₀ (flow matching across 7 robots, 68 tasks), CrossFormer (one policy for 20+ embodiments including quadcopters), GR00T N1, and Gemini Robotics.
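The flow-matching recipe behind π₀ can be illustrated on a toy scale: build interpolation paths between noise and data, regress the conditional velocity, then Euler-integrate noise into an action. In this minimal sketch the exact conditional velocity u = x1 − x0 stands in for the trained network (a simplifying assumption, not π₀'s architecture); the "7-DoF action chunk" is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Flow matching trains a velocity field v(x_t, t) to regress the conditional
# velocity u = x1 - x0 along linear paths x_t = (1 - t) * x0 + t * x1,
# where x0 is noise and x1 is a data sample (e.g. a robot action chunk).
x1 = rng.normal(2.0, 0.1, size=(7,))   # toy "action chunk" (7 DoF, illustrative)
x0 = rng.normal(0.0, 1.0, size=(7,))   # Gaussian noise sample
t = 0.3
x_t = (1 - t) * x0 + t * x1            # a point on the interpolation path
u = x1 - x0                            # regression target for v(x_t, t)

# At inference the learned field is Euler-integrated from noise to an action.
# With the exact conditional velocity standing in for the network, 10 Euler
# steps of size 1/10 land exactly on the data sample x1.
x = x0.copy()
for _ in range(10):
    x = x + u * (1 / 10)

assert np.allclose(x, x1)
```

Because the path is linear and the conditional velocity is constant along it, the integration is exact here; a real policy network only approximates this field, which is what makes few-step, 50 Hz sampling feasible.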

Five research pillars

1. End-to-end autonomous driving

The perception→prediction→planning decomposition (Perception, Prediction, Planning) is being collapsed. UniAD unified it into one framework; DriveTransformer (ICLR 2025) parallelized all tasks; DriveGPT (Waymo, ICML 2025) proved LLM-style scaling laws hold for driving. Diffusion and flow-matching planners (DiffusionDrive, GoalFlow) displaced autoregressive methods, while NAVSIM became the definitive evaluation benchmark with 143 teams. See End To End Architectures.

2. Vision-language-action models

VLA models matured from proof-of-concept to open-source infrastructure in 2024. OpenVLA (7B, 970K demos) outperforms the closed RT-2-X (55B) by 16.5% in absolute task success rate. Octo was the first open generalist robot policy. π₀ introduced flow matching for continuous 50 Hz control. The dual-system pattern (slow VLM reasoning at 7–10 Hz + fast motor control at 120–200 Hz) independently emerged at Google DeepMind, Physical Intelligence, NVIDIA, and Figure AI.
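The dual-system pattern above can be sketched as two loops at different rates: a slow reasoner refreshes a goal latent a few times per second, while a fast policy acts on the latest latent every tick. All names here are hypothetical stand-ins (a P-controller plays the role of the learned motor policy, an array the role of the VLM's latent):

```python
import numpy as np

SLOW_HZ, FAST_HZ = 8, 160               # illustrative rates within the 7-10 / 120-200 Hz bands
TICKS_PER_SLOW = FAST_HZ // SLOW_HZ     # fast ticks per slow update (20)

def slow_vlm(observation):
    """Stand-in for the slow VLM reasoner: emits a goal latent (hypothetical)."""
    return np.array([observation, 1.0])

def fast_controller(state, latent):
    """Stand-in for the fast motor policy: P-control toward the latent's goal."""
    goal = latent[0]
    return 0.2 * (goal - state)         # action = gain * error

state, latent = 0.0, slow_vlm(0.0)
log = []
for tick in range(FAST_HZ):             # simulate one second of control
    if tick % TICKS_PER_SLOW == 0:      # slow system refreshes the latent at ~8 Hz
        latent = slow_vlm(5.0)          # e.g. a new goal inferred from the scene
    state += fast_controller(state, latent)  # fast system acts on every tick
    log.append(state)

print(round(log[-1], 3))                # state converges toward the goal 5.0
```

The design point is decoupling: the fast loop never blocks on the slow one, it simply consumes the most recent latent, which is why the two systems can run at rates an order of magnitude apart.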

3. LLM reasoning for driving and robotics

LLMs transitioned from curiosity to structured cognitive agents. Agent-Driver established the LLM-as-agent framework with tool use and chain-of-thought reasoning. DriveLM introduced graph-structured VQA reasoning. LLMs Can't Plan (ICML 2024) provided theoretical grounding for why LLMs should reason rather than plan outright, with external verifiers checking candidate plans. ECoT increased VLA success by 28% through embodied reasoning.
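The generate-and-verify division of labor can be sketched as a toy loop: the "LLM" proposes, a sound external verifier checks hard constraints, and failures are fed back as critique. `propose_plan` and `verify` are illustrative stand-ins, not an API from any of the cited papers:

```python
def propose_plan(goal, feedback=None):
    """Stand-in for an LLM proposer (goal unused in this stub): enumerate candidates."""
    candidates = [["accelerate", "merge"],
                  ["signal", "merge"],
                  ["signal", "check_mirror", "merge"]]
    if feedback:  # crude use of critique: keep only plans containing the missing step
        candidates = [p for p in candidates if feedback["missing"] in p]
    return candidates[0]

def verify(plan):
    """External sound verifier: merging requires signaling first."""
    if "signal" not in plan:
        return False, {"missing": "signal"}
    return True, None

plan, feedback = None, None
for _ in range(3):                      # bounded interaction budget
    plan = propose_plan("merge left", feedback)
    ok, feedback = verify(plan)
    if ok:
        break

print(plan)  # a plan that passes the verifier
```

The guarantee comes from the verifier, not the generator: however unreliable the proposer, only plans that satisfy the checked constraints survive the loop.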

4. Foundation models and cross-embodiment transfer

Foundation models proved cross-embodiment scaling works. CrossFormer (900K trajectories, 20+ embodiments) is the first single policy for manipulators, navigators, quadrupeds, and aerial vehicles. HPT demonstrated scaling laws for heterogeneous robot pretraining across 52 datasets. UniSim enables zero-shot real-world transfer from learned simulators. The foundational stack — CLIP, Latent Diffusion, LoRA, Mamba — underpins all of it.
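As a concrete example of why LoRA matters for this stack: it freezes the pretrained weight W and learns only a low-rank update (α/r)·BA, shrinking the trainable parameter count by orders of magnitude. A minimal NumPy sketch with illustrative dimensions (the 512×512 layer and α=16 are assumptions, not values from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8                 # illustrative dims; rank r << d
alpha = 16                                   # LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def forward(x):
    # Adapted layer: h = Wx + (alpha / r) * B(Ax); W is never updated.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(forward(x), W @ x)        # zero-init B: adapter starts as a no-op

full = d_out * d_in
lora = r * (d_in + d_out)
print(f"trainable params: {lora} vs full fine-tune: {full} ({lora / full:.1%})")
```

Zero-initializing B means fine-tuning starts exactly at the pretrained model, and at deployment BA can be merged back into W, so inference pays no extra cost.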

5. BEV perception and 3D occupancy

BEV-based 3D perception pivoted to Gaussians, sparsity, and world models. GaussianFormer replaced dense voxels with semantic Gaussians (75–82% memory reduction). OccWorld pioneered occupancy-based world models with GPT-like generation. SparseOcc introduced the RayIoU metric that became the community standard. SelfOcc eliminated the annotation bottleneck with self-supervised training. See Perception.
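The intuition behind RayIoU, scoring occupancy by what rays actually see rather than by per-voxel overlap, can be shown with a much-simplified sketch. This is class-agnostic and uses only axis-aligned column rays; the real SparseOcc metric casts LiDAR-style query rays and also checks semantic labels:

```python
import numpy as np

def first_hit_depths(occ, rays):
    """Depth (voxel index) of the first occupied cell along each +x column ray.
    `rays` lists (y, z) column indices; returns inf when a ray hits nothing."""
    depths = []
    for y, z in rays:
        hits = np.flatnonzero(occ[:, y, z])
        depths.append(hits[0] if hits.size else np.inf)
    return np.array(depths, dtype=float)

def ray_iou(pred, gt, rays, tol=2):
    """Simplified ray-based IoU: a ray is a true positive when both grids are
    hit and the first-hit depths agree within `tol` voxels."""
    dp, dg = first_hit_depths(pred, rays), first_hit_depths(gt, rays)
    tp = np.sum(np.isfinite(dp) & np.isfinite(dg) & (np.abs(dp - dg) <= tol))
    fp = np.sum(np.isfinite(dp) & ~np.isfinite(dg))
    fn = np.sum(~np.isfinite(dp) & np.isfinite(dg))
    # Depth mismatches beyond tol count against both prediction and ground truth.
    miss = np.sum(np.isfinite(dp) & np.isfinite(dg) & (np.abs(dp - dg) > tol))
    return tp / (tp + fp + fn + miss)

gt = np.zeros((16, 4, 4)); gt[8, :, :] = 1        # ground truth: a wall at depth 8
pred = np.zeros((16, 4, 4)); pred[9, :2, :] = 1   # prediction: half the wall, one voxel off
rays = [(y, z) for y in range(4) for z in range(4)]
print(ray_iou(pred, gt, rays))
```

Unlike voxel IoU, thickening a surface along the ray direction does not inflate this score, which is the gaming behavior the metric was designed to penalize.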

The foundational ML stack

The wiki also covers the papers that made all of the above possible:

| Era | Key papers |
| --- | --- |
| Architecture | Transformer (91K+ cit.), ViT (91K+), Swin (44K+), ResNet |
| Language models | GPT-4 (26K+), Llama 2 (22K+), Mistral 7B, Mixtral |
| Vision-language | CLIP (58K+), LLaVA (13K+), SAM (19K+), Flamingo |
| Generative | Latent Diffusion (32K+), DDPM, Diffusion Beats GANs |
| Efficiency | LoRA (29K+), QLoRA, Prefix-Tuning |
| Alignment | InstructGPT (24K+), DPO, Chain-of-Thought (27K+) |
| Agents | ReAct (8K+), Toolformer |

Open questions by stream

Each pillar has dedicated open questions grounded in the papers above. See Open Questions for the full question tree.

| Stream | Questions | Key tension |
| --- | --- | --- |
| End-to-End Driving | 9 | Unified vs. decoupled, generative vs. discriminative |
| VLA Models | 10 | Dual-system convergence, cross-embodiment limits |
| LLM Reasoning | 9 | Language as scaffold vs. core, reasoning vs. planning |
| Foundation Models | 10 | Open vs. closed, scaling laws for embodied AI |
| BEV & 3D Occupancy | 10 | Dense vs. Gaussian, occupancy in E2E |

Five cross-cutting themes emerge: RL frontier (every stream hitting an IL ceiling), scaling laws for embodied AI, distillation as deployment, evaluation adequacy, and explicit structure vs. learned representations. The Research Thesis synthesizes these into a unified view.

| Section | Description |
| --- | --- |
| Open Questions | Root page for 48 open questions across 5 streams |
| Research Map | Field breakdown across research directions |
| Vision Language Action | VLA evolution from CIL to π₀ — the core action paradigm |
| Ilya Top 30 | Ilya's curated 30-paper curriculum on deep learning foundations |
| Vla And Driving | 90+ driving and robotics VLA papers organized by wave |
| Research Thesis | Current high-level thesis with evidence for and against |
| Modular Vs End To End | The core systems architecture debate |