Overview
This wiki maps the convergence of machine learning, robotics, and foundation models into real autonomy systems — 190 papers from 2012 to 2026, spanning foundational architectures (Transformer, ViT, ResNet), the LLM revolution (GPT-4, Llama 2, LoRA, Chain-of-Thought), vision-language breakthroughs (CLIP, LLaVA, SAM), and the cutting edge of embodied AI and autonomous driving.
The field has undergone three major shifts:
- Modular → End-to-End: From perception-prediction-planning pipelines (UniAD CVPR 2023 Best Paper, VAD) to unified architectures where DriveTransformer parallelizes all tasks in a single transformer and DiffusionDrive achieves real-time diffusion planning at 45 FPS.
- Imitation → RL: From pure behavioral cloning to RL-enhanced planning, where CarPlanner is the first RL planner to beat IL+rules on nuPlan and π₀.₆ doubles robot task throughput via offline RL self-improvement.
- Task-specific → Generalist VLA: From narrow models to Vision-Language-Action agents that generalize across embodiments — π₀ (flow matching across 7 robots, 68 tasks), CrossFormer (one policy for 20+ embodiments including quadcopters), GR00T N1, and Gemini Robotics.
Five research pillars
1. End-to-end autonomous driving
The perception→prediction→planning decomposition (Perception, Prediction, Planning) is being collapsed. UniAD unified it into one framework; DriveTransformer (ICLR 2025) parallelized all tasks; DriveGPT (Waymo, ICML 2025) proved LLM-style scaling laws hold for driving. Diffusion and flow-matching planners (DiffusionDrive, GoalFlow) displaced autoregressive methods, while NAVSIM became the definitive evaluation benchmark with 143 teams. See End To End Architectures.
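An "LLM-style scaling law" here means that loss falls as a power law in model size, L(N) = a · N^(−b), which is linear in log-log space. A minimal sketch of how such a fit is checked, on synthetic data (the sizes, losses, and exponent below are made up for illustration, not DriveGPT's numbers):

```python
import numpy as np

# Synthetic loss-vs-model-size points that follow an exact power law.
n_params = np.array([1e6, 1e7, 1e8, 1e9])   # model sizes (parameters)
loss = 5.0 * n_params ** -0.07              # made-up validation losses

# A power law L = a * N**(-b) is linear in log-log space:
#   log L = log a - b * log N
# so an ordinary linear fit recovers the coefficients.
slope, log_a = np.polyfit(np.log(n_params), np.log(loss), 1)
a, b = np.exp(log_a), -slope

print(f"fit: L(N) = {a:.2f} * N^(-{b:.3f})")
```

If real training runs at several sizes fall on such a line, the scaling law "holds" in that domain; systematic curvature in log-log space is evidence it does not.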
2. Vision-language-action models
VLA models matured from proof-of-concept to open-source infrastructure in 2024. OpenVLA (7B, 970K demos) outperforms the closed RT-2-X (55B) by 16.5%. Octo was the first open generalist robot policy. π₀ introduced flow matching for continuous 50 Hz control. The dual-system pattern (slow VLM reasoning at 7–10 Hz + fast motor control at 120–200 Hz) independently emerged at Google DeepMind, Physical Intelligence, NVIDIA, and Figure AI.
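The dual-system pattern can be sketched as two nested loops at different rates: a slow "System 2" planner that periodically updates a latent goal, and a fast "System 1" controller that always acts on the latest goal. A minimal simulation, with illustrative rates and stand-in functions (`slow_plan` and `fast_act` are placeholders, not any lab's API):

```python
FAST_HZ = 200                         # motor-control rate (System 1)
SLOW_HZ = 10                          # VLM reasoning rate (System 2)
TICKS_PER_PLAN = FAST_HZ // SLOW_HZ   # fast ticks between planner updates

def slow_plan(observation: float) -> float:
    """Stand-in for VLM reasoning: emit a latent goal (here, a setpoint)."""
    return observation + 1.0

def fast_act(state: float, goal: float) -> float:
    """Stand-in for the motor policy: proportional step toward the goal."""
    return state + 0.1 * (goal - state)

def run(seconds: float = 1.0) -> float:
    state, goal = 0.0, 0.0
    for tick in range(int(seconds * FAST_HZ)):
        if tick % TICKS_PER_PLAN == 0:   # slow loop: refresh the plan
            goal = slow_plan(state)
        state = fast_act(state, goal)    # fast loop: act on latest plan
    return state

print(run())  # state tracks the slowly moving goal
```

The key property is decoupling: the controller never blocks on the planner, so heavyweight reasoning latency does not stall high-rate actuation.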
3. LLM reasoning for driving and robotics
LLMs transitioned from curiosity to structured cognitive agents. Agent-Driver established the LLM-as-agent framework with tool use and chain-of-thought reasoning. DriveLM introduced graph-structured VQA reasoning. LLMs Can't Plan (ICML 2024) provided theoretical grounding for why LLMs should reason rather than plan, paired with external verifiers that check their outputs. ECoT increased VLA success by 28% through embodied reasoning.
4. Foundation models and cross-embodiment transfer
Foundation models proved cross-embodiment scaling works. CrossFormer (900K trajectories, 20+ embodiments) is the first single policy for manipulators, navigators, quadrupeds, and aerial vehicles. HPT demonstrated scaling laws for heterogeneous robot pretraining across 52 datasets. UniSim enables zero-shot real-world transfer from learned simulators. The foundational stack — CLIP, Latent Diffusion, LoRA, Mamba — underpins all of it.
5. BEV perception and 3D occupancy
BEV-based 3D perception pivoted to Gaussians, sparsity, and world models. GaussianFormer replaced dense voxels with semantic Gaussians (75–82% memory reduction). OccWorld pioneered occupancy-based world models with GPT-like generation. SparseOcc introduced the RayIoU metric that became the community standard. SelfOcc eliminated the annotation bottleneck with self-supervised training. See Perception.
The foundational ML stack
The wiki also covers the papers that made all of the above possible:
| Era | Key papers |
|---|---|
| Architecture | Transformer (91K+ cit.), ViT (91K+), Swin (44K+), ResNet |
| Language models | GPT-4 (26K+), Llama 2 (22K+), Mistral 7B, Mixtral |
| Vision-language | CLIP (58K+), LLaVA (13K+), SAM (19K+), Flamingo |
| Generative | Latent Diffusion (32K+), DDPM, Diffusion Beats GANs |
| Efficiency | LoRA (29K+), QLoRA, Prefix-Tuning |
| Alignment | InstructGPT (24K+), DPO, Chain-of-Thought (27K+) |
| Agents | ReAct (8K+), Toolformer |
Open questions by stream
Each pillar has dedicated open questions grounded in the papers above. See Open Questions for the full question tree.
| Stream | Questions | Key tension |
|---|---|---|
| End-to-End Driving | 9 | Unified vs. decoupled, generative vs. discriminative |
| VLA Models | 10 | Dual-system convergence, cross-embodiment limits |
| LLM Reasoning | 9 | Language as scaffold vs. core, reasoning vs. planning |
| Foundation Models | 10 | Open vs. closed, scaling laws for embodied AI |
| BEV & 3D Occupancy | 10 | Dense vs. Gaussian, occupancy in E2E |
Five cross-cutting themes emerge: RL frontier (every stream hitting an IL ceiling), scaling laws for embodied AI, distillation as deployment, evaluation adequacy, and explicit structure vs. learned representations. The Research Thesis synthesizes these into a unified view.
Navigation
| Section | Description |
|---|---|
| Open Questions | Root page for 48 open questions across 5 streams |
| Research Map | Field breakdown across research directions |
| Vision Language Action | VLA evolution from CIL to π₀ — the core action paradigm |
| Ilya Top 30 | Ilya's curated 30-paper curriculum on deep learning foundations |
| Vla And Driving | 90+ driving and robotics VLA papers organized by wave |
| Research Thesis | Current high-level thesis with evidence for and against |
| Modular Vs End To End | The core systems architecture debate |