Entry Points
ML · Autonomy · Robotics · VLA · Foundation Models
This wiki maps the convergence of machine learning, robotics, and foundation models into real autonomy systems — **190 papers from 2012 to 2026**, spanning foundational architectures (Transformer, ViT, ResNet), the LLM…
Vision Language Action
This page tracks the bridge from multimodal understanding to action generation, informed by the AutoVLA corpus of 18 papers spanning 2018–2025. A VLA system consumes visual context and language-conditioned intent, then…
Ilya Top 30
Ilya Sutskever's curated reading list of ~30 papers and resources spanning the conceptual foundations of deep learning, from architecture breakthroughs to information theory and complexity. This list circulated widely a…
VLA and Driving
This queue spans general VLA foundations and driving-specific multimodal action papers. The AutoVLA corpus (18 papers, 2018–2025) provides the most comprehensive coverage of how language-vision models have been applied…
Research Thesis
This page synthesizes the trajectory across 190 papers in the wiki, from foundational architectures through to 2025-era embodied AI. Updated with evidence from the full 2024 autonomy landscape and foundational ML corpus…
Open Questions
This page is the root of the open-questions tree. Each research pillar has its own dedicated page with stream-specific questions grounded in the papers we've ingested.
```
Overview
├── 1. End-to-End Driving (9 questions…
```
Paper Collections
View all papers →
Ilya Top 30
29 papers
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 2022
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale 2021
- Scaling Laws for Neural Language Models 2020
- Denoising Diffusion Probabilistic Models 2020
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism 2019
+24 more
AutoVLA / Driving
24 papers
- WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model 2025
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving 2025
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment 2025
- ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation 2025
- OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision Language Action Model 2025
+19 more
Foundation Models
66 papers
- DrivoR: Driving on Registers 2026
- Qwen3 Technical Report 2025
- Gemma 3 Technical Report 2025
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities 2025
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving 2025
+61 more
Autonomous Driving
92 papers
- SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving 2026
- DrivoR: Driving on Registers 2026
- WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model 2025
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving 2025
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment 2025
+87 more
Robotics / VLA
39 papers
- UniAct: Universal Actions for Enhanced Embodied Foundation Models 2025
- Towards Embodiment Scaling Laws in Robot Locomotion 2025
- SpatialVLA: Exploring Spatial Representations for VLA Models 2025
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics 2025
- Self-Improving Embodied Foundation Models 2025
+34 more
Sections
Comparisons (1)
Concepts (9)
Overview (1)
Queries (7)
Raw Readme (1)
Sources (203)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
- A Generalist Agent
- A Simple Neural Network Module for Relational Reasoning
- A Tutorial Introduction to the Minimum Description Length Principle
- Agent-Driver: A Language Agent for Autonomous Driving
- Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
- AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- AsyncDriver: Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
- Attention Is All You Need
- Autonomous Driving Seminal Papers
- AutoRT: Embodied Foundation Models for Large-Scale Orchestration of Robotic Agents
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
- BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
- BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
- BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- BridgeAD: Bridging Past and Future End-to-End Autonomous Driving with Historical Prediction
- CARLA: An Open Urban Driving Simulator
- CarPlanner: Consistent Auto-regressive RL Planner for Autonomous Driving
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
- Cosmos World Foundation Model Platform for Physical AI
- CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
- CS231n: Deep Learning for Computer Vision
- Deep Residual Learning for Image Recognition
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Denoising Diffusion Probabilistic Models
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
- Diffusion Models Beat GANs on Image Synthesis
- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
- DiMA: Distilling Multi-modal Large Language Models for Autonomous Driving
- Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
- Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles
- Drive-OccWorld: Driving in the Occupancy World
- DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
- DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving
- DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
- DriveGPT: Scaling Autoregressive Behavior Models for Driving
- DriveLM: Driving with Graph Visual Question Answering
- DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States
- DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Driving Scenes
- Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
- DrivoR: Driving on Registers
- Emerging Properties in Self-Supervised Vision Transformers (DINO)
- EMMA: End-to-End Multimodal Model for Autonomous Driving
- End to End Learning for Self-Driving Cars
- End-to-end Driving via Conditional Imitation Learning
- Exploring Simple Siamese Representation Learning
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
- FB-BEV: BEV Representation from Forward-Backward View Transformations
- Flamingo: a Visual Language Model for Few-Shot Learning
- FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
- GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation
- GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
- GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
- GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
- GaussianLSS: Toward Real-world BEV Perception with Depth Uncertainty via Gaussian Splatting
- GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
- GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
- GaussRender: Learning 3D Occupancy with Gaussian Rendering
- GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
- Gemini Robotics: Bringing AI into the Physical World
- Gemma 3 Technical Report
- GenAD: Generalized Predictive Model for Autonomous Driving
- GenAD: Generative End-to-End Autonomous Driving
- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- GPT-4 Technical Report
- GPT-Driver: Learning to Drive with GPT
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
- Helix: A Vision-Language-Action Model for Generalist Humanoid Control
- HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
- Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
- High-Resolution Image Synthesis with Latent Diffusion Models
- HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
- Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation
- Identity Mappings in Deep Residual Networks
- Ilya Top 30
- ImageNet Classification with Deep Convolutional Neural Networks
- Initial Corpus Batch 01
- Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
- Knowledge Insulating Vision-Language-Action Models
- Kolmogorov Complexity and Algorithmic Randomness
- Language Models are Few-Shot Learners
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
- LAW: Enhancing End-to-End Autonomous Driving with Latent World Model
- Learning by Cheating
- Learning Lane Graph Representations for Motion Forecasting
- Learning Transferable Visual Models From Natural Language Supervision
- Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
- LLM Seminal Papers
- LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
- LMDrive: Closed-Loop End-to-End Driving with Large Language Models
- LoRA: Low-Rank Adaptation of Large Language Models
- Machine Super Intelligence
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Mistral 7B
- Mixtral of Experts
- MomAD: Momentum-Aware Planning in End-to-End Autonomous Driving
- Multi-Scale Context Aggregation by Dilated Convolutions
- NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
- Neural Machine Translation by Jointly Learning to Align and Translate
- Neural Message Passing for Quantum Chemistry
- Neural Turing Machines
- nuScenes: A Multimodal Dataset for Autonomous Driving
- OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
- OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
- OccMamba: Semantic Occupancy Prediction with State Space Models
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
- Octo: An Open-Source Generalist Robot Policy
- On the Opportunities and Risks of Foundation Models
- OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision Language Action Model
- OpenVLA-OFT: Optimizing Speed and Success for VLA Fine-Tuning
- OpenVLA: An Open-Source Vision-Language-Action Model
- Order Matters: Sequence to Sequence for Sets
- ORION: Holistic End-to-End Autonomous Driving by Vision-Language Instructed Action Generation
- PaLM: Scaling Language Modeling with Pathways
- PaLM-E: An Embodied Multimodal Language Model
- PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving
- pi*0.6: A VLA That Learns From Experience
- pi0.5: A Vision-Language-Action Model with Open-World Generalization
- pi0: A Vision-Language-Action Flow Model for General Robot Control
- Planning-oriented Autonomous Driving
- Pointer Networks
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Pseudo-Simulation for Autonomous Driving (NAVSIM v2)
- QLoRA: Efficient Finetuning of Quantized LLMs
- Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
- Qwen3 Technical Report
- RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation
- ReAct: Synergizing Reasoning and Acting in Language Models
- Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving
- Recurrent Neural Network Regularization
- Relational Recurrent Neural Networks
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators
- Robotic Control via Embodied Chain-of-Thought Reasoning
- RoboVLMs: What Matters in Building Vision-Language-Action Models
- RT-1: Robotics Transformer for Real-World Control at Scale
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- RT-H: Action Hierarchies Using Language
- S4-Driver: Scalable Self-Supervised Driving MLLM with Spatio-Temporal Visual Representation
- SAM 2: Segment Anything in Images and Videos
- Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
- Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)
- Scaling Laws for Neural Language Models
- Segment Anything
- Self-Improving Embodied Foundation Models
- SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
- Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
- Source Ingest Queue
- SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
- SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
- SparseOcc: Fully Sparse 3D Occupancy Prediction
- SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
- SpatialVLA: Exploring Spatial Representations for VLA Models
- SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Talk2Car: Taking Control of Your Self-Driving Car
- Talk2Drive: Towards Personalized Autonomous Driving with Large Language Models
- Textual Explanations for Self-Driving Vehicles
- The First Law of Complexodynamics
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Towards Embodiment Scaling Laws in Robot Locomotion
- Training Compute-Optimal Large Language Models
- Training Language Models to Follow Instructions with Human Feedback
- TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Understanding LSTM Networks
- UniAct: Universal Actions for Enhanced Embodied Foundation Models
- UniSim: Learning Interactive Real-World Simulators
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (GR-1)
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving
- VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
- Variational Lossy Autoencoder
- VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
- Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
- Visual Instruction Tuning (LLaVA)
- VLA and Driving
- VLP: Vision Language Planning for Autonomous Driving
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
- WoTE: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
- YOLOv10: Real-Time End-to-End Object Detection
Syntheses (2)
Taxonomies (1)
Templates (2)
Recent Activity
Random 20-paper sample (seed 20260416)
Random 20-paper serious-error check (seed 20260415)
Random 10-paper fact-check sample (seed 20260414)
High-Resolution Image Synthesis with Latent Diffusion Models
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Exploring Simple Siamese Representation Learning (SimSiam)
Diffusion Models Beat GANs on Image Synthesis