
Machine Learning

Machine learning provides the technical substrate for every system discussed in this wiki. This page traces the key ideas from the deep learning revolution through to the foundation-model era, with emphasis on the threads that feed into autonomous driving and embodied AI.

The deep learning revolution

Modern deep learning begins with the demonstration that large convolutional networks trained on GPU hardware can dramatically outperform hand-engineered features. Imagenet Classification With Deep Convolutional Neural Networks (AlexNet, 2012) cut ImageNet error nearly in half and launched a decade of architecture scaling. The critical follow-up was Deep Residual Learning For Image Recognition (ResNet, 2015), which showed that residual connections allow networks to scale to hundreds of layers without degradation, establishing the blueprint for virtually all modern vision backbones.
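
To make the residual idea concrete, the sketch below shows a minimal ResNet-style block in PyTorch (illustrative channel counts and layer choices, not the paper's exact configuration): the skip connection adds the block's input back to its output, so each block only has to learn a residual correction and gradients can flow around the convolutions.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Illustrative ResNet-style block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                                  # skip path carries the input forward
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)              # residual addition

x = torch.randn(2, 64, 32, 32)
print(BasicBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```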

Parallel advances in sequence modeling proved equally important. Recurrent networks with attention, pioneered by Neural Machine Translation By Jointly Learning To Align And Translate (Bahdanau attention, 2014), showed that learned alignment could replace fixed-length bottlenecks. Specialized architectures like Pointer Networks demonstrated that output spaces could be variable and input-dependent, foreshadowing the flexible decoding strategies used in modern planners. Neural Turing Machines explored external memory for neural networks, an idea that resurfaces in world-model and map-memory designs for driving.
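
A minimal sketch of additive (Bahdanau-style) attention follows, with illustrative dimensions and layer names: the decoder state is scored against every encoder state, and the softmax over those scores is the learned alignment that replaces a single fixed-length summary vector.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over encoder states (illustrative dims)."""
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_states)
                                   + self.W_dec(dec_state).unsqueeze(1)))  # (batch, src_len, 1)
        weights = torch.softmax(scores.squeeze(-1), dim=-1)                # alignment over source
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (batch, enc_dim)
        return context, weights
```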

The transformer era

Attention Is All You Need (2017) unified these threads by replacing recurrence entirely with self-attention, enabling massive parallelism and scaling. The transformer architecture now dominates language (Language Models Are Few Shot Learners), vision (An Image Is Worth 16X16 Words Transformers For Image Recognition At Scale), and multimodal settings, and has superseded the recurrent end-to-end speech pipelines exemplified by Deep Speech 2. Swin Transformer Hierarchical Vision Transformer Using Shifted Windows (Swin Transformer, 2021) made transformers practical as general-purpose vision backbones by introducing hierarchical multi-scale features and shifted-window attention with complexity linear in image size, replacing ResNet across detection, segmentation, and BEV perception pipelines. For driving, transformers underpin BEV encoders (BEVFormer), trajectory decoders (VAD), and the VLA systems that treat perception, prediction, and planning as a single sequence-modeling problem.
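
The core operation is scaled dot-product attention, sketched below as a generic implementation rather than any particular model's code: every position attends to every other position through a softmax over pairwise similarities, which is what allows full parallelism across the sequence.

```python
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_head); returns the attended values."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # pairwise similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v               # weighted sum of values
```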

Scaling laws and compute-optimal training

A defining insight of the 2020s is that model performance is predictable from scale. Scaling Laws For Neural Language Models (Kaplan et al., 2020) established power-law relationships between compute, data, parameters, and loss. Training Compute Optimal Large Language Models (Chinchilla, 2022) refined this analysis, showing that most large models were substantially undertrained on data for their parameter count and that parameters and training tokens should be scaled in roughly equal proportion. These findings directly shape how foundation models for driving are designed: the push toward larger VLMs and longer training schedules in systems like EMMA and DriveVLM follows from scaling-law reasoning.
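
A back-of-envelope sketch of the compute-optimal sizing rule is given below, using the common approximations C ≈ 6ND for training FLOPs and roughly 20 training tokens per parameter; the fitted exponents and constants in the paper differ in detail, so this is only an order-of-magnitude guide.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Approximate compute-optimal sizing: C ~ 6*N*D with D ~ 20*N (assumed constants)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Chinchilla training budget (~5.8e23 FLOPs) recovers ~70B params and ~1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```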

Self-supervised and multimodal pretraining

Self-supervised pretraining allows models to learn general representations before task-specific fine-tuning. Bert Pre Training Of Deep Bidirectional Transformers For Language Understanding introduced masked language modeling; Learning Transferable Visual Models From Natural Language Supervision (CLIP) extended contrastive pretraining to vision-language pairs. Exploring Simple Siamese Representation Learning (SimSiam, 2021) showed that self-supervised visual learning can be dramatically simplified: a Siamese network with stop-gradient and a prediction MLP achieves competitive results without negative pairs, momentum encoders, or large batches, clarifying which components of prior methods were truly essential. Denoising Diffusion Probabilistic Models recast generation as learning to reverse a gradual noising process, and Diffusion Models Beat Gans On Image Synthesis showed that diffusion models with classifier guidance could surpass GANs on image quality, catalyzing the shift toward diffusion-based generation across images, video, audio, 3D, and planning in driving contexts.
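
The SimSiam objective described above reduces to a few lines; the sketch below assumes an encoder producing projections z and a prediction MLP producing p, both of which are placeholders rather than the paper's exact architectures.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """p_i = predictor(encoder(view_i)), z_i = encoder(view_i); symmetric negative cosine loss."""
    def D(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()  # stop-gradient on the target branch
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```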

Reasoning and chain-of-thought

Chain Of Thought Prompting Elicits Reasoning In Large Language Models showed that prompting LLMs to produce intermediate reasoning steps dramatically improves performance on complex tasks. This idea is central to driving VLAs: systems like DriveLM and Reason2Drive use chain-of-thought structures to decompose driving decisions into perception, prediction, and planning stages before emitting actions.
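
A chain-of-thought prompt is simply a demonstration that spells out intermediate steps before the answer; the example below is made up for illustration and is not taken from the paper.

```python
# Few-shot prompt: the worked example shows the reasoning style the model should imitate
# on the new question before committing to a final answer.
prompt = """Q: A car travels 30 km in 20 minutes. What is its speed in km/h?
A: 20 minutes is 1/3 of an hour. Speed = 30 km / (1/3 h) = 90 km/h. The answer is 90.

Q: A truck travels 45 km in 30 minutes. What is its speed in km/h?
A:"""
```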

Deepseek R1 Incentivizing Reasoning Capability In Llms Via Reinforcement Learning (DeepSeek-R1, 2025) demonstrated that chain-of-thought reasoning can emerge from pure reinforcement learning without human-annotated reasoning demonstrations. Using Group Relative Policy Optimization (GRPO) with simple rule-based rewards on a 671B MoE base model, R1-Zero discovers self-verification, reflection, and adaptive compute allocation. The full R1 model uses a multi-stage pipeline (cold-start SFT → reasoning RL → rejection sampling SFT → alignment RL) to reach performance competitive with OpenAI-o1 on math, code, and science benchmarks. Crucially, reasoning capabilities distill effectively to models as small as 1.5B parameters, democratizing access to strong reasoning. This represents a paradigm shift: the training objective (RL with outcome rewards), not just scale, is a key axis for capability emergence.
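
The group-relative advantage at the core of GRPO can be sketched in a few lines (the clipped policy-gradient update and KL regularization are omitted, and the reward values are made up): each prompt gets a group of sampled completions scored by rule-based rewards, and advantages are the rewards standardized within that group rather than estimated by a learned value function.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (group_size,) rule-based rewards for completions of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)   # standardize within the group

# Made-up rewards for a group of 5 sampled completions (1.0 = correct, 0.0 = incorrect).
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])))
```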

Parameter-efficient adaptation

As models scale to billions of parameters, full fine-tuning becomes impractical for multi-task deployment. Prefix Tuning Optimizing Continuous Prompts For Generation (2021) demonstrated that prepending learned continuous vectors to transformer key-value pairs at every layer enables task adaptation with only 0.1% of parameters, matching full fine-tuning on generation tasks. Lora Low Rank Adaptation Of Large Language Models (LoRA, ICLR 2022) became the dominant PEFT method: by freezing pretrained weights and injecting trainable low-rank decomposition matrices (ΔW = BA, with rank r far smaller than the weight dimensions), LoRA reduces trainable parameters by 10,000x on GPT-3 175B while matching full fine-tuning, with zero inference overhead after merging. Together, prefix-tuning, LoRA, adapters, and prompt tuning form the PEFT paradigm now standard for adapting foundation models to downstream tasks, including driving VLA systems that fine-tune large VLMs for action prediction.
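
A minimal sketch of a LoRA-adapted linear layer follows, matching the ΔW = BA description above; the class name, initialization scale, and alpha/r scaling convention are illustrative choices rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update scale * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self) -> None:
        # Fold the learned delta into the base weight, so inference costs nothing extra.
        self.base.weight.data += self.scale * (self.B @ self.A)
```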

The foundation model paradigm

On The Opportunities And Risks Of Foundation Models (Bommasani et al., 2021) formalized the concept of foundation models -- large models pretrained on broad data via self-supervision, then adapted to downstream tasks. The report identified two defining phenomena: emergence (unanticipated capabilities arising from scale) and homogenization (convergence around a few base models). Both have been validated by subsequent developments: GPT-4's emergent capabilities, the dominance of LoRA/RLHF adaptation pipelines, and the concentration of frontier model development among a handful of labs.

Present state and open problems

  • Scaling vs. efficiency: Scaling laws favor larger models, but driving demands real-time inference. How to reconcile these pressures remains open.
  • Distribution shift: ML models are brittle under distribution shift, and driving presents severe train/deploy mismatch (weather, geography, adversarial agents).
  • Uncertainty quantification: Most neural networks produce poorly calibrated confidence estimates, a critical gap for safety-critical deployment.
  • Data-centric ML: The field is shifting from architecture innovation to data curation, augmentation, and synthesis, but best practices for driving data remain unsettled.

Key papers

Paper | Contribution
Imagenet Classification With Deep Convolutional Neural Networks | Launched deep learning era with GPU-trained CNNs
Deep Residual Learning For Image Recognition | Residual connections enabling very deep networks
Attention Is All You Need | Transformer architecture replacing recurrence with self-attention
Scaling Laws For Neural Language Models | Power-law scaling relationships for neural LMs
Learning Transferable Visual Models From Natural Language Supervision | CLIP: contrastive vision-language pretraining
Exploring Simple Siamese Representation Learning | SimSiam: minimal self-supervised learning without negatives or momentum
Chain Of Thought Prompting Elicits Reasoning In Large Language Models | Intermediate reasoning steps improve LLM performance
Deepseek R1 Incentivizing Reasoning Capability In Llms Via Reinforcement Learning | Emergent reasoning from RL; GRPO; distillation to small models
Denoising Diffusion Probabilistic Models | Diffusion models for high-quality generation
Diffusion Models Beat Gans On Image Synthesis | Classifier guidance enabling diffusion to surpass GANs
Neural Machine Translation By Jointly Learning To Align And Translate | Attention mechanism for sequence-to-sequence models
Prefix Tuning Optimizing Continuous Prompts For Generation | Parameter-efficient fine-tuning via continuous prefix optimization (0.1% of params)
Lora Low Rank Adaptation Of Large Language Models | LoRA: low-rank adaptation reducing trainable params by 10,000x with zero inference overhead
On The Opportunities And Risks Of Foundation Models | Coined "foundation model"; emergence + homogenization framework
Swin Transformer Hierarchical Vision Transformer Using Shifted Windows | Hierarchical vision transformer; general-purpose backbone replacing CNNs