Log

[2026-04-16] audit | Batch 9 factuality audit — 7 papers

  • Audited 7 paper wiki pages against source PDFs (Mistral OCR): drivetransformer, drivevlm, driving-gaussian, driving-with-llms, ecot, embodiment-scaling-laws, fb-bev.
  • Fixed 3 factual errors in fb-bev:
  • Results table: removed a fabricated BEVFormer v2 | ResNet-101 | 61.7 | 52.8 row (not in the paper), corrected the FB-BEV backbone from ResNet-101 to V2-99, and replaced the fabricated row with the paper's actual Table 2 baselines (SOLOFusion, BEVStereo, BEVDepth).
  • Ablation table (forward vs. backward vs. combined): replaced fabricated test-set numbers (58.1/47.9, 59.3/49.5, 62.4/54.2) with actual val-set numbers from paper Table 1 (R50, no temporal, no depth).
  • "Effect of 3D Pre-training" table: replaced three-row fabricated pre-training ablation (59.8/50.1, 61.2/52.4, 62.4/54.2) with actual val-set depth supervision comparison from Table 1 (47.9/35.0 → 49.8/37.8). Status set to audited-corrected.
  • 6 other papers verified clean against their paper sources: DriveTransformer (all Bench2Drive + nuScenes numbers confirmed), DriveVLM (Qwen-VL 9.6B, 410ms, nuScenes numbers confirmed), DrivingGaussian (PSNR/SSIM/LPIPS/rendering-time figures confirmed), Driving-with-LLMs (all perception/action/QA table numbers confirmed), ECoT (key claims confirmed, though the per-category breakdown numbers in the results table may be inaccurate summaries), Embodiment Scaling Laws (GENBOT-1K counts, compute stats, CoRL venue all confirmed).
  • Report written to .grounding/reports/batch9.md.
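  • The number-level cross-check behind these audits is mechanical enough to sketch. A minimal illustration in Python (not the actual tooling; the paths, and treating every numeric table cell as a verbatim match target, are assumptions):

```python
# Sketch of the audit cross-check (illustrative): collect every number in a
# wiki page's markdown tables and flag values that never appear in the OCR'd
# paper text. Flagged values still need manual verification against the PDF.
import re
from pathlib import Path

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def table_numbers(wiki_md: str) -> set[str]:
    """Numeric cells from markdown table rows (lines starting with '|')."""
    nums: set[str] = set()
    for line in wiki_md.splitlines():
        if line.lstrip().startswith("|"):
            nums.update(NUM.findall(line))
    return nums

def audit(wiki_path: str, ocr_path: str) -> list[str]:
    wiki_nums = table_numbers(Path(wiki_path).read_text())
    ocr_nums = set(NUM.findall(Path(ocr_path).read_text()))
    return sorted(wiki_nums - ocr_nums)  # candidates for fabricated values

if __name__ == "__main__":
    # Hypothetical OCR dump path; the wiki slug is the fb-bev page audited above.
    print(audit(
        "wiki/sources/papers/fb-bev-bev-representation-from-forward-backward-view-transformations.md",
        ".grounding/ocr/fb-bev.txt",
    ))
```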

[2026-04-16] audit | Batch 14 factuality audit — 7 papers

  • Audited 7 paper wiki pages against source PDFs (Mistral OCR): learning-lane-graph-representations-for-motion-forecasting, learning-transferable-visual-models-from-natural-language-supervision, lift-splat-shoot-encoding-images-from-arbitrary-camera-rigs-by-implicitly-unprojecting-to-3d, llama-2-open-foundation-and-fine-tuned-chat-models, llarva-vision-action-instruction-tuning-enhances-robot-learning, llms-cant-plan-but-can-help-planning-in-llm-modulo-frameworks, lmdrive-closed-loop-end-to-end-driving-with-large-language-models.
  • Fixed 2 factual errors in learning-transferable-visual-models-from-natural-language-supervision (CLIP):
  • Architecture section incorrectly named ViT-L/14 as the best-performing model. Corrected to ViT-L/14@336px, which the paper explicitly states is the canonical "CLIP" result (Section 2.5: "all results reported in this paper as 'CLIP' use this model which we found to perform best").
  • Results table row label corrected from "Zero-shot CLIP (ViT-L/14)" to "Zero-shot CLIP (ViT-L/14@336px)".
  • Note: both ViT-L/14 and ViT-L/14@336px achieve 76.2% on ImageNet (Table 11), so the 76.2% figure itself is correct. Status set to audited-fixed.
  • 6 other papers verified clean: LaneGCN (K=6 minADE=0.87m, minFDE=1.36m confirmed), LSS (nuScenes IoU table confirmed), Llama 2 (all benchmark numbers confirmed including previously-fixed GQA scope and LLaMA 1 65B TriviaQA=84.5, NQ=31.0), LLARVA (43.3%, +17.5%, +15% confirmed), LLMs Can't Plan (12%, 82%/70% case study results confirmed), LMDrive (all ablation scores confirmed).
  • Report written to .grounding/reports/batch14.md.

[2026-04-16] audit | Batch 5 factuality audit — 7 papers

  • Audited 7 paper wiki pages against source PDFs: diffusion-models-beat-gans-on-image-synthesis, dita-scaling-diffusion-transformer-for-generalist-vla-policy, emerging-properties-in-self-supervised-vision-transformers, end-to-end-driving-via-conditional-imitation-learning, end-to-end-learning-for-self-driving-cars, exploring-simple-siamese-representation-learning, fast-efficient-action-tokenization-for-vision-language-action-models.
  • Fixed 5 factual errors across 3 papers:
  • diffusion-models-beat-gans: BigGAN-deep 128x128 precision corrected 0.87→0.86, recall corrected 0.28→0.35; ADM-G 256x256 precision corrected 0.83→0.82, recall corrected 0.53→0.52; ADM-G+upsampling 512x512 precision corrected 0.87→0.84, recall corrected 0.42→0.53. (Source: paper Tables 5 and 6.)
  • emerging-properties-in-self-supervised-vision-transformers: Results bullet incorrectly stated "+3.5% over supervised ViT-S/16" — corrected to "+3.5% over best competing SSL methods (BYOL, MoCo v2, SwAV) on ViT-S/16". (Source: paper text, "DINO outperforms BYOL, MoCov2 and SwAV by +3.5%".)
  • exploring-simple-siamese-representation-learning: SimSiam 200-epoch ImageNet top-1 corrected 70.8→70.0. (Source: paper Table 4; 70.8 is the 400-epoch result.)
  • DITA, CIL (Codevilla), DAVE-2 (Bojarski/NVIDIA), and FAST all verified clean with no factual errors.
  • Report written to .grounding/reports/batch5.md.

[2026-04-11] audit | Targeted factuality audit — AlexNet + Hinton-van-Camp-1993

  • Audited imagenet-classification-with-deep-convolutional-neural-networks (AlexNet, NeurIPS 2012) against the NeurIPS proceedings PDF and official ILSVRC 2012 results page.
  • Fixed three errors: (1) Overview claimed single-model ILSVRC 2012 top-5 error was 18.9% — corrected to 15.3% ensemble / 16.4% single-model; 18.9% and 39.7% are ILSVRC-2010 results reported in the paper, not 2012 competition results. (2) Results bullet stated ensemble achieves 15.4% top-5 — corrected to 15.3% (official: 0.15315). (3) Overview margin-of-victory description updated to match corrected figures. Status set to audited-fixed.
  • Audited keeping-neural-networks-simple-by-minimizing-the-description-length-of-the-weights (Hinton & van Camp, COLT 1993) against Hinton's publication page and Semantic Scholar. Title, authors, year, and venue all verified correct. Primary PDF source was unreadable binary; relied on Hinton's own paper list and API metadata. Status set to audited-clean.

[2026-04-11] audit | Targeted factuality audit — transfuser + vad

  • Audited transfuser (2205.15997) and vad (2303.12077) against arXiv and AlphaXiv ground truth.
  • Fixed transfuser: auxiliary tasks listed "3D object detection" but the paper uses 2D vehicle detection (bounding boxes), not full 3D detection. Corrected in Key Contributions and ASCII diagram. Status set to audited-fixed.
  • Fixed vad: Overview incorrectly stated that VAD "directly influenced subsequent work like UniAD." UniAD was published at CVPR 2023 and is the prior state-of-the-art that VAD explicitly improves upon; VAD appeared at ICCV 2023. Corrected the direction of influence. Status set to audited-fixed.

[2026-04-11] audit | Targeted factuality audit — simlingo + talk2car

  • Audited simlingo and talk2car against arXiv and AlphaXiv ground truth.
  • Fixed one hard numerical error in simlingo: Action Dreaming success rates were reported as "28.22 to 72.96" but the paper (Table 5) gives baseline 24.52% and SimLingo 81.13%. Status set to audited-fixed.
  • talk2car checked out clean on all factual claims (dataset size 11,959 / 850 scenes, venue EMNLP-IJCNLP 2019, authors, AP50 metric). Status set to audited-clean.

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260413)

  • Audited another deterministic random sample of 10 paper pages against the primary papers, excluding the two earlier random batches to maximize new coverage.
  • Downgraded simlingo-vision-only-closed-loop-autonomous-driving-with-language-action-alignment to audited-needs-tightening after removing an overstated claim about how much Action Dreaming improves closed-loop driving.
  • The other 9 sampled pages remained materially faithful on review: alpamayo-r1, gaussianocc, drivetransformer, voxposer, senna, unisim, variational-lossy-autoencoder, emerging-properties-in-self-supervised-vision-transformers, and momad.
  • Corpus totals after the pass: 185 solid, 9 needs-tightening, 3 needs-correction, 0 unchecked.
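  • Each random pass draws with a fixed seed over the sorted slug list, so the sample is reproducible. A sketch of the draw (the exclusion set would come from wiki/queries/paper-fact-check-tracker.md; helper names are illustrative):

```python
# Sketch of the deterministic sampling used for these fact-check passes.
# Sorting before seeding keeps the draw reproducible across machines/runs.
import random
from pathlib import Path

def sample_pages(seed: int, k: int = 10,
                 exclude: frozenset[str] = frozenset()) -> list[str]:
    pages = sorted(p.stem for p in Path("wiki/sources/papers").glob("*.md"))
    eligible = [p for p in pages if p not in exclude]
    return random.Random(seed).sample(eligible, k)

# e.g. the pass above: sample_pages(20260413, k=10, exclude=earlier_batches)
```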

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260412)

  • Audited a second deterministic random sample of 10 paper pages against the primary papers.
  • Downgraded multi-scale-context-aggregation-by-dilated-convolutions to audited-needs-tightening after correcting the context-module description, and downgraded occgen-generative-multi-modal-3d-occupancy-prediction-for-autonomous-driving to audited-needs-correction after fixing swapped camera-only vs. LiDAR-only benchmark numbers.
  • Tightened lift-splat-shoot-encoding-images-from-arbitrary-camera-rigs-by-implicitly-unprojecting-to-3d while keeping it audited-needs-correction, removing unsupported transfer/runtime overclaims.
  • The other 7 sampled pages remained materially faithful on review; corpus totals now stand at 186 solid, 8 needs-tightening, 3 needs-correction, 0 unchecked.

[2026-04-11] audit | Remaining unchecked source pages

  • Audited the final 14 pages still marked paper-faithfullness: unchecked against their primary sources.
  • Marked 10 of those pages audited-solid and 4 audited-needs-tightening, with wording tightened on the course/blog-style entries cs231n, the-first-law-of-complexodynamics, the-unreasonable-effectiveness-of-recurrent-neural-networks, and understanding-lstm-networks.
  • Normalized 110 legacy audited-clean / audited-fixed labels to audited-solid so the corpus uses a single status legend.
  • Current corpus totals after the pass: 188 solid, 7 needs-tightening, 2 needs-correction, 0 unchecked.
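  • The label normalization itself is a one-line frontmatter rewrite per page; a sketch, assuming the status sits on a paper-faithfullness: line as the entries above use:

```python
# Sketch of the legacy-label normalization (illustrative, not the exact script):
# rewrite audited-clean / audited-fixed to audited-solid so the corpus uses a
# single status legend.
import re
from pathlib import Path

LEGACY = re.compile(r"^(paper-faithfullness:\s*)audited-(?:clean|fixed)\s*$", re.M)

changed = 0
for page in Path("wiki/sources/papers").glob("*.md"):
    text = page.read_text()
    new, n = LEGACY.subn(r"\1audited-solid", text)
    if n:
        page.write_text(new)
        changed += n
print(f"normalized {changed} pages")  # the pass above rewrote 110
```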

[2026-04-11] audit | Full paper-corpus metadata validation

  • Validated all wiki/sources/papers/ entries at the source-identity level against primary records: 197 total pages, 187 arXiv-backed entries, 10 non-arXiv entries.
  • Fixed three broken source references: solve-synergy-of-language-vision-and-end-to-end-networks-for-autonomous-driving, simlingo-vision-only-closed-loop-autonomous-driving-with-language-action-alignment, and para-drive-parallelized-architecture-for-real-time-autonomous-driving.
  • Recorded the metadata-validation outcome in wiki/queries/paper-fact-check-tracker.md; the follow-up source-faithfulness pass and status normalization are logged above.
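  • For the arXiv-backed entries, the source-identity check reduces to comparing page frontmatter against the primary record. A sketch of the arXiv side (the export API endpoint is real; the simple title-equality test is an assumption about the actual check):

```python
# Sketch of the arXiv identity check (illustrative): fetch the Atom record for
# a page's arxiv_id and compare titles; a mismatch flags a broken source ref.
import re
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_title(arxiv_id: str) -> str:
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    title = root.find(f"{ATOM}entry/{ATOM}title").text
    return re.sub(r"\s+", " ", title).strip()

def titles_match(page_title: str, arxiv_id: str) -> bool:
    return page_title.casefold().strip() == arxiv_title(arxiv_id).casefold()

# e.g. titles_match(frontmatter_title, "2205.15997") for the transfuser page
```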

[2026-04-11] audit | Random 10-paper fact-check sample

  • Audited a deterministic random sample of 10 paper pages against the original papers and updated paper-faithfullness on all 10 to audited-solid.
  • Corrected hard factual issues in 5 pages: carla-an-open-urban-driving-simulator, surroundocc-multi-camera-3d-occupancy-prediction-for-autonomous-driving, self-improving-embodied-foundation-models, bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding, and drivedreamer-towards-real-world-driven-world-models.
  • Most issues were benchmark-value mixups, loss-function misstatements, incorrect venue/training metadata, or unsupported scope claims.
  • Recorded the batch outcome in wiki/queries/paper-fact-check-tracker.md for future audit coverage.

[2026-04-06] ingest | Gemma 3 Technical Report

  • Added paper wiki page: wiki/sources/papers/gemma-3-technical-report.md
  • Updated: wiki/sources/llm-seminal-papers.md (new open-weight multimodal section + ingested individually list), wiki/concepts/foundation-models.md (LLM section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~1120 (user-provided)
  • Tags: transformer, language-modeling, multimodal, foundation-model, vision-language-model, knowledge-distillation, mixture-of-experts, scaling, multilingual

[2026-04-06] ingest | Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)

  • Added paper wiki page: wiki/sources/papers/scaling-instruction-finetuned-language-models.md
  • Updated: wiki/sources/llm-seminal-papers.md (instruction tuning section + ingested individually list), wiki/concepts/foundation-models.md (new instruction tuning subsection)
  • Citations: ~3987 (user-provided)
  • Tags: nlp, transformer, instruction-tuning, chain-of-thought, foundation-model, language-modeling, scaling, multi-task

[2026-04-06] ingest | Qwen3 Technical Report

  • Added paper wiki page: wiki/sources/papers/qwen3-technical-report.md
  • Updated: wiki/sources/llm-seminal-papers.md (core architecture list + ingested individually list), wiki/concepts/foundation-models.md (LLM section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~3706 (user-provided)
  • Tags: nlp, language-modeling, transformer, mixture-of-experts, foundation-model, reasoning, multilingual, reinforcement-learning

[2026-04-06] ingest | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • Added paper wiki page: wiki/sources/papers/deepseek-r1-incentivizing-reasoning-capability-in-llms-via-reinforcement-learning.md
  • Updated: wiki/sources/llm-seminal-papers.md (added "Reasoning via reinforcement learning" section + ingested individually list), wiki/concepts/machine-learning.md (reasoning section + key papers table), wiki/concepts/foundation-models.md (chain-of-thought section + key papers table), wiki/queries/open-questions.md (Q8 partial answer on GRPO for driving), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~1920 (user-provided)
  • Tags: nlp, reinforcement-learning, language-modeling, reasoning, chain-of-thought, foundation-model, transformer, alignment

[2026-04-06] ingest | Tree of Thoughts: Deliberate Problem Solving with Large Language Models

  • Added paper wiki page: wiki/sources/papers/tree-of-thoughts-deliberate-problem-solving-with-large-language-models.md
  • Updated: wiki/sources/llm-seminal-papers.md (added "Reasoning and search" section + ingested individually list)
  • Citations: ~3561 (user-provided)
  • Tags: nlp, reasoning, language-modeling, chain-of-thought, search, foundation-model, prompting

[2026-04-06] ingest | Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

  • Added paper wiki page: wiki/sources/papers/gemini-25-pushing-the-frontier-with-advanced-reasoning-multimodality-long-context-and-next-generation-agentic-capabilities.md
  • Updated: wiki/sources/llm-seminal-papers.md (multimodal bridge section + ingested individually list), wiki/concepts/foundation-models.md (transformer and scaling section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~1943 (user-provided)
  • Tags: nlp, multimodal, foundation-model, transformer, mixture-of-experts, language-modeling, chain-of-thought, reasoning, agentic

[2026-04-06] ingest | On the Opportunities and Risks of Foundation Models

  • Added paper wiki page: wiki/sources/papers/on-the-opportunities-and-risks-of-foundation-models.md
  • Updated: wiki/concepts/foundation-models.md (new "Defining the paradigm" section + key papers table), wiki/concepts/machine-learning.md (new "Foundation model paradigm" section + key papers table), wiki/sources/llm-seminal-papers.md (new "Foundational surveys and frameworks" section)
  • Citations: ~6057 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: foundation-model, nlp, computer-vision, robotics, multimodal, transformer, survey

[2026-04-06] ingest | BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

  • Added paper wiki page: wiki/sources/papers/bevnext-reviving-dense-bev-frameworks-for-3d-object-detection.md
  • Updated: wiki/concepts/perception.md (BEV revolution section + key papers table), wiki/sources/autonomous-driving-seminal-papers.md (perception seed list)
  • Citations: ~80 (user-provided; Semantic Scholar and AlphaXiv overview unavailable)
  • Tags: autonomous-driving, perception, bev, transformer, computer-vision, 3d-object-detection, cnn, depth-estimation

[2026-04-06] ingest | Emerging Properties in Self-Supervised Vision Transformers (DINO)

  • Added paper wiki page: wiki/sources/papers/emerging-properties-in-self-supervised-vision-transformers.md
  • Updated: wiki/concepts/foundation-models.md (vision-language models section), wiki/sources/llm-seminal-papers.md (multimodal bridge list)
  • Citations: ~10798 (user-provided)
  • Tags: computer-vision, self-supervised-learning, transformer, vision-transformer, knowledge-distillation, image-classification, foundation-model

[2026-04-06] ingest | YOLOv10: Real-Time End-to-End Object Detection

  • Added paper wiki page: wiki/sources/papers/yolov10-real-time-end-to-end-object-detection.md
  • Updated: wiki/concepts/perception.md (key papers table)
  • Citations: ~5988 (user-provided)
  • Tags: computer-vision, object-detection, cnn, end-to-end, real-time, perception

[2026-04-06] ingest | Learning Transferable Visual Models From Natural Language Supervision (CLIP)

  • Updated paper wiki page from seed to active: wiki/sources/papers/learning-transferable-visual-models-from-natural-language-supervision.md
  • Updated: frontmatter (venue ICML 2021, citations 57987, proper tags, arxiv_id), added results comparison table, added linear probe figure, expanded Connections with descriptive annotations
  • Already cross-referenced in: wiki/concepts/foundation-models.md, wiki/concepts/machine-learning.md, wiki/concepts/vision-language-action.md, wiki/sources/llm-seminal-papers.md
  • Citations: 57987 (user-provided)
  • Tags: computer-vision, multimodal, foundation-model, transformer, cnn, image-classification, nlp

[2026-04-06] update | Training Compute-Optimal Large Language Models (Chinchilla)

  • Updated existing paper wiki page: wiki/sources/papers/training-compute-optimal-large-language-models.md
  • Updated frontmatter: type source-summary -> paper, citations 2973 -> 4116, added arxiv_id, updated tags
  • Enriched connections with GPT-3 link and descriptive annotations
  • Citations: ~4116 (user-provided)
  • Tags: nlp, language-modeling, transformer, foundation-model, scaling

[2026-04-05] ingest | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

  • Added paper wiki page: wiki/sources/papers/voxposer-composable-3d-value-maps-for-robotic-manipulation-with-language-models.md
  • Updated: wiki/sources/vla-and-driving.md (general VLA section), wiki/concepts/robotics.md (key papers table), wiki/queries/open-questions.md (Q4/Q5 partial answers)
  • Citations: ~450 (Semantic Scholar API was unavailable; approximate count used)
  • Tags: robotics, manipulation, language-modeling, multimodal, planning, zero-shot
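  • The "Citations:" lines throughout this log come from one Semantic Scholar Graph API call per paper, with a user-provided figure as fallback when the fetch fails. A sketch (the endpoint and citationCount field are real; the fallback convention is this log's, not the API's):

```python
# Sketch of the citation lookup (illustrative). When the Graph API is
# unreachable, the log records a user-provided approximate count instead.
import json
import urllib.request

def citation_count(arxiv_id: str, fallback: int | None = None) -> int | None:
    url = (f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}"
           "?fields=citationCount")
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["citationCount"]
    except Exception:
        return fallback  # logged as e.g. "~450 (Semantic Scholar API was unavailable)"
```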

[2026-04-05] scaffold | Initial vault bootstrap

  • Created the initial wiki structure for ML, autonomous driving, robotics, VLA, e2e systems, perception, prediction, planning, and foundation models.
  • Added AGENTS.md to define ingest, query, and lint workflows for the LLM maintainer.
  • Added seed source-program pages for canonical paper collection and future ingest.
  • Added a Flask frontend scaffold targetable to Railway for hosted browsing.

[2026-04-05] ingest | Initial corpus batch 01

  • Added 27 source-summary pages under wiki/sources/papers/.
  • Seeded the first real corpus across autonomous driving, robotics/VLA, and foundation-model papers.
  • Added wiki/sources/initial-corpus-batch-01.md to group the batch and make it navigable.
  • Updated the top-level index so the new corpus is discoverable from the vault entry point.

[2026-04-05] ingest | Ilya Top 30 corpus

  • Ingested all 30 papers from Ilya Sutskever's canonical reading list as full source-summary pages.
  • Updated wiki/sources/ilya-top-30.md from placeholder to canonical list with thematic clusters and wiki links.
  • Papers span architectures (Transformer, ResNet, AlexNet, ViT), sequence modeling (RNN, LSTM, Pointer Networks), information theory (MDL, Kolmogorov complexity), complexity theory, scaling laws, diffusion models, and chain-of-thought prompting.

[2026-04-05] ingest | AutoVLA corpus (batch 02)

  • Ingested 18 driving VLA papers (2018–2025) from the AutoVLA analysis corpus.
  • Updated 3 existing papers (CIL, DriveLM, LMDrive) from seed to complete status with rich technical content.
  • Created 15 new paper pages: BDD-X, Talk2Car, GPT-Driver, DriveGPT4, VLP, Reason2Drive, SimLingo, ORION, EMMA, DriveMLM, Alpamayo-R1, Senna, WoTE, AlphaDrive, DriveMoE.
  • Updated wiki/sources/vla-and-driving.md with three-wave taxonomy and design axes table.
  • Updated wiki/sources/autonomous-driving-seminal-papers.md with batch 02 entries.

[2026-04-06] ingest | Batch 03 -- robotics VLA, world models, driving transformers

  • Added 5 new paper pages:
  • GR00T N1 (2503.14734) -- open dual-system VLA for humanoid robots, 602 citations
  • Gemini Robotics (2503.20020) -- Gemini 2.0 VLA for physical manipulation, cloud-local hybrid
  • Cosmos (2501.03575) -- world foundation model platform for physical AI, 515 citations
  • AutoVLA (2506.13757) -- adaptive dual-process reasoning VLA for driving with RL, 110 citations
  • DriveTransformer (2503.07656) -- parallel-task sparse transformer for E2E driving, ICLR 2025, 91 citations
  • Updated wiki/sources/vla-and-driving.md with batch 03 entries and Wave 3 additions (AutoVLA, DriveTransformer)
  • Updated wiki/concepts/robotics.md with GR00T N1, Gemini Robotics, and Cosmos sections and key papers table
  • Updated wiki/concepts/foundation-models.md with world foundation models and robotics foundation models sections
  • Updated wiki/concepts/autonomous-driving.md with AutoVLA and DriveTransformer in Era 3 and key papers table
  • Updated wiki/concepts/vision-language-action.md with AutoVLA adaptive reasoning in Wave 3
  • Updated wiki/concepts/end-to-end-architectures.md with AutoVLA and DriveTransformer architectural variants and key papers
  • Updated wiki/concepts/planning.md with AutoVLA and DriveTransformer in key papers table

[2026-04-05] ingest | Batch 04 (self-supervised driving, temporal E2E, BEV, world models, embodied RL)

  • Ingested 5 papers: S4-Driver, BridgeAD, Self-Improving Embodied Foundation Models, GaussianLSS, Drive-OccWorld.
  • S4-Driver (CVPR 2025, 16 cites): self-supervised MLLM for annotation-free driving, achieves 0.31m L2 on nuScenes beating supervised methods.
  • BridgeAD (CVPR 2025, 31 cites): multi-step temporal queries for history-enhanced E2E driving, 19% L2 improvement over UniAD, strong closed-loop safety.
  • Self-Improving EFM (arXiv 2025, 18 cites): Google DeepMind, steps-to-go reward enables autonomous robot self-improvement, 10% data + 1% RL beats 80% imitation.
  • GaussianLSS (CVPR 2025, 18 cites): depth uncertainty + Gaussian Splatting for BEV perception, within 0.4% of SOTA at 2.5x speed and 3.8x less memory.
  • Drive-OccWorld (AAAI 2025, 49 cites): 4D occupancy world model for planning, 33% L2 reduction at 1s vs UniAD, action-controllable neural simulation.
  • Updated wiki/sources/vla-and-driving.md with Batch 04 entries.
  • Updated concept pages: perception, planning, prediction, robotics, autonomous-driving, foundation-models, end-to-end-architectures.

[2026-04-05] ingest | Batch 05 (VLA, world models, momentum planning, distillation)

  • Ingested 5 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • OpenDriveVLA (2503.23463, AAAI 2026, 109 cites) -- open-source VLA with hierarchical 3D scene queries, 0.33m L2 at 0.5B-7B scale
  • HERMES (2501.14729, arXiv 2025, 38 cites) -- unified world model for simultaneous 3D scene understanding and future generation via world queries
  • MomAD (2503.03125, CVPR 2025, 60 cites) -- momentum-aware planning for temporal consistency, 0.60m L2, +16.3% closed-loop success vs VAD
  • GaussianWorld (2412.10373, CVPR 2025, 59 cites) -- 3D Gaussian world model for streaming occupancy prediction, +2% mIoU without inference overhead
  • DiMA (2501.09757, CVPR 2025, 34 cites) -- distill MLLM reasoning into vision planner, 80% collision reduction, LLM discarded at inference
  • Updated wiki/sources/vla-and-driving.md with Wave 3 additions and Batch 05 ingested papers list
  • Updated wiki/concepts/autonomous-driving.md with Era 3 additions and key papers table
  • Updated wiki/concepts/planning.md with MomAD, OpenDriveVLA, DiMA in key papers table
  • Updated wiki/concepts/perception.md with GaussianWorld and HERMES in occupancy section and key papers table
  • Updated wiki/concepts/prediction.md with MomAD temporal consistency section and key papers table
  • Updated wiki/concepts/end-to-end-architectures.md with VLA variants, open problems, and key papers table
  • Updated wiki/concepts/vision-language-action.md with OpenDriveVLA and DiMA in Wave 3

[2026-04-05] ingest | Batch 06 (diffusion/flow planning, scaling laws, RL planning, VLM-E2E synergy, robotics VLA/diffusion)

  • Ingested 8 papers spanning generative planning, scaling laws, RL-based planning, VLM-E2E synergy, embodied CoT, and diffusion robotics:
  • DiffusionDrive (2411.15139, CVPR 2025 Highlight) -- truncated diffusion for E2E driving, 88.1 PDMS on NAVSIM, 2 denoising steps at 45 FPS
  • DriveGPT (2412.14415, ICML 2025, Waymo) -- first scaling laws for driving behavior models, 1.1B params, 100M+ demonstrations
  • GoalFlow (2503.05689, CVPR 2025) -- goal-driven flow matching, 90.3 PDMS on NAVSIM with single-step inference
  • LAW (2406.08481, ICLR 2025) -- self-supervised latent world model, SOTA on nuScenes+NAVSIM+CARLA
  • CarPlanner (2502.19908, CVPR 2025) -- first RL planner to beat IL+rule-based on nuPlan, consistency-regularized autoregressive
  • SOLVE (2505.16805, CVPR 2025) -- Sequential Q-Former + Trajectory CoT for VLM-E2E synergy
  • ECoT (2407.08693, Stanford/Berkeley 2025) -- embodied Chain-of-Thought for VLAs, +28% generalization on OpenVLA
  • RDT-1B (2410.07864, ICLR 2025, Tsinghua) -- largest diffusion transformer for bimanual manipulation, 1.2B params
  • Updated wiki/sources/vla-and-driving.md with Wave 3 additions (6 driving papers) and Batch 06 ingested papers list
  • Key themes: generative trajectory planning (diffusion vs. flow matching), scaling laws for driving, RL surpassing IL, CoT reasoning for embodied agents, bimanual diffusion transformers

[2026-04-05] ingest | Batch 07 (cross-embodiment robotics VLA + 3D occupancy perception)

  • Ingested 8 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • UniAct (2501.10105, CVPR 2025, 60 cites) -- universal action space via VQ codebooks for cross-embodiment VLA, 0.5B beats 14x larger models
  • Dita (2503.19757, ICCV 2025, 54 cites) -- DiT-based VLA with in-context diffusion conditioning, 10-shot real-world adaptation, 334M params
  • Embodiment Scaling Laws (2505.05753, CoRL 2025, 10 cites) -- first power-law scaling for embodiment diversity across ~1000 robots, zero-shot sim-to-real
  • SmolVLA (2506.01844, arXiv 2025, 224 cites) -- 450M VLA from Hugging Face competitive with 3.3B models, async inference, single-GPU training
  • GaussianFormer-2 (2412.04384, CVPR 2025, 57 cites) -- probabilistic Gaussian superposition, 8.9% of Gaussians needed, 51% memory savings
  • OccMamba (2408.09859, CVPR 2025, 32 cites) -- first Mamba-based occupancy network, +5.1% IoU, 65% faster inference via linear complexity
  • GaussTR (2412.13193, CVPR 2025, 41 cites) -- self-supervised 3D occupancy via foundation model alignment, zero-shot 12.27 mIoU without 3D annotations
  • BEVDiffuser (2502.19694, CVPR 2025, 14 cites) -- training-only diffusion for BEV denoising, +12.3% mAP, zero inference overhead
  • Updated wiki/sources/vla-and-driving.md with UniAct, Dita, SmolVLA in general VLA list and Batch 07 ingested papers
  • Updated wiki/concepts/robotics.md with cross-embodiment VLA section (UniAct, Dita, SmolVLA, Embodiment Scaling Laws) and key papers table
  • Updated wiki/concepts/perception.md with GaussianFormer-2, OccMamba, GaussTR, BEVDiffuser in occupancy section and key papers table
  • Key themes: universal action representations vs. model scale, diffusion/flow VLA architectures, embodiment as a scaling axis, efficient 3D occupancy (Gaussian/Mamba/self-supervised/diffusion denoising)

[2026-04-05] ingest | Batch 08 (Physical Intelligence VLA family + robotics VLA advances)

  • Ingested 8 papers spanning the Physical Intelligence pi0 family, VLA training methodology, action tokenization, spatial reasoning, and dexterous manipulation:
  • pi0 (2410.24164, arXiv 2024, 1381 cites) -- flow matching VLA on PaliGemma 3B, 7 platforms, 68 tasks. The reference VLA from Physical Intelligence.
  • pi0.5 (2504.16054, CoRL 2025, 681 cites) -- hierarchical VLA with five-source co-training, first to do 10-15 min tasks in unseen real homes
  • pi0.6 (2511.14759, arXiv 2025, 93 cites) -- RECAP offline RL for VLA self-improvement, doubled task throughput, halved failure rates
  • FAST (2501.09747, RSS 2025, 353 cites) -- DCT+BPE action tokenizer for VLAs, 2x-13x compression, 5x faster training
  • OpenVLA-OFT (2502.19645, arXiv 2025, 364 cites) -- parallel decoding fine-tuning recipe, 76.5% to 97.1% on LIBERO, 26x inference speedup
  • SpatialVLA (2501.15830, arXiv 2025, 292 cites) -- Ego3D position encoding + adaptive action grids, 1.1M real episodes, 73% spatial accuracy
  • DexVLA (2502.05855, CoRL 2025, 140 cites) -- 2B VLM + 1B diffusion expert, 0.92 success on shirt folding, three-stage embodied curriculum
  • Knowledge Insulation (2505.23705, NeurIPS 2025 Spotlight, 68 cites) -- stop-gradient + co-training prevents VLM degradation during VLA training, 7.5x faster convergence
  • Updated wiki/sources/vla-and-driving.md with General VLA foundations entries and Batch 08 ingested papers list
  • Key themes: flow matching vs. diffusion for action generation, action tokenization (FAST) vs. continuous (pi0), VLA self-improvement via RL (pi0.6), knowledge preservation during fine-tuning (insulation), spatial reasoning (SpatialVLA), scaling action experts (DexVLA)

[2026-04-05] ingest | Batch 09 (world models, parallel E2E, generative driving, evaluation, LLM-for-driving)

  • Ingested 5 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • DriveDreamer (2309.09777, ECCV 2024, ~452 cites) -- first real-world-driven world model for driving, diffusion-based Auto-DM with two-stage training, 0.29m L2, 21% collision reduction
  • PARA-Drive (CVPR 2024, NVIDIA, ~179 cites) -- systematic design space exploration of modular E2E stacks, fully parallel architecture with implicit BEV communication, 2-3x speedup
  • GenAD (2402.11502, ECCV 2024, ~189 cites) -- E2E driving as generative modeling, VAE trajectory prior + instance-centric scene representation, 0.91m L2, 0.43% collision rate SOTA
  • Is Ego Status All You Need? (2312.03031, CVPR 2024, NVIDIA/Nanjing, ~199 cites) -- exposes that simple Ego-MLP matches complex E2E models on nuScenes open-loop, proposes Curb Collision Rate metric
  • Driving with LLMs (2310.01957, ICRA 2024, Wayve, ~328 cites) -- first concrete LLM-for-driving with object-level vector modality, LLaMA-7B + LoRA, explainable decisions
  • Updated wiki/sources/vla-and-driving.md with Wave 2 additions and Batch 09 ingested papers list
  • Key themes: world models for driving (DriveDreamer), parallel vs. sequential E2E design (PARA-Drive), generative trajectory modeling (GenAD), evaluation methodology critique (Ego Status), LLM integration for explainability (Driving with LLMs)

[2026-04-05] ingest | Batch 10 (orchestration, cross-embodiment, async planning, Gaussian representations, occupancy world models)

  • Ingested 6 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • AutoRT (2401.12963, arXiv 2024, Google DeepMind, 110 cites) -- foundation model orchestration for large-scale robot data collection, 77K episodes, 53 robots, Robot Constitution for safety
  • HPT (2409.20537, NeurIPS 2024, 134 cites) -- stem-trunk-head architecture for cross-embodiment scaling, first robotics scaling laws across data/diversity/model size/compute, 10-30% sim gains, 20%+ real gains
  • AsyncDriver (2406.14556, ECCV 2024, 41 cites) -- asynchronous LLM-planner decoupling, Llama2-13B guidance at sparse intervals, ~40% cost reduction with ~1% accuracy loss on nuPlan
  • GaussianFormer (2405.17429, ECCV 2024, 128 cites) -- sparse 3D semantic Gaussian occupancy representation, 5-6x memory reduction vs dense methods with ~2% mIoU trade-off
  • DrivingGaussian (2312.07920, CVPR 2024, 398 cites) -- composite Gaussian splatting for dynamic driving scenes, IS3G + CDGG, 28.74 PSNR on nuScenes, LiDAR-prior integration
  • OccWorld (2311.16038, ECCV 2024, 198 cites) -- original 3D occupancy world model, VQ-VAE tokenization + GPT-like spatial-temporal transformer, joint scene-ego forecasting, competitive with UniAD sans HD maps
  • Updated wiki/sources/vla-and-driving.md with Batch 10 ingested papers list
  • Updated wiki/concepts/robotics.md with AutoRT/HPT data collection and cross-embodiment sections + key papers
  • Updated wiki/concepts/perception.md with GaussianFormer, DrivingGaussian, OccWorld in occupancy section + key papers
  • Updated wiki/concepts/planning.md with AsyncDriver and OccWorld in key papers
  • Updated wiki/concepts/autonomous-driving.md with all 4 driving papers in key papers table
  • Updated wiki/concepts/foundation-models.md with AutoRT and HPT in key papers table
  • Key themes: foundation models as orchestrators (not just controllers), robotics scaling laws, asynchronous LLM integration for real-time planning, Gaussian representations as efficient alternative to dense voxels, occupancy-based world models

[2026-04-05] ingest | Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

  • Added paper wiki page: wiki/sources/papers/think-twice-before-driving-towards-scalable-decoders-for-end-to-end-autonomous-driving.md
  • Updated: wiki/concepts/planning.md, wiki/concepts/end-to-end-architectures.md, wiki/sources/autonomous-driving-seminal-papers.md
  • Citations: unavailable (Semantic Scholar fetch failed)
  • Tags: autonomous-driving, end-to-end, planning, imitation-learning, transformer, perception

[2026-04-05] synthesis | Wiki-wide updates from new corpus

  • Updated wiki/concepts/vision-language-action.md from seed to active with three-wave analysis, design axes, and emerging consensus.
  • Updated wiki/syntheses/research-thesis.md with AutoVLA evidence (supporting, refining, and partially challenging the thesis). Confidence raised to medium.
  • Updated wiki/queries/open-questions.md with 8 new questions from AutoVLA analysis and partial answers.
  • Updated wiki/taxonomies/research-map.md with source program table, routing guide, and VLA sub-taxonomy.
  • Updated index.md descriptions and log.md with ingest records.

[2026-04-05] ingest | Batch 11 (Gaussian occupancy cluster, radar fusion, sparse E2E, pseudo-simulation, robotics VLA)

  • Ingested 8 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • GaussianOcc (2408.11447, ICCV 2025, 47 cites) -- fully self-supervised 3D occupancy via Gaussian splatting (no GT pose), 2.7x faster training, 5x faster rendering
  • GaussianFlowOcc (2502.17288, ICCV 2025, 19 cites) -- sparse Gaussian occupancy + temporal flow, 51%+ mIoU improvement, 50x faster inference with 2D pseudo-labels
  • RaCFormer (2412.12725, CVPR 2025, 15 cites) -- radar-camera fusion via query-based dual-view attention + Doppler dynamic catcher, 64.9% mAP surpassing LiDAR-only
  • GaussRender (2502.05040, ICCV 2025, 13 cites) -- plug-and-play Gaussian rendering loss for 3D-2D projective consistency, +3.75 mIoU on TPVFormer, zero inference overhead
  • VPP (2412.14803, ICML 2025 Spotlight, 139 cites) -- video diffusion as predictive visual encoder for robot policies, +18.6% on CALVIN, +31.6% real-world dexterous
  • Helix (Figure AI Technical Report, Feb 2025) -- first whole-body humanoid VLA, System 1+2 dual architecture, 35 DoF at 200Hz, dual-robot coordination
  • NAVSIM v2 (2506.04218, CoRL 2025, 62 cites) -- pseudo-simulation evaluation via 3D Gaussian Splatting, R^2=0.8 with closed-loop, de facto E2E driving benchmark
  • SparseDrive (2405.19620, ICRA 2025, 181 cites) + SparseDriveV2 (2603.29163, 2026) -- fully sparse E2E driving with factorized trajectory vocabulary (262K candidates), 92.0 PDMS NAVSIM SOTA
  • Updated wiki/concepts/perception.md with Gaussian occupancy cluster (GaussianOcc, GaussianFlowOcc, GaussRender), radar-camera fusion (RaCFormer), and key papers table
  • Updated wiki/concepts/autonomous-driving.md with all 8 papers in key papers table
  • Updated wiki/concepts/planning.md with SparseDrive, SparseDriveV2, NAVSIM v2 in key papers table
  • Updated wiki/concepts/end-to-end-architectures.md with SparseDrive, SparseDriveV2, NAVSIM v2 in key papers table
  • Updated wiki/concepts/robotics.md with VPP and Helix in key papers table
  • Updated wiki/concepts/vision-language-action.md with VPP and Helix in robotics VLA frontier section
  • Key themes: Gaussian splatting as unified primitive for occupancy/BEV/simulation, radar replacing LiDAR, scoring-based planning scaling laws, pseudo-simulation bridging open/closed-loop, video diffusion for robot policies, dual-system humanoid VLA

[2026-04-05] ingest | Drive as You Speak

  • Added paper wiki page: wiki/sources/papers/drive-as-you-speak-enabling-human-like-interaction-with-large-language-models-in-autonomous-vehicles.md
  • Updated: wiki/sources/vla-and-driving.md, wiki/queries/open-questions.md
  • Citations: 0 (Semantic Scholar unavailable)
  • Tags: autonomous-driving, llm, planning, nlp, multimodal, human-interaction

[2026-04-05] ingest | Agent-Driver: A Language Agent for Autonomous Driving

  • Added paper wiki page: wiki/sources/papers/a-language-agent-for-autonomous-driving.md
  • Updated: wiki/sources/vla-and-driving.md, wiki/concepts/planning.md, wiki/taxonomies/research-map.md
  • Citations: 140 (Semantic Scholar)
  • Tags: autonomous-driving, llm, planning, reasoning, chain-of-thought, end-to-end

[2026-04-05] ingest | RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

  • Added paper wiki page: wiki/sources/papers/robocat-a-self-improving-generalist-agent-for-robotic-manipulation.md
  • Updated: wiki/concepts/robotics.md, wiki/sources/vla-and-driving.md
  • Citations: 0 (Semantic Scholar fetch failed)
  • Tags: robotics, transformer, imitation-learning, multimodal, foundation-model, multi-embodiment

[2026-04-05] ingest | OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

  • Added paper wiki page: wiki/sources/papers/occformer-dual-path-transformer-for-vision-based-3d-semantic-occupancy-prediction.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md, wiki/taxonomies/research-map.md
  • Citations: ~280 (Semantic Scholar fetch failed, estimated from known data)
  • Tags: autonomous-driving, perception, transformer, computer-vision, occupancy, 3d-semantic-occupancy, end-to-end

[2026-04-05] ingest | BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

  • Added paper wiki page: wiki/sources/papers/bevformer-v2-adapting-modern-image-backbones-to-birds-eye-view-recognition-via-perspective-supervision.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md, wiki/sources/papers/bevformer-learning-birds-eye-view-representation-from-multi-camera-images-via-spatiotemporal-transformers.md
  • Citations: ~250 (Semantic Scholar fetch failed, estimated)
  • Tags: autonomous-driving, perception, bev, transformer, computer-vision, end-to-end

[2026-04-05] ingest | SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving

  • Added paper wiki page: wiki/sources/papers/surroundocc-multi-camera-3d-occupancy-prediction-for-autonomous-driving.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md
  • Citations: ~350 (Semantic Scholar fetch failed, estimated)
  • Tags: autonomous-driving, perception, occupancy, 3d-reconstruction, computer-vision, multi-camera, cnn

[2026-04-05] ingest | FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

  • Added paper wiki page: wiki/sources/papers/flashocc-fast-and-memory-efficient-occupancy-prediction-via-channel-to-height-plugin.md
  • Updated: wiki/concepts/perception.md (occupancy section + key papers table)
  • Citations: 0 (Semantic Scholar fetch failed)
  • Tags: autonomous-driving, perception, 3d-occupancy, bev, computer-vision, cnn, efficient-inference

[2026-04-06] update | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

  • Updated existing paper wiki page: wiki/sources/papers/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale.md
  • Updated citations: 60022 → 91128 (user-provided)
  • Updated frontmatter: year to 2021, type to paper, status to active, added arxiv_id field, added foundation-model tag
  • Fixed broken wikilink in wiki/sources/ilya-top-30.md (entry #29 pointed to wrong slug)
  • Expanded Connections section with CLIP and BERT links
  • Tags: ilya-30, vision-transformer, computer-vision, transformer, image-classification, foundation-model

[2026-04-05] ingest | FB-BEV: BEV Representation from Forward-Backward View Transformations

  • Added paper wiki page: wiki/sources/papers/fb-bev-bev-representation-from-forward-backward-view-transformations.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md
  • Citations: ~150 (Semantic Scholar unavailable, estimated)
  • Tags: autonomous-driving, perception, bev, transformer, computer-vision

[2026-04-06] ingest | Diffusion Models Beat GANs on Image Synthesis

  • Added paper wiki page: wiki/sources/papers/diffusion-models-beat-gans-on-image-synthesis.md
  • Updated: wiki/sources/papers/denoising-diffusion-probabilistic-models.md (added connection), wiki/concepts/machine-learning.md (self-supervised section + key papers table)
  • Citations: 13548 (user-provided)
  • Tags: computer-vision, diffusion, generative-models, image-generation, classifier-guidance

[2026-04-06] ingest | Exploring Simple Siamese Representation Learning (SimSiam)

  • Added paper wiki page: wiki/sources/papers/exploring-simple-siamese-representation-learning.md
  • Updated: wiki/concepts/machine-learning.md (self-supervised section + key papers table), wiki/concepts/foundation-models.md (vision-language models section)
  • Citations: 6444 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: computer-vision, self-supervised-learning, representation-learning, siamese-networks, contrastive-learning

[2026-04-06] ingest | Prefix-Tuning: Optimizing Continuous Prompts for Generation

  • Added paper wiki page: wiki/sources/papers/prefix-tuning-optimizing-continuous-prompts-for-generation.md
  • Updated: wiki/sources/llm-seminal-papers.md (added PEFT section + wikilink), wiki/concepts/foundation-models.md (LLM section with PEFT context), wiki/concepts/machine-learning.md (new parameter-efficient adaptation section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: 6753 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: nlp, transformer, parameter-efficient, language-modeling, fine-tuning

[2026-04-06] ingest | High-Resolution Image Synthesis with Latent Diffusion Models

  • Added paper wiki page: wiki/sources/papers/high-resolution-image-synthesis-with-latent-diffusion-models.md
  • Updated: wiki/concepts/foundation-models.md (diffusion models section + key papers table), wiki/taxonomies/research-map.md (added generative models routing)
  • Citations: 31987 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: diffusion, generative-models, computer-vision, image-generation, foundation-model, transformer

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260414)

  • Audited another non-overlapping deterministic sample of 10 paper summaries against the original papers.
  • Downgraded pi0-a-vision-language-action-flow-model-for-general-robot-control to audited-needs-tightening after removing an unsupported cross-embodiment transfer claim.
  • Downgraded rdt-1b-a-diffusion-foundation-model-for-bimanual-manipulation to audited-needs-tightening after correcting the limitation section to reflect real-robot, not simulation-first, evaluation.
  • Corpus status after this pass: audited-solid 183, audited-needs-tightening 11, audited-needs-correction 3, unchecked 0.

[2026-04-11] audit | Random 20-paper serious-error check (seed 20260415)

  • Audited a fresh non-overlapping deterministic sample of 20 paper summaries against the primary papers, with emphasis on serious benchmark/setup errors rather than soft phrasing drift.
  • Downgraded flashocc-fast-and-memory-efficient-occupancy-prediction-via-channel-to-height-plugin to audited-needs-correction after fixing materially wrong mIoU / speed / memory claims and correcting the plug-in baseline framing.
  • Downgraded rt-2-vision-language-action-models-transfer-web-knowledge-to-robotic-control to audited-needs-tightening after removing an unsupported quantified chain-of-thought improvement claim.
  • The other 18 sampled pages did not show serious factual failures and were left unchanged; actual current frontmatter totals are 164 solid, 15 clean, 16 fixed, 1 needs-tightening, and 1 needs-correction.

[2026-04-11] audit | Random 20-paper sample (seed 20260416)

  • Audited a fresh 20-paper sample chosen to avoid overlap with all previous random samples recorded in wiki/queries/paper-fact-check-tracker.md.
  • Corrected gaussianformer-2-probabilistic-gaussian-superposition-for-efficient-3d-occupancy-prediction after the summary mixed a 25.6K-Gaussian ablation with the 12.8K-Gaussian nuScenes main result and therefore cited the wrong main-result resource figures.
  • Tightened chauffeurnet-learning-to-drive-by-imitating-the-best-and-synthesizing-the-worst by replacing unsupported aggregate collision/off-road claims with the paper's actual scenario-based closed-loop findings and real-world deployment description.
  • The other 18 sampled pages were materially consistent with their source papers.
  • Current frontmatter totals: audited-solid 162, audited-clean 15, audited-fixed 18, audited-needs-tightening 1, audited-needs-correction 1.

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260417)

  • Audited a fresh non-overlapping deterministic sample of 10 paper summaries against the original papers.
  • Corrected vectornet-encoding-hd-maps-and-agent-dynamics-from-vectorized-representation after the summary understated the ConvNet FLOP gap; the paper reports 10.56G vs 0.041G FLOPs (about 260x fewer / 99.6% lower), not 70% fewer.
  • Corrected smolvla-a-vision-language-action-model-for-affordable-robotics after the overview and results overstated the memory advantage; the paper says 6x less memory than pi0, not 7x.
  • The other 8 sampled pages were materially consistent with their source papers.
  • Current frontmatter totals: audited-solid 160, audited-clean 15, audited-fixed 20, audited-needs-tightening 1, audited-needs-correction 1.
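  • The frontmatter totals quoted in these passes are a straight tally over the corpus; a sketch, under the same paper-faithfullness: key assumption as the normalization sketch above:

```python
# Sketch of the status tally behind the "frontmatter totals" lines
# (illustrative; assumes one paper-faithfullness: line per page).
import re
from collections import Counter
from pathlib import Path

STATUS = re.compile(r"^paper-faithfullness:\s*(\S+)", re.M)

totals: Counter[str] = Counter()
for page in Path("wiki/sources/papers").glob("*.md"):
    m = STATUS.search(page.read_text())
    totals[m.group(1) if m else "unchecked"] += 1

print(dict(totals))  # e.g. {'audited-solid': 160, 'audited-clean': 15, ...}
```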