Log

[2026-04-16] audit | Batch 9 factuality audit — 7 papers

  • Audited 7 paper wiki pages against source PDFs (Mistral OCR): drivetransformer, drivevlm, driving-gaussian, driving-with-llms, ecot, embodiment-scaling-laws, fb-bev.
  • Fixed 3 factual errors in fb-bev:
  • Results table: removed a fabricated BEVFormer v2 | ResNet-101 | 61.7 | 52.8 row (not in the paper), corrected the FB-BEV backbone from ResNet-101 to V2-99, and replaced the fabricated row with the paper's actual Table 2 baselines (SOLOFusion, BEVStereo, BEVDepth).
  • Ablation table (forward vs. backward vs. combined): replaced fabricated test-set numbers (58.1/47.9, 59.3/49.5, 62.4/54.2) with actual val-set numbers from paper Table 1 (R50, no temporal, no depth).
  • "Effect of 3D Pre-training" table: replaced three-row fabricated pre-training ablation (59.8/50.1, 61.2/52.4, 62.4/54.2) with actual val-set depth supervision comparison from Table 1 (47.9/35.0 → 49.8/37.8). Status set to audited-corrected.
  • 6 other papers verified clean against their paper sources: DriveTransformer (all Bench2Drive + nuScenes numbers confirmed), DriveVLM (Qwen-VL 9.6B, 410ms, nuScenes numbers confirmed), DrivingGaussian (PSNR/SSIM/LPIPS/rendering-time figures confirmed), Driving-with-LLMs (all perception/action/QA table numbers confirmed), ECoT (key claims confirmed, though the per-category breakdown numbers in the results table may be inaccurate summaries), Embodiment Scaling Laws (GENBOT-1K counts, compute stats, CoRL venue all confirmed).
  • Report written to .grounding/reports/batch9.md.
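  • The number-level cross-check behind these audits is mechanical enough to sketch. A minimal illustration in Python (not the actual tooling; the paths, and treating every numeric table cell as a verbatim match target, are assumptions):

```python
# Sketch of the audit cross-check (illustrative): collect every number in a
# wiki page's markdown tables and flag values that never appear in the OCR'd
# paper text. Flagged values still need manual verification against the PDF.
import re
from pathlib import Path

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def table_numbers(wiki_md: str) -> set[str]:
    """Numeric cells from markdown table rows (lines starting with '|')."""
    nums: set[str] = set()
    for line in wiki_md.splitlines():
        if line.lstrip().startswith("|"):
            nums.update(NUM.findall(line))
    return nums

def audit(wiki_path: str, ocr_path: str) -> list[str]:
    wiki_nums = table_numbers(Path(wiki_path).read_text())
    ocr_nums = set(NUM.findall(Path(ocr_path).read_text()))
    return sorted(wiki_nums - ocr_nums)  # candidates for fabricated values

if __name__ == "__main__":
    # Hypothetical OCR dump path; the wiki slug is the fb-bev page audited above.
    print(audit(
        "wiki/sources/papers/fb-bev-bev-representation-from-forward-backward-view-transformations.md",
        ".grounding/ocr/fb-bev.txt",
    ))
```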

[2026-04-16] audit | Batch 14 factuality audit — 7 papers

  • Audited 7 paper wiki pages against source PDFs (Mistral OCR): learning-lane-graph-representations-for-motion-forecasting, learning-transferable-visual-models-from-natural-language-supervision, lift-splat-shoot-encoding-images-from-arbitrary-camera-rigs-by-implicitly-unprojecting-to-3d, llama-2-open-foundation-and-fine-tuned-chat-models, llarva-vision-action-instruction-tuning-enhances-robot-learning, llms-cant-plan-but-can-help-planning-in-llm-modulo-frameworks, lmdrive-closed-loop-end-to-end-driving-with-large-language-models.
  • Fixed 2 factual errors in learning-transferable-visual-models-from-natural-language-supervision (CLIP):
  • Architecture section incorrectly named ViT-L/14 as the best-performing model. Corrected to ViT-L/14@336px, which the paper explicitly states is the canonical "CLIP" result (Section 2.5: "all results reported in this paper as 'CLIP' use this model which we found to perform best").
  • Results table row label corrected from "Zero-shot CLIP (ViT-L/14)" to "Zero-shot CLIP (ViT-L/14@336px)".
  • Note: both ViT-L/14 and ViT-L/14@336px achieve 76.2% on ImageNet (Table 11), so the 76.2% figure itself is correct. Status set to audited-fixed.
  • 6 other papers verified clean: LaneGCN (K=6 minADE=0.87m, minFDE=1.36m confirmed), LSS (nuScenes IoU table confirmed), Llama 2 (all benchmark numbers confirmed including previously-fixed GQA scope and LLaMA 1 65B TriviaQA=84.5, NQ=31.0), LLARVA (43.3%, +17.5%, +15% confirmed), LLMs Can't Plan (12%, 82%/70% case study results confirmed), LMDrive (all ablation scores confirmed).
  • Report written to .grounding/reports/batch14.md.

[2026-04-16] audit | Batch 5 factuality audit — 7 papers

  • Audited 7 paper wiki pages against source PDFs: diffusion-models-beat-gans-on-image-synthesis, dita-scaling-diffusion-transformer-for-generalist-vla-policy, emerging-properties-in-self-supervised-vision-transformers, end-to-end-driving-via-conditional-imitation-learning, end-to-end-learning-for-self-driving-cars, exploring-simple-siamese-representation-learning, fast-efficient-action-tokenization-for-vision-language-action-models.
  • Fixed 5 factual errors across 3 papers:
  • diffusion-models-beat-gans: BigGAN-deep 128x128 precision corrected 0.87→0.86, recall corrected 0.28→0.35; ADM-G 256x256 precision corrected 0.83→0.82, recall corrected 0.53→0.52; ADM-G+upsampling 512x512 precision corrected 0.87→0.84, recall corrected 0.42→0.53. (Source: paper Tables 5 and 6.)
  • emerging-properties-in-self-supervised-vision-transformers: Results bullet incorrectly stated "+3.5% over supervised ViT-S/16" — corrected to "+3.5% over best competing SSL methods (BYOL, MoCo v2, SwAV) on ViT-S/16". (Source: paper text, "DINO outperforms BYOL, MoCov2 and SwAV by +3.5%".)
  • exploring-simple-siamese-representation-learning: SimSiam 200-epoch ImageNet top-1 corrected 70.8→70.0. (Source: paper Table 4; 70.8 is the 400-epoch result.)
  • DITA, CIL (Codevilla), DAVE-2 (Bojarski/NVIDIA), and FAST all verified clean with no factual errors.
  • Report written to .grounding/reports/batch5.md.

[2026-04-11] audit | Targeted factuality audit — AlexNet + Hinton-van-Camp-1993

  • Audited imagenet-classification-with-deep-convolutional-neural-networks (AlexNet, NeurIPS 2012) against the NeurIPS proceedings PDF and official ILSVRC 2012 results page.
  • Fixed three errors: (1) Overview claimed single-model ILSVRC 2012 top-5 error was 18.9% — corrected to 15.3% ensemble / 16.4% single-model; 18.9% and 39.7% are ILSVRC-2010 results reported in the paper, not 2012 competition results. (2) Results bullet stated ensemble achieves 15.4% top-5 — corrected to 15.3% (official: 0.15315). (3) Overview margin-of-victory description updated to match corrected figures. Status set to audited-fixed.
  • Audited keeping-neural-networks-simple-by-minimizing-the-description-length-of-the-weights (Hinton & van Camp, COLT 1993) against Hinton's publication page and Semantic Scholar. Title, authors, year, and venue all verified correct. Primary PDF source was unreadable binary; relied on Hinton's own paper list and API metadata. Status set to audited-clean.

[2026-04-11] audit | Targeted factuality audit — transfuser + vad

  • Audited transfuser (2205.15997) and vad (2303.12077) against arXiv and AlphaXiv ground truth.
  • Fixed transfuser: auxiliary tasks listed "3D object detection" but the paper uses 2D vehicle detection (bounding boxes), not full 3D detection. Corrected in Key Contributions and ASCII diagram. Status set to audited-fixed.
  • Fixed vad: Overview incorrectly stated that VAD "directly influenced subsequent work like UniAD." UniAD was published at CVPR 2023 and is the prior state-of-the-art that VAD explicitly improves upon; VAD appeared at ICCV 2023. Corrected the direction of influence. Status set to audited-fixed.

[2026-04-11] audit | Targeted factuality audit — simlingo + talk2car

  • Audited simlingo and talk2car against arXiv and AlphaXiv ground truth.
  • Fixed one hard numerical error in simlingo: Action Dreaming success rates were reported as "28.22 to 72.96" but the paper (Table 5) gives baseline 24.52% and SimLingo 81.13%. Status set to audited-fixed.
  • talk2car checked out clean on all factual claims (dataset size 11,959 / 850 scenes, venue EMNLP-IJCNLP 2019, authors, AP50 metric). Status set to audited-clean.

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260413)

  • Audited another deterministic random sample of 10 paper pages against the primary papers, excluding the two earlier random batches to maximize new coverage.
  • Downgraded simlingo-vision-only-closed-loop-autonomous-driving-with-language-action-alignment to audited-needs-tightening after removing an overstated claim about how much Action Dreaming improves closed-loop driving.
  • The other 9 sampled pages remained materially faithful on review: alpamayo-r1, gaussianocc, drivetransformer, voxposer, senna, unisim, variational-lossy-autoencoder, emerging-properties-in-self-supervised-vision-transformers, and momad.
  • Corpus totals after the pass: 185 solid, 9 needs-tightening, 3 needs-correction, 0 unchecked.
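  • Each random pass draws with a fixed seed over the sorted slug list, so the sample is reproducible. A sketch of the draw (the exclusion set would come from wiki/queries/paper-fact-check-tracker.md; helper names are illustrative):

```python
# Sketch of the deterministic sampling used for these fact-check passes.
# Sorting before seeding keeps the draw reproducible across machines/runs.
import random
from pathlib import Path

def sample_pages(seed: int, k: int = 10,
                 exclude: frozenset[str] = frozenset()) -> list[str]:
    pages = sorted(p.stem for p in Path("wiki/sources/papers").glob("*.md"))
    eligible = [p for p in pages if p not in exclude]
    return random.Random(seed).sample(eligible, k)

# e.g. the pass above: sample_pages(20260413, k=10, exclude=earlier_batches)
```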

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260412)

  • Audited a second deterministic random sample of 10 paper pages against the primary papers.
  • Downgraded multi-scale-context-aggregation-by-dilated-convolutions to audited-needs-tightening after correcting the context-module description, and downgraded occgen-generative-multi-modal-3d-occupancy-prediction-for-autonomous-driving to audited-needs-correction after fixing swapped camera-only vs. LiDAR-only benchmark numbers.
  • Tightened lift-splat-shoot-encoding-images-from-arbitrary-camera-rigs-by-implicitly-unprojecting-to-3d while keeping it audited-needs-correction, removing unsupported transfer/runtime overclaims.
  • The other 7 sampled pages remained materially faithful on review; corpus totals now stand at 186 solid, 8 needs-tightening, 3 needs-correction, 0 unchecked.

[2026-04-11] audit | Remaining unchecked source pages

  • Audited the final 14 pages still marked paper-faithfullness: unchecked against their primary sources.
  • Marked 10 of those pages audited-solid and 4 audited-needs-tightening, with wording tightened on the course/blog-style entries cs231n, the-first-law-of-complexodynamics, the-unreasonable-effectiveness-of-recurrent-neural-networks, and understanding-lstm-networks.
  • Normalized 110 legacy audited-clean / audited-fixed labels to audited-solid so the corpus uses a single status legend.
  • Current corpus totals after the pass: 188 solid, 7 needs-tightening, 2 needs-correction, 0 unchecked.
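  • The label normalization itself is a one-line frontmatter rewrite per page; a sketch, assuming the status sits on a paper-faithfullness: line as the entries above use:

```python
# Sketch of the legacy-label normalization (illustrative, not the exact script):
# rewrite audited-clean / audited-fixed to audited-solid so the corpus uses a
# single status legend.
import re
from pathlib import Path

LEGACY = re.compile(r"^(paper-faithfullness:\s*)audited-(?:clean|fixed)\s*$", re.M)

changed = 0
for page in Path("wiki/sources/papers").glob("*.md"):
    text = page.read_text()
    new, n = LEGACY.subn(r"\1audited-solid", text)
    if n:
        page.write_text(new)
        changed += n
print(f"normalized {changed} pages")  # the pass above rewrote 110
```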

[2026-04-11] audit | Full paper-corpus metadata validation

  • Validated all wiki/sources/papers/ entries at the source-identity level against primary records: 197 total pages, 187 arXiv-backed entries, 10 non-arXiv entries.
  • Fixed three broken source references: solve-synergy-of-language-vision-and-end-to-end-networks-for-autonomous-driving, simlingo-vision-only-closed-loop-autonomous-driving-with-language-action-alignment, and para-drive-parallelized-architecture-for-real-time-autonomous-driving.
  • Recorded the metadata-validation outcome in wiki/queries/paper-fact-check-tracker.md; the follow-up source-faithfulness pass and status normalization are logged above.
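  • For the arXiv-backed entries, the source-identity check reduces to comparing page frontmatter against the primary record. A sketch of the arXiv side (the export API endpoint is real; the simple title-equality test is an assumption about the actual check):

```python
# Sketch of the arXiv identity check (illustrative): fetch the Atom record for
# a page's arxiv_id and compare titles; a mismatch flags a broken source ref.
import re
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_title(arxiv_id: str) -> str:
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    title = root.find(f"{ATOM}entry/{ATOM}title").text
    return re.sub(r"\s+", " ", title).strip()

def titles_match(page_title: str, arxiv_id: str) -> bool:
    return page_title.casefold().strip() == arxiv_title(arxiv_id).casefold()

# e.g. titles_match(frontmatter_title, "2205.15997") for the transfuser page
```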

[2026-04-11] audit | Random 10-paper fact-check sample

  • Audited a deterministic random sample of 10 paper pages against the original papers and updated paper-faithfullness on all 10 to audited-solid.
  • Corrected hard factual issues in 5 pages: carla-an-open-urban-driving-simulator, surroundocc-multi-camera-3d-occupancy-prediction-for-autonomous-driving, self-improving-embodied-foundation-models, bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding, and drivedreamer-towards-real-world-driven-world-models.
  • Most issues were benchmark-value mixups, loss-function misstatements, incorrect venue/training metadata, or unsupported scope claims.
  • Recorded the batch outcome in wiki/queries/paper-fact-check-tracker.md for future audit coverage.

[2026-04-06] ingest | Gemma 3 Technical Report

  • Added paper wiki page: wiki/sources/papers/gemma-3-technical-report.md
  • Updated: wiki/sources/llm-seminal-papers.md (new open-weight multimodal section + ingested individually list), wiki/concepts/foundation-models.md (LLM section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~1120 (user-provided)
  • Tags: transformer, language-modeling, multimodal, foundation-model, vision-language-model, knowledge-distillation, mixture-of-experts, scaling, multilingual

[2026-04-06] ingest | Scaling Instruction-Finetuned Language Models (Flan-PaLM / Flan-T5)

  • Added paper wiki page: wiki/sources/papers/scaling-instruction-finetuned-language-models.md
  • Updated: wiki/sources/llm-seminal-papers.md (instruction tuning section + ingested individually list), wiki/concepts/foundation-models.md (new instruction tuning subsection)
  • Citations: ~3987 (user-provided)
  • Tags: nlp, transformer, instruction-tuning, chain-of-thought, foundation-model, language-modeling, scaling, multi-task

[2026-04-06] ingest | Qwen3 Technical Report

  • Added paper wiki page: wiki/sources/papers/qwen3-technical-report.md
  • Updated: wiki/sources/llm-seminal-papers.md (core architecture list + ingested individually list), wiki/concepts/foundation-models.md (LLM section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~3706 (user-provided)
  • Tags: nlp, language-modeling, transformer, mixture-of-experts, foundation-model, reasoning, multilingual, reinforcement-learning

[2026-04-06] ingest | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • Added paper wiki page: wiki/sources/papers/deepseek-r1-incentivizing-reasoning-capability-in-llms-via-reinforcement-learning.md
  • Updated: wiki/sources/llm-seminal-papers.md (added "Reasoning via reinforcement learning" section + ingested individually list), wiki/concepts/machine-learning.md (reasoning section + key papers table), wiki/concepts/foundation-models.md (chain-of-thought section + key papers table), wiki/queries/open-questions.md (Q8 partial answer on GRPO for driving), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~1920 (user-provided)
  • Tags: nlp, reinforcement-learning, language-modeling, reasoning, chain-of-thought, foundation-model, transformer, alignment

[2026-04-06] ingest | Tree of Thoughts: Deliberate Problem Solving with Large Language Models

  • Added paper wiki page: wiki/sources/papers/tree-of-thoughts-deliberate-problem-solving-with-large-language-models.md
  • Updated: wiki/sources/llm-seminal-papers.md (added "Reasoning and search" section + ingested individually list)
  • Citations: ~3561 (user-provided)
  • Tags: nlp, reasoning, language-modeling, chain-of-thought, search, foundation-model, prompting

[2026-04-06] ingest | Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

  • Added paper wiki page: wiki/sources/papers/gemini-25-pushing-the-frontier-with-advanced-reasoning-multimodality-long-context-and-next-generation-agentic-capabilities.md
  • Updated: wiki/sources/llm-seminal-papers.md (multimodal bridge section + ingested individually list), wiki/concepts/foundation-models.md (transformer and scaling section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: ~1943 (user-provided)
  • Tags: nlp, multimodal, foundation-model, transformer, mixture-of-experts, language-modeling, chain-of-thought, reasoning, agentic

[2026-04-06] ingest | On the Opportunities and Risks of Foundation Models

  • Added paper wiki page: wiki/sources/papers/on-the-opportunities-and-risks-of-foundation-models.md
  • Updated: wiki/concepts/foundation-models.md (new "Defining the paradigm" section + key papers table), wiki/concepts/machine-learning.md (new "Foundation model paradigm" section + key papers table), wiki/sources/llm-seminal-papers.md (new "Foundational surveys and frameworks" section)
  • Citations: ~6057 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: foundation-model, nlp, computer-vision, robotics, multimodal, transformer, survey

[2026-04-06] ingest | BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

  • Added paper wiki page: wiki/sources/papers/bevnext-reviving-dense-bev-frameworks-for-3d-object-detection.md
  • Updated: wiki/concepts/perception.md (BEV revolution section + key papers table), wiki/sources/autonomous-driving-seminal-papers.md (perception seed list)
  • Citations: ~80 (user-provided; Semantic Scholar and AlphaXiv overview unavailable)
  • Tags: autonomous-driving, perception, bev, transformer, computer-vision, 3d-object-detection, cnn, depth-estimation

[2026-04-06] ingest | Emerging Properties in Self-Supervised Vision Transformers (DINO)

  • Added paper wiki page: wiki/sources/papers/emerging-properties-in-self-supervised-vision-transformers.md
  • Updated: wiki/concepts/foundation-models.md (vision-language models section), wiki/sources/llm-seminal-papers.md (multimodal bridge list)
  • Citations: ~10798 (user-provided)
  • Tags: computer-vision, self-supervised-learning, transformer, vision-transformer, knowledge-distillation, image-classification, foundation-model

[2026-04-06] ingest | YOLOv10: Real-Time End-to-End Object Detection

  • Added paper wiki page: wiki/sources/papers/yolov10-real-time-end-to-end-object-detection.md
  • Updated: wiki/concepts/perception.md (key papers table)
  • Citations: ~5988 (user-provided)
  • Tags: computer-vision, object-detection, cnn, end-to-end, real-time, perception

[2026-04-06] ingest | Learning Transferable Visual Models From Natural Language Supervision (CLIP)

  • Updated paper wiki page from seed to active: wiki/sources/papers/learning-transferable-visual-models-from-natural-language-supervision.md
  • Updated: frontmatter (venue ICML 2021, citations 57987, proper tags, arxiv_id), added results comparison table, added linear probe figure, expanded Connections with descriptive annotations
  • Already cross-referenced in: wiki/concepts/foundation-models.md, wiki/concepts/machine-learning.md, wiki/concepts/vision-language-action.md, wiki/sources/llm-seminal-papers.md
  • Citations: 57987 (user-provided)
  • Tags: computer-vision, multimodal, foundation-model, transformer, cnn, image-classification, nlp

[2026-04-06] update | Training Compute-Optimal Large Language Models (Chinchilla)

  • Updated existing paper wiki page: wiki/sources/papers/training-compute-optimal-large-language-models.md
  • Updated frontmatter: type source-summary -> paper, citations 2973 -> 4116, added arxiv_id, updated tags
  • Enriched connections with GPT-3 link and descriptive annotations
  • Citations: ~4116 (user-provided)
  • Tags: nlp, language-modeling, transformer, foundation-model, scaling

[2026-04-05] ingest | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

  • Added paper wiki page: wiki/sources/papers/voxposer-composable-3d-value-maps-for-robotic-manipulation-with-language-models.md
  • Updated: wiki/sources/vla-and-driving.md (general VLA section), wiki/concepts/robotics.md (key papers table), wiki/queries/open-questions.md (Q4/Q5 partial answers)
  • Citations: ~450 (Semantic Scholar API was unavailable; approximate count used)
  • Tags: robotics, manipulation, language-modeling, multimodal, planning, zero-shot
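  • The "Citations:" lines throughout this log come from one Semantic Scholar Graph API call per paper, with a user-provided figure as fallback when the fetch fails. A sketch (the endpoint and citationCount field are real; the fallback convention is this log's, not the API's):

```python
# Sketch of the citation lookup (illustrative). When the Graph API is
# unreachable, the log records a user-provided approximate count instead.
import json
import urllib.request

def citation_count(arxiv_id: str, fallback: int | None = None) -> int | None:
    url = (f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}"
           "?fields=citationCount")
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["citationCount"]
    except Exception:
        return fallback  # logged as e.g. "~450 (Semantic Scholar API was unavailable)"
```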

[2026-04-05] scaffold | Initial vault bootstrap

  • Created the initial wiki structure for ML, autonomous driving, robotics, VLA, e2e systems, perception, prediction, planning, and foundation models.
  • Added AGENTS.md to define ingest, query, and lint workflows for the LLM maintainer.
  • Added seed source-program pages for canonical paper collection and future ingest.
  • Added a Flask frontend scaffold targetable to Railway for hosted browsing.

[2026-04-05] ingest | Initial corpus batch 01

  • Added 27 source-summary pages under wiki/sources/papers/.
  • Seeded the first real corpus across autonomous driving, robotics/VLA, and foundation-model papers.
  • Added wiki/sources/initial-corpus-batch-01.md to group the batch and make it navigable.
  • Updated the top-level index so the new corpus is discoverable from the vault entry point.

[2026-04-05] ingest | Ilya Top 30 corpus

  • Ingested all 30 papers from Ilya Sutskever's canonical reading list as full source-summary pages.
  • Updated wiki/sources/ilya-top-30.md from placeholder to canonical list with thematic clusters and wiki links.
  • Papers span architectures (Transformer, ResNet, AlexNet, ViT), sequence modeling (RNN, LSTM, Pointer Networks), information theory (MDL, Kolmogorov complexity), complexity theory, scaling laws, diffusion models, and chain-of-thought prompting.

[2026-04-05] ingest | AutoVLA corpus (batch 02)

  • Ingested 18 driving VLA papers (2018–2025) from the AutoVLA analysis corpus.
  • Updated 3 existing papers (CIL, DriveLM, LMDrive) from seed to complete status with rich technical content.
  • Created 15 new paper pages: BDD-X, Talk2Car, GPT-Driver, DriveGPT4, VLP, Reason2Drive, SimLingo, ORION, EMMA, DriveMLM, Alpamayo-R1, Senna, WoTE, AlphaDrive, DriveMoE.
  • Updated wiki/sources/vla-and-driving.md with three-wave taxonomy and design axes table.
  • Updated wiki/sources/autonomous-driving-seminal-papers.md with batch 02 entries.

[2026-04-06] ingest | Batch 03 -- robotics VLA, world models, driving transformers

  • Added 5 new paper pages:
  • GR00T N1 (2503.14734) -- open dual-system VLA for humanoid robots, 602 citations
  • Gemini Robotics (2503.20020) -- Gemini 2.0 VLA for physical manipulation, cloud-local hybrid
  • Cosmos (2501.03575) -- world foundation model platform for physical AI, 515 citations
  • AutoVLA (2506.13757) -- adaptive dual-process reasoning VLA for driving with RL, 110 citations
  • DriveTransformer (2503.07656) -- parallel-task sparse transformer for E2E driving, ICLR 2025, 91 citations
  • Updated wiki/sources/vla-and-driving.md with batch 03 entries and Wave 3 additions (AutoVLA, DriveTransformer)
  • Updated wiki/concepts/robotics.md with GR00T N1, Gemini Robotics, and Cosmos sections and key papers table
  • Updated wiki/concepts/foundation-models.md with world foundation models and robotics foundation models sections
  • Updated wiki/concepts/autonomous-driving.md with AutoVLA and DriveTransformer in Era 3 and key papers table
  • Updated wiki/concepts/vision-language-action.md with AutoVLA adaptive reasoning in Wave 3
  • Updated wiki/concepts/end-to-end-architectures.md with AutoVLA and DriveTransformer architectural variants and key papers
  • Updated wiki/concepts/planning.md with AutoVLA and DriveTransformer in key papers table

[2026-04-05] ingest | Batch 04 (self-supervised driving, temporal E2E, BEV, world models, embodied RL)

  • Ingested 5 papers: S4-Driver, BridgeAD, Self-Improving Embodied Foundation Models, GaussianLSS, Drive-OccWorld.
  • S4-Driver (CVPR 2025, 16 cites): self-supervised MLLM for annotation-free driving, achieves 0.31m L2 on nuScenes beating supervised methods.
  • BridgeAD (CVPR 2025, 31 cites): multi-step temporal queries for history-enhanced E2E driving, 19% L2 improvement over UniAD, strong closed-loop safety.
  • Self-Improving EFM (arXiv 2025, 18 cites): Google DeepMind, steps-to-go reward enables autonomous robot self-improvement, 10% data + 1% RL beats 80% imitation.
  • GaussianLSS (CVPR 2025, 18 cites): depth uncertainty + Gaussian Splatting for BEV perception, within 0.4% of SOTA at 2.5x speed and 3.8x less memory.
  • Drive-OccWorld (AAAI 2025, 49 cites): 4D occupancy world model for planning, 33% L2 reduction at 1s vs UniAD, action-controllable neural simulation.
  • Updated wiki/sources/vla-and-driving.md with Batch 04 entries.
  • Updated concept pages: perception, planning, prediction, robotics, autonomous-driving, foundation-models, end-to-end-architectures.

[2026-04-05] ingest | Batch 05 (VLA, world models, momentum planning, distillation)

  • Ingested 5 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • OpenDriveVLA (2503.23463, AAAI 2026, 109 cites) -- open-source VLA with hierarchical 3D scene queries, 0.33m L2 at 0.5B-7B scale
  • HERMES (2501.14729, arXiv 2025, 38 cites) -- unified world model for simultaneous 3D scene understanding and future generation via world queries
  • MomAD (2503.03125, CVPR 2025, 60 cites) -- momentum-aware planning for temporal consistency, 0.60m L2, +16.3% closed-loop success vs VAD
  • GaussianWorld (2412.10373, CVPR 2025, 59 cites) -- 3D Gaussian world model for streaming occupancy prediction, +2% mIoU without inference overhead
  • DiMA (2501.09757, CVPR 2025, 34 cites) -- distill MLLM reasoning into vision planner, 80% collision reduction, LLM discarded at inference
  • Updated wiki/sources/vla-and-driving.md with Wave 3 additions and Batch 05 ingested papers list
  • Updated wiki/concepts/autonomous-driving.md with Era 3 additions and key papers table
  • Updated wiki/concepts/planning.md with MomAD, OpenDriveVLA, DiMA in key papers table
  • Updated wiki/concepts/perception.md with GaussianWorld and HERMES in occupancy section and key papers table
  • Updated wiki/concepts/prediction.md with MomAD temporal consistency section and key papers table
  • Updated wiki/concepts/end-to-end-architectures.md with VLA variants, open problems, and key papers table
  • Updated wiki/concepts/vision-language-action.md with OpenDriveVLA and DiMA in Wave 3

[2026-04-05] ingest | Batch 06 (diffusion/flow planning, scaling laws, RL planning, VLM-E2E synergy, robotics VLA/diffusion)

  • Ingested 8 papers spanning generative planning, scaling laws, RL-based planning, VLM-E2E synergy, embodied CoT, and diffusion robotics:
  • DiffusionDrive (2411.15139, CVPR 2025 Highlight) -- truncated diffusion for E2E driving, 88.1 PDMS on NAVSIM, 2 denoising steps at 45 FPS
  • DriveGPT (2412.14415, ICML 2025, Waymo) -- first scaling laws for driving behavior models, 1.1B params, 100M+ demonstrations
  • GoalFlow (2503.05689, CVPR 2025) -- goal-driven flow matching, 90.3 PDMS on NAVSIM with single-step inference
  • LAW (2406.08481, ICLR 2025) -- self-supervised latent world model, SOTA on nuScenes+NAVSIM+CARLA
  • CarPlanner (2502.19908, CVPR 2025) -- first RL planner to beat IL+rule-based on nuPlan, consistency-regularized autoregressive
  • SOLVE (2505.16805, CVPR 2025) -- Sequential Q-Former + Trajectory CoT for VLM-E2E synergy
  • ECoT (2407.08693, Stanford/Berkeley 2025) -- embodied Chain-of-Thought for VLAs, +28% generalization on OpenVLA
  • RDT-1B (2410.07864, ICLR 2025, Tsinghua) -- largest diffusion transformer for bimanual manipulation, 1.2B params
  • Updated wiki/sources/vla-and-driving.md with Wave 3 additions (6 driving papers) and Batch 06 ingested papers list
  • Key themes: generative trajectory planning (diffusion vs. flow matching), scaling laws for driving, RL surpassing IL, CoT reasoning for embodied agents, bimanual diffusion transformers

[2026-04-05] ingest | Batch 07 (cross-embodiment robotics VLA + 3D occupancy perception)

  • Ingested 8 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • UniAct (2501.10105, CVPR 2025, 60 cites) -- universal action space via VQ codebooks for cross-embodiment VLA, 0.5B beats 14x larger models
  • Dita (2503.19757, ICCV 2025, 54 cites) -- DiT-based VLA with in-context diffusion conditioning, 10-shot real-world adaptation, 334M params
  • Embodiment Scaling Laws (2505.05753, CoRL 2025, 10 cites) -- first power-law scaling for embodiment diversity across ~1000 robots, zero-shot sim-to-real
  • SmolVLA (2506.01844, arXiv 2025, 224 cites) -- 450M VLA from Hugging Face competitive with 3.3B models, async inference, single-GPU training
  • GaussianFormer-2 (2412.04384, CVPR 2025, 57 cites) -- probabilistic Gaussian superposition, 8.9% of Gaussians needed, 51% memory savings
  • OccMamba (2408.09859, CVPR 2025, 32 cites) -- first Mamba-based occupancy network, +5.1% IoU, 65% faster inference via linear complexity
  • GaussTR (2412.13193, CVPR 2025, 41 cites) -- self-supervised 3D occupancy via foundation model alignment, zero-shot 12.27 mIoU without 3D annotations
  • BEVDiffuser (2502.19694, CVPR 2025, 14 cites) -- training-only diffusion for BEV denoising, +12.3% mAP, zero inference overhead
  • Updated wiki/sources/vla-and-driving.md with UniAct, Dita, SmolVLA in general VLA list and Batch 07 ingested papers
  • Updated wiki/concepts/robotics.md with cross-embodiment VLA section (UniAct, Dita, SmolVLA, Embodiment Scaling Laws) and key papers table
  • Updated wiki/concepts/perception.md with GaussianFormer-2, OccMamba, GaussTR, BEVDiffuser in occupancy section and key papers table
  • Key themes: universal action representations vs. model scale, diffusion/flow VLA architectures, embodiment as a scaling axis, efficient 3D occupancy (Gaussian/Mamba/self-supervised/diffusion denoising)

[2026-04-05] ingest | Batch 08 (Physical Intelligence VLA family + robotics VLA advances)

  • Ingested 8 papers spanning the Physical Intelligence pi0 family, VLA training methodology, action tokenization, spatial reasoning, and dexterous manipulation:
  • pi0 (2410.24164, arXiv 2024, 1381 cites) -- flow matching VLA on PaliGemma 3B, 7 platforms, 68 tasks. The reference VLA from Physical Intelligence.
  • pi0.5 (2504.16054, CoRL 2025, 681 cites) -- hierarchical VLA with five-source co-training, first to do 10-15 min tasks in unseen real homes
  • pi0.6 (2511.14759, arXiv 2025, 93 cites) -- RECAP offline RL for VLA self-improvement, doubled task throughput, halved failure rates
  • FAST (2501.09747, RSS 2025, 353 cites) -- DCT+BPE action tokenizer for VLAs, 2x-13x compression, 5x faster training
  • OpenVLA-OFT (2502.19645, arXiv 2025, 364 cites) -- parallel decoding fine-tuning recipe, 76.5% to 97.1% on LIBERO, 26x inference speedup
  • SpatialVLA (2501.15830, arXiv 2025, 292 cites) -- Ego3D position encoding + adaptive action grids, 1.1M real episodes, 73% spatial accuracy
  • DexVLA (2502.05855, CoRL 2025, 140 cites) -- 2B VLM + 1B diffusion expert, 0.92 success on shirt folding, three-stage embodied curriculum
  • Knowledge Insulation (2505.23705, NeurIPS 2025 Spotlight, 68 cites) -- stop-gradient + co-training prevents VLM degradation during VLA training, 7.5x faster convergence
  • Updated wiki/sources/vla-and-driving.md with General VLA foundations entries and Batch 08 ingested papers list
  • Key themes: flow matching vs. diffusion for action generation, action tokenization (FAST) vs. continuous (pi0), VLA self-improvement via RL (pi0.6), knowledge preservation during fine-tuning (insulation), spatial reasoning (SpatialVLA), scaling action experts (DexVLA)

[2026-04-05] ingest | Batch 09 (world models, parallel E2E, generative driving, evaluation, LLM-for-driving)

  • Ingested 5 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • DriveDreamer (2309.09777, ECCV 2024, ~452 cites) -- first real-world-driven world model for driving, diffusion-based Auto-DM with two-stage training, 0.29m L2, 21% collision reduction
  • PARA-Drive (CVPR 2024, NVIDIA, ~179 cites) -- systematic design space exploration of modular E2E stacks, fully parallel architecture with implicit BEV communication, 2-3x speedup
  • GenAD (2402.11502, ECCV 2024, ~189 cites) -- E2E driving as generative modeling, VAE trajectory prior + instance-centric scene representation, 0.91m L2, 0.43% collision rate SOTA
  • Is Ego Status All You Need? (2312.03031, CVPR 2024, NVIDIA/Nanjing, ~199 cites) -- exposes that simple Ego-MLP matches complex E2E models on nuScenes open-loop, proposes Curb Collision Rate metric
  • Driving with LLMs (2310.01957, ICRA 2024, Wayve, ~328 cites) -- first concrete LLM-for-driving with object-level vector modality, LLaMA-7B + LoRA, explainable decisions
  • Updated wiki/sources/vla-and-driving.md with Wave 2 additions and Batch 09 ingested papers list
  • Key themes: world models for driving (DriveDreamer), parallel vs. sequential E2E design (PARA-Drive), generative trajectory modeling (GenAD), evaluation methodology critique (Ego Status), LLM integration for explainability (Driving with LLMs)

[2026-04-05] ingest | Batch 10 (orchestration, cross-embodiment, async planning, Gaussian representations, occupancy world models)

  • Ingested 6 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • AutoRT (2401.12963, arXiv 2024, Google DeepMind, 110 cites) -- foundation model orchestration for large-scale robot data collection, 77K episodes, 53 robots, Robot Constitution for safety
  • HPT (2409.20537, NeurIPS 2024, 134 cites) -- stem-trunk-head architecture for cross-embodiment scaling, first robotics scaling laws across data/diversity/model size/compute, 10-30% sim gains, 20%+ real gains
  • AsyncDriver (2406.14556, ECCV 2024, 41 cites) -- asynchronous LLM-planner decoupling, Llama2-13B guidance at sparse intervals, ~40% cost reduction with ~1% accuracy loss on nuPlan
  • GaussianFormer (2405.17429, ECCV 2024, 128 cites) -- sparse 3D semantic Gaussian occupancy representation, 5-6x memory reduction vs dense methods with ~2% mIoU trade-off
  • DrivingGaussian (2312.07920, CVPR 2024, 398 cites) -- composite Gaussian splatting for dynamic driving scenes, IS3G + CDGG, 28.74 PSNR on nuScenes, LiDAR-prior integration
  • OccWorld (2311.16038, ECCV 2024, 198 cites) -- original 3D occupancy world model, VQ-VAE tokenization + GPT-like spatial-temporal transformer, joint scene-ego forecasting, competitive with UniAD sans HD maps
  • Updated wiki/sources/vla-and-driving.md with Batch 10 ingested papers list
  • Updated wiki/concepts/robotics.md with AutoRT/HPT data collection and cross-embodiment sections + key papers
  • Updated wiki/concepts/perception.md with GaussianFormer, DrivingGaussian, OccWorld in occupancy section + key papers
  • Updated wiki/concepts/planning.md with AsyncDriver and OccWorld in key papers
  • Updated wiki/concepts/autonomous-driving.md with all 4 driving papers in key papers table
  • Updated wiki/concepts/foundation-models.md with AutoRT and HPT in key papers table
  • Key themes: foundation models as orchestrators (not just controllers), robotics scaling laws, asynchronous LLM integration for real-time planning, Gaussian representations as efficient alternative to dense voxels, occupancy-based world models

[2026-04-05] ingest | Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

  • Added paper wiki page: wiki/sources/papers/think-twice-before-driving-towards-scalable-decoders-for-end-to-end-autonomous-driving.md
  • Updated: wiki/concepts/planning.md, wiki/concepts/end-to-end-architectures.md, wiki/sources/autonomous-driving-seminal-papers.md
  • Citations: unavailable (Semantic Scholar fetch failed)
  • Tags: autonomous-driving, end-to-end, planning, imitation-learning, transformer, perception

[2026-04-05] synthesis | Wiki-wide updates from new corpus

  • Updated wiki/concepts/vision-language-action.md from seed to active with three-wave analysis, design axes, and emerging consensus.
  • Updated wiki/syntheses/research-thesis.md with AutoVLA evidence (supporting, refining, and partially challenging the thesis). Confidence raised to medium.
  • Updated wiki/queries/open-questions.md with 8 new questions from AutoVLA analysis and partial answers.
  • Updated wiki/taxonomies/research-map.md with source program table, routing guide, and VLA sub-taxonomy.
  • Updated index.md descriptions and log.md with ingest records.

[2026-04-05] ingest | Batch 11 (Gaussian occupancy cluster, radar fusion, sparse E2E, pseudo-simulation, robotics VLA)

  • Ingested 8 papers with AlphaXiv overviews and Semantic Scholar citation data:
  • GaussianOcc (2408.11447, ICCV 2025, 47 cites) -- fully self-supervised 3D occupancy via Gaussian splatting (no GT pose), 2.7x faster training, 5x faster rendering
  • GaussianFlowOcc (2502.17288, ICCV 2025, 19 cites) -- sparse Gaussian occupancy + temporal flow, 51%+ mIoU improvement, 50x faster inference with 2D pseudo-labels
  • RaCFormer (2412.12725, CVPR 2025, 15 cites) -- radar-camera fusion via query-based dual-view attention + Doppler dynamic catcher, 64.9% mAP surpassing LiDAR-only
  • GaussRender (2502.05040, ICCV 2025, 13 cites) -- plug-and-play Gaussian rendering loss for 3D-2D projective consistency, +3.75 mIoU on TPVFormer, zero inference overhead
  • VPP (2412.14803, ICML 2025 Spotlight, 139 cites) -- video diffusion as predictive visual encoder for robot policies, +18.6% on CALVIN, +31.6% real-world dexterous
  • Helix (Figure AI Technical Report, Feb 2025) -- first whole-body humanoid VLA, System 1+2 dual architecture, 35 DoF at 200Hz, dual-robot coordination
  • NAVSIM v2 (2506.04218, CoRL 2025, 62 cites) -- pseudo-simulation evaluation via 3D Gaussian Splatting, R^2=0.8 with closed-loop, de facto E2E driving benchmark
  • SparseDrive (2405.19620, ICRA 2025, 181 cites) + SparseDriveV2 (2603.29163, 2026) -- fully sparse E2E driving with factorized trajectory vocabulary (262K candidates), 92.0 PDMS NAVSIM SOTA
  • Updated wiki/concepts/perception.md with Gaussian occupancy cluster (GaussianOcc, GaussianFlowOcc, GaussRender), radar-camera fusion (RaCFormer), and key papers table
  • Updated wiki/concepts/autonomous-driving.md with all 8 papers in key papers table
  • Updated wiki/concepts/planning.md with SparseDrive, SparseDriveV2, NAVSIM v2 in key papers table
  • Updated wiki/concepts/end-to-end-architectures.md with SparseDrive, SparseDriveV2, NAVSIM v2 in key papers table
  • Updated wiki/concepts/robotics.md with VPP and Helix in key papers table
  • Updated wiki/concepts/vision-language-action.md with VPP and Helix in robotics VLA frontier section
  • Key themes: Gaussian splatting as unified primitive for occupancy/BEV/simulation, radar replacing LiDAR, scoring-based planning scaling laws, pseudo-simulation bridging open/closed-loop, video diffusion for robot policies, dual-system humanoid VLA

[2026-04-05] ingest | Drive as You Speak

  • Added paper wiki page: wiki/sources/papers/drive-as-you-speak-enabling-human-like-interaction-with-large-language-models-in-autonomous-vehicles.md
  • Updated: wiki/sources/vla-and-driving.md, wiki/queries/open-questions.md
  • Citations: 0 (Semantic Scholar unavailable)
  • Tags: autonomous-driving, llm, planning, nlp, multimodal, human-interaction

[2026-04-05] ingest | Agent-Driver: A Language Agent for Autonomous Driving

  • Added paper wiki page: wiki/sources/papers/a-language-agent-for-autonomous-driving.md
  • Updated: wiki/sources/vla-and-driving.md, wiki/concepts/planning.md, wiki/taxonomies/research-map.md
  • Citations: 140 (Semantic Scholar)
  • Tags: autonomous-driving, llm, planning, reasoning, chain-of-thought, end-to-end

[2026-04-05] ingest | RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

  • Added paper wiki page: wiki/sources/papers/robocat-a-self-improving-generalist-agent-for-robotic-manipulation.md
  • Updated: wiki/concepts/robotics.md, wiki/sources/vla-and-driving.md
  • Citations: 0 (Semantic Scholar fetch failed)
  • Tags: robotics, transformer, imitation-learning, multimodal, foundation-model, multi-embodiment

[2026-04-05] ingest | OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction

  • Added paper wiki page: wiki/sources/papers/occformer-dual-path-transformer-for-vision-based-3d-semantic-occupancy-prediction.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md, wiki/taxonomies/research-map.md
  • Citations: ~280 (Semantic Scholar fetch failed, estimated from known data)
  • Tags: autonomous-driving, perception, transformer, computer-vision, occupancy, 3d-semantic-occupancy, end-to-end

[2026-04-05] ingest | BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

  • Added paper wiki page: wiki/sources/papers/bevformer-v2-adapting-modern-image-backbones-to-birds-eye-view-recognition-via-perspective-supervision.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md, wiki/sources/papers/bevformer-learning-birds-eye-view-representation-from-multi-camera-images-via-spatiotemporal-transformers.md
  • Citations: ~250 (Semantic Scholar fetch failed, estimated)
  • Tags: autonomous-driving, perception, bev, transformer, computer-vision, end-to-end

[2026-04-05] ingest | SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving

  • Added paper wiki page: wiki/sources/papers/surroundocc-multi-camera-3d-occupancy-prediction-for-autonomous-driving.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md
  • Citations: ~350 (Semantic Scholar fetch failed, estimated)
  • Tags: autonomous-driving, perception, occupancy, 3d-reconstruction, computer-vision, multi-camera, cnn

[2026-04-05] ingest | FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin

  • Added paper wiki page: wiki/sources/papers/flashocc-fast-and-memory-efficient-occupancy-prediction-via-channel-to-height-plugin.md
  • Updated: wiki/concepts/perception.md (occupancy section + key papers table)
  • Citations: 0 (Semantic Scholar fetch failed)
  • Tags: autonomous-driving, perception, 3d-occupancy, bev, computer-vision, cnn, efficient-inference

[2026-04-06] update | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

  • Updated existing paper wiki page: wiki/sources/papers/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale.md
  • Updated citations: 60022 → 91128 (user-provided)
  • Updated frontmatter: year to 2021, type to paper, status to active, added arxiv_id field, added foundation-model tag
  • Fixed broken wikilink in wiki/sources/ilya-top-30.md (entry #29 pointed to wrong slug)
  • Expanded Connections section with CLIP and BERT links
  • Tags: ilya-30, vision-transformer, computer-vision, transformer, image-classification, foundation-model

[2026-04-05] ingest | FB-BEV: BEV Representation from Forward-Backward View Transformations

  • Added paper wiki page: wiki/sources/papers/fb-bev-bev-representation-from-forward-backward-view-transformations.md
  • Updated: wiki/concepts/perception.md, wiki/sources/autonomous-driving-seminal-papers.md
  • Citations: ~150 (Semantic Scholar unavailable, estimated)
  • Tags: autonomous-driving, perception, bev, transformer, computer-vision

[2026-04-06] ingest | Diffusion Models Beat GANs on Image Synthesis

  • Added paper wiki page: wiki/sources/papers/diffusion-models-beat-gans-on-image-synthesis.md
  • Updated: wiki/sources/papers/denoising-diffusion-probabilistic-models.md (added connection), wiki/concepts/machine-learning.md (self-supervised section + key papers table)
  • Citations: 13548 (user-provided)
  • Tags: computer-vision, diffusion, generative-models, image-generation, classifier-guidance

[2026-04-06] ingest | Exploring Simple Siamese Representation Learning (SimSiam)

  • Added paper wiki page: wiki/sources/papers/exploring-simple-siamese-representation-learning.md
  • Updated: wiki/concepts/machine-learning.md (self-supervised section + key papers table), wiki/concepts/foundation-models.md (vision-language models section)
  • Citations: 6444 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: computer-vision, self-supervised-learning, representation-learning, siamese-networks, contrastive-learning

[2026-04-06] ingest | Prefix-Tuning: Optimizing Continuous Prompts for Generation

  • Added paper wiki page: wiki/sources/papers/prefix-tuning-optimizing-continuous-prompts-for-generation.md
  • Updated: wiki/sources/llm-seminal-papers.md (added PEFT section + wikilink), wiki/concepts/foundation-models.md (LLM section with PEFT context), wiki/concepts/machine-learning.md (new parameter-efficient adaptation section + key papers table), wiki/taxonomies/research-map.md (LLM seminal papers count)
  • Citations: 6753 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: nlp, transformer, parameter-efficient, language-modeling, fine-tuning

[2026-04-06] ingest | High-Resolution Image Synthesis with Latent Diffusion Models

  • Added paper wiki page: wiki/sources/papers/high-resolution-image-synthesis-with-latent-diffusion-models.md
  • Updated: wiki/concepts/foundation-models.md (diffusion models section + key papers table), wiki/taxonomies/research-map.md (added generative models routing)
  • Citations: 31987 (user-provided; Semantic Scholar fetch unavailable)
  • Tags: diffusion, generative-models, computer-vision, image-generation, foundation-model, transformer

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260414)

  • Audited another non-overlapping deterministic sample of 10 paper summaries against the original papers.
  • Downgraded pi0-a-vision-language-action-flow-model-for-general-robot-control to audited-needs-tightening after removing an unsupported cross-embodiment transfer claim.
  • Downgraded rdt-1b-a-diffusion-foundation-model-for-bimanual-manipulation to audited-needs-tightening after correcting the limitation section to reflect real-robot, not simulation-first, evaluation.
  • Corpus status after this pass: audited-solid 183, audited-needs-tightening 11, audited-needs-correction 3, unchecked 0.

[2026-04-11] audit | Random 20-paper serious-error check (seed 20260415)

  • Audited a fresh non-overlapping deterministic sample of 20 paper summaries against the primary papers, with emphasis on serious benchmark/setup errors rather than soft phrasing drift.
  • Downgraded flashocc-fast-and-memory-efficient-occupancy-prediction-via-channel-to-height-plugin to audited-needs-correction after fixing materially wrong mIoU / speed / memory claims and correcting the plug-in baseline framing.
  • Downgraded rt-2-vision-language-action-models-transfer-web-knowledge-to-robotic-control to audited-needs-tightening after removing an unsupported quantified chain-of-thought improvement claim.
  • The other 18 sampled pages did not show serious factual failures and were left unchanged; actual current frontmatter totals are 164 solid, 15 clean, 16 fixed, 1 needs-tightening, and 1 needs-correction.

[2026-04-11] audit | Random 20-paper sample (seed 20260416)

  • Audited a fresh 20-paper sample chosen to avoid overlap with all previous random samples recorded in wiki/queries/paper-fact-check-tracker.md.
  • Corrected gaussianformer-2-probabilistic-gaussian-superposition-for-efficient-3d-occupancy-prediction after the summary mixed a 25.6K-Gaussian ablation with the 12.8K-Gaussian nuScenes main result and therefore cited the wrong main-result resource figures.
  • Tightened chauffeurnet-learning-to-drive-by-imitating-the-best-and-synthesizing-the-worst by replacing unsupported aggregate collision/off-road claims with the paper's actual scenario-based closed-loop findings and real-world deployment description.
  • The other 18 sampled pages were materially consistent with their source papers.
  • Current frontmatter totals: audited-solid 162, audited-clean 15, audited-fixed 18, audited-needs-tightening 1, audited-needs-correction 1.

[2026-04-11] audit | Random 10-paper fact-check sample (seed 20260417)

  • Audited a fresh non-overlapping deterministic sample of 10 paper summaries against the original papers.
  • Corrected vectornet-encoding-hd-maps-and-agent-dynamics-from-vectorized-representation after the summary understated the ConvNet FLOP gap; the paper reports 10.56G vs 0.041G FLOPs (about 260x fewer / 99.6% lower), not 70% fewer.
  • Corrected smolvla-a-vision-language-action-model-for-affordable-robotics after the overview and results overstated the memory advantage; the paper says 6x less memory than pi0, not 7x.
  • The other 8 sampled pages were materially consistent with their source papers.
  • Current frontmatter totals: audited-solid 160, audited-clean 15, audited-fixed 20, audited-needs-tightening 1, audited-needs-correction 1.
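  • The frontmatter totals quoted in these passes are a straight tally over the corpus; a sketch, under the same paper-faithfullness: key assumption as the normalization sketch above:

```python
# Sketch of the status tally behind the "frontmatter totals" lines
# (illustrative; assumes one paper-faithfullness: line per page).
import re
from collections import Counter
from pathlib import Path

STATUS = re.compile(r"^paper-faithfullness:\s*(\S+)", re.M)

totals: Counter[str] = Counter()
for page in Path("wiki/sources/papers").glob("*.md"):
    m = STATUS.search(page.read_text())
    totals[m.group(1) if m else "unchecked"] += 1

print(dict(totals))  # e.g. {'audited-solid': 160, 'audited-clean': 15, ...}
```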