Overview
The Cosmos World Foundation Model Platform addresses Physical AI's critical challenge: the scarcity of safe, high-quality training data. By providing high-fidelity digital twins of the physical world, the platform enables safer and more efficient training for embodied agents across robotics and autonomous driving. The comprehensive ecosystem includes video data curation pipelines, pre-trained foundation models, specialized tokenizers, and robust safety mechanisms, all released as open-source and open-weight components.
The platform encompasses four major components: a scalable video curation pipeline processing 20 million hours of raw content into 100 million high-quality clips; advanced visual tokenizers supporting both continuous and discrete representations; two complementary World Foundation Model (WFM) architectures -- diffusion-based and autoregressive -- trained on 10,000 H100 GPUs; and fine-tuned models for camera control, robotic manipulation, and autonomous driving applications.
Key technical achievements include tokenization with +4 dB PSNR improvement and 2x-12x faster inference compared to prior work, real-time video generation at 10 FPS at 320x512 resolution, and downstream applications demonstrating lower FID/FVD scores and less than 7cm trajectory following error in autonomous driving. The paper candidly acknowledges that current WFMs struggle with complex physical laws and exhibit issues with object permanence and contact-rich dynamics.
Key Contributions
- Scalable video curation pipeline: Processing 20M hours of raw video into 100M high-quality clips for world model training
- Dual tokenizer architecture: Both continuous (for diffusion) and discrete (for autoregressive) visual tokenizers with +4 dB PSNR improvement
- Two complementary WFM architectures: Diffusion-based and autoregressive world models trained at massive scale on 10K H100 GPUs
- Open-source release: Full platform including models, tokenizers, and curation pipeline released as open-weight components
- Multi-domain fine-tuning: Demonstrated applications in camera control, robotic manipulation, and autonomous driving
Architecture / Method
┌─────────────────────────────────────────────────────────────────┐
│ Cosmos Platform Overview │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 1. Video Curation Pipeline │ │
│ │ 20M hrs raw ──► Filter/Dedup ──► Quality Score ──► 100M │ │
│ │ + Tag clips │ │
│ └──────────────────────────┬────────────────────────────────┘ │
│ │ Curated Video Corpus │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 2. Visual Tokenizers │ │
│ │ ┌─────────────────┐ ┌──────────────────┐ │ │
│ │ │ Continuous │ │ Discrete │ │ │
│ │ │ Tokenizer │ │ Tokenizer │ │ │
│ │ │ (latent vecs) │ │ (token indices) │ │ │
│ │ └────────┬────────┘ └────────┬─────────┘ │ │
│ └───────────┼────────────────────────┼──────────────────────┘ │
│ ▼ ▼ │
│ ┌───────────────────┐ ┌────────────────────────┐ │
│ │ 3a. Diffusion │ │ 3b. Autoregressive │ │
│ │ WFM │ │ WFM │ │
│ │ (iterative │ │ (next-token │ │
│ │ denoising) │ │ prediction) │ │
│ └─────────┬─────────┘ └───────────┬────────────┘ │
│ └──────────────┬──────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 4. Domain Fine-Tuning │ │
│ │ ┌────────────┐ ┌──────────────┐ ┌────────────────┐ │ │
│ │ │ Camera │ │ Robotic │ │ Autonomous │ │ │
│ │ │ Control │ │ Manip. │ │ Driving │ │ │
│ │ └────────────┘ └──────────────┘ └────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Trained on 10,000 H100 GPUs
The Cosmos platform has four pillars:
1. Video Curation Pipeline: Processes raw internet video through filtering, deduplication, quality scoring, and semantic tagging to produce a curated training corpus. The pipeline scales from 20M hours of raw content to 100M high-quality clips.
2. Visual Tokenizers: Two types serve different downstream architectures: - Continuous tokenizers produce latent representations for diffusion-based models - Discrete tokenizers produce token sequences for autoregressive models - Both achieve +4 dB PSNR over prior art with 2-12x faster inference
3. World Foundation Models: - Diffusion-based WFM: Generates future video frames by iterative denoising, conditioned on past frames and optional control signals - Autoregressive WFM: Predicts next video tokens in sequence, enabling integration with language model architectures
4. Downstream Fine-tuning: Models are adapted for specific domains: - Camera control (view synthesis) - Robotic manipulation (action-conditioned world simulation) - Autonomous driving (future scene prediction with trajectory conditioning)
Results
| Metric | Cosmos Performance |
|---|---|
| Tokenizer PSNR improvement | +4 dB over prior art |
| Inference speedup | 2x-12x faster |
| Video generation speed | 10 FPS at 320x512 |
| AD trajectory error | <7 cm |
| FID/FVD scores | Lower than baselines |
| Training scale | 10,000 H100 GPUs |
The platform demonstrates competitive world modeling quality across domains, with particular strength in driving scene generation where trajectory conditioning enables realistic future prediction.
Limitations & Open Questions
- Current WFMs struggle with complex physical laws (gravity, friction, fluid dynamics), limiting their fidelity as physics simulators
- Object permanence and contact-rich dynamics remain challenging -- objects may appear/disappear or interpenetrate in generated scenes
- The gap between world model predictions and ground-truth physics simulation is not yet characterized well enough for safety-critical applications
Connections
- Denoising Diffusion Probabilistic Models -- Diffusion framework that underlies the diffusion-based WFM architecture
- Wote End To End Driving With Online Trajectory Evaluation Via Bev World Model -- WoTE uses world models for trajectory evaluation; Cosmos provides the foundation platform for training such models
- Groot N1 An Open Foundation Model For Generalist Humanoid Robots -- GR00T N1 uses synthetic simulation data from its data pyramid; Cosmos could provide higher-fidelity simulation
- Robotics -- World models as training data generators for robotic agents
- Autonomous Driving -- Future scene prediction for planning and simulation
- Foundation Models -- Extends the foundation model paradigm to world simulation