Overview

The Cosmos World Foundation Model Platform addresses Physical AI's critical challenge: the scarcity of safe, high-quality training data. By providing high-fidelity digital twins of the physical world, the platform enables safer and more efficient training for embodied agents across robotics and autonomous driving. The comprehensive ecosystem includes video data curation pipelines, pre-trained foundation models, specialized tokenizers, and robust safety mechanisms, all released as open-source and open-weight components.

The platform encompasses four major components: a scalable video curation pipeline processing 20 million hours of raw content into 100 million high-quality clips; advanced visual tokenizers supporting both continuous and discrete representations; two complementary World Foundation Model (WFM) architectures -- diffusion-based and autoregressive -- trained on 10,000 H100 GPUs; and fine-tuned models for camera control, robotic manipulation, and autonomous driving applications.

Key technical achievements include tokenizers with a +4 dB PSNR improvement and 2x-12x faster inference than prior work, real-time video generation at 10 FPS at 320x512 resolution, and downstream applications showing lower FID/FVD scores and less than 7 cm trajectory-following error in autonomous driving. The paper candidly acknowledges that current WFMs struggle with complex physical laws and exhibit issues with object permanence and contact-rich dynamics.

Key Contributions

  • Scalable video curation pipeline: Processing 20M hours of raw video into 100M high-quality clips for world model training
  • Dual tokenizer architecture: Both continuous (for diffusion) and discrete (for autoregressive) visual tokenizers with +4 dB PSNR improvement
  • Two complementary WFM architectures: Diffusion-based and autoregressive world models trained at massive scale on 10K H100 GPUs
  • Open-source release: Full platform including models, tokenizers, and curation pipeline released as open-weight components
  • Multi-domain fine-tuning: Demonstrated applications in camera control, robotic manipulation, and autonomous driving

Architecture / Method

┌─────────────────────────────────────────────────────────────────┐
│                    Cosmos Platform Overview                     │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ 1. Video Curation Pipeline                                │  │
│  │ 20M hrs raw ──► Filter/Dedup ──► Quality Score ──► 100M   │  │
│  │                                  + Tag             clips  │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │ Curated Video Corpus              │
│                             ▼                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ 2. Visual Tokenizers                                      │  │
│  │ ┌─────────────────┐      ┌──────────────────┐             │  │
│  │ │   Continuous    │      │     Discrete     │             │  │
│  │ │   Tokenizer     │      │    Tokenizer     │             │  │
│  │ │  (latent vecs)  │      │  (token indices) │             │  │
│  │ └────────┬────────┘      └────────┬─────────┘             │  │
│  └──────────┼────────────────────────┼───────────────────────┘  │
│             ▼                        ▼                          │
│  ┌───────────────────┐   ┌────────────────────────┐             │
│  │ 3a. Diffusion     │   │ 3b. Autoregressive     │             │
│  │     WFM           │   │     WFM                │             │
│  │ (iterative        │   │ (next-token            │             │
│  │  denoising)       │   │  prediction)           │             │
│  └─────────┬─────────┘   └───────────┬────────────┘             │
│            └──────────────┬──────────┘                          │
│                           ▼                                     │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ 4. Domain Fine-Tuning                                     │  │
│  │ ┌────────────┐  ┌──────────────┐  ┌────────────────┐      │  │
│  │ │  Camera    │  │   Robotic    │  │   Autonomous   │      │  │
│  │ │  Control   │  │   Manip.     │  │    Driving     │      │  │
│  │ └────────────┘  └──────────────┘  └────────────────┘      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
  Trained on 10,000 H100 GPUs

The Cosmos platform has four pillars:

1. Video Curation Pipeline: Processes raw internet video through filtering, deduplication, quality scoring, and semantic tagging to produce a curated training corpus. The pipeline scales from 20M hours of raw content to 100M high-quality clips.
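The stages of that pipeline can be illustrated with a toy sketch. Everything here is hypothetical scaffolding, not the Cosmos implementation: the `Clip` fields, the 2-second duration cutoff, and the perceptual-hash deduplication are stand-ins for the paper's filtering, dedup, and quality-scoring stages.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    duration_s: float
    quality: float  # hypothetical quality score in [0, 1]
    phash: int      # perceptual hash used here for near-duplicate detection

def curate(clips, min_quality=0.5):
    """Toy filter -> dedup -> quality-gate pipeline; thresholds are illustrative."""
    seen_hashes = set()
    kept = []
    for c in clips:
        if c.duration_s < 2.0:        # filter: drop very short clips
            continue
        if c.phash in seen_hashes:    # dedup: drop repeated perceptual hashes
            continue
        seen_hashes.add(c.phash)
        if c.quality >= min_quality:  # quality gate: keep only high-scoring clips
            kept.append(c)
    return kept
```

At production scale each stage would run as a distributed job over shards of video, but the per-clip decision logic has this shape.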

2. Visual Tokenizers: Two types serve different downstream architectures:
     • Continuous tokenizers produce latent representations for diffusion-based models
     • Discrete tokenizers produce token sequences for autoregressive models
     • Both achieve +4 dB PSNR over prior art with 2-12x faster inference
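The two interfaces can be contrasted with a minimal sketch. This is not the Cosmos tokenizer architecture (which is a learned neural encoder): the linear projection stands in for continuous encoding, and the nearest-codebook lookup stands in for VQ-style discrete quantization.

```python
import math

def continuous_tokenize(frame, proj):
    """Continuous sketch: project a flattened frame to a latent vector.

    proj is a list of rows; the real encoder is a trained network."""
    return [sum(w * x for w, x in zip(row, frame)) for row in proj]

def discrete_tokenize(latent, codebook):
    """Discrete (VQ-style) sketch: map a latent to its nearest codebook index."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(latent, codebook[i]))
```

The key design point survives the simplification: a diffusion model consumes the latent vector directly, while an autoregressive model consumes the integer index, which is why one curated corpus can feed both WFM families.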

3. World Foundation Models:
     • Diffusion-based WFM: Generates future video frames by iterative denoising, conditioned on past frames and optional control signals
     • Autoregressive WFM: Predicts next video tokens in sequence, enabling integration with language model architectures
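The two generation loops differ in shape, which a stripped-down sketch makes concrete. The `denoise_step` and `next_token` callables are placeholders for the trained networks; the actual Cosmos samplers include conditioning on past frames and control signals omitted here.

```python
def diffusion_rollout(x_noisy, denoise_step, n_steps):
    """Diffusion-style sketch: start from noise and iteratively denoise."""
    x = x_noisy
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t)  # stand-in for the learned denoiser
    return x

def autoregressive_rollout(prompt_tokens, next_token, n_new):
    """Autoregressive sketch: extend a video-token sequence one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(next_token(tokens))  # stand-in for the learned predictor
    return tokens
```

The diffusion loop refines one whole sample over a fixed number of steps, while the autoregressive loop grows the sequence token by token, which is what makes the latter compatible with language-model tooling.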

4. Downstream Fine-tuning: Models are adapted for specific domains:
     • Camera control (view synthesis)
     • Robotic manipulation (action-conditioned world simulation)
     • Autonomous driving (future scene prediction with trajectory conditioning)
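All three domains share one pattern: the next predicted state depends on the previous state plus a per-step control signal (camera pose, robot action, or trajectory waypoint). A hedged sketch of that common interface, with `step` standing in for a fine-tuned WFM:

```python
def conditioned_rollout(state, controls, step):
    """Control-conditioned prediction sketch: each future state is produced
    from the previous state and one control input."""
    states = [state]
    for c in controls:
        states.append(step(states[-1], c))
    return states
```

Swapping the control type (and the fine-tuned `step`) is what turns one pre-trained world model into a camera controller, a manipulation simulator, or a driving predictor.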

Results

Metric                      Cosmos Performance
──────────────────────────  ────────────────────
Tokenizer PSNR improvement  +4 dB over prior art
Inference speedup           2x-12x faster
Video generation speed      10 FPS at 320x512
AD trajectory error         <7 cm
FID/FVD scores              Lower than baselines
Training scale              10,000 H100 GPUs

The platform demonstrates competitive world modeling quality across domains, with particular strength in driving scene generation where trajectory conditioning enables realistic future prediction.

Limitations & Open Questions

  • Current WFMs struggle with complex physical laws (gravity, friction, fluid dynamics), limiting their fidelity as physics simulators
  • Object permanence and contact-rich dynamics remain challenging -- objects may appear/disappear or interpenetrate in generated scenes
  • The gap between world model predictions and ground-truth physics simulation is not yet characterized well enough for safety-critical applications

Connections