NVIDIA released Cosmos 2.5, the next generation of its world foundation models designed for synthetic data generation and physical AI reasoning. The update introduces three specialized models that let developers generate photorealistic video from structural inputs, predict future world states, and perform spatiotemporal reasoning across long video sequences. All models are available now on GitHub, Hugging Face, and build.nvidia.com.

What Happened

On March 13, 2026, NVIDIA announced the Cosmos 2.5 family of world foundation models, expanding its toolkit for developers building physical AI systems and synthetic training pipelines. The release includes three distinct models, each targeting a different stage of the world simulation and understanding process.

Cosmos Transfer 2.5 generates photorealistic video from structural inputs such as segmentation maps, depth data, and 3D bounding boxes. Built on a ControlNet architecture, it translates technical scene descriptions into visually realistic output that can be used for training data or simulation environments.
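Cosmos Transfer's exact input schema isn't specified here, but the general pattern behind ControlNet-style conditioning — packing segmentation labels and depth into aligned per-pixel channels — can be sketched as follows. The function name, channel layout, and shapes are illustrative assumptions, not the model's actual API:

```python
import numpy as np

def build_conditioning(seg_map, depth_map, num_classes):
    """Stack a segmentation map and a depth map into one
    multi-channel conditioning array (H, W, num_classes + 1)."""
    # One-hot encode segmentation labels: (H, W) -> (H, W, num_classes)
    seg_onehot = np.eye(num_classes, dtype=np.float32)[seg_map]
    # Normalize depth to [0, 1] so all channels share a common scale
    d_min, d_max = depth_map.min(), depth_map.max()
    depth_norm = (depth_map - d_min) / max(d_max - d_min, 1e-8)
    # Concatenate along the channel axis
    return np.concatenate(
        [seg_onehot, depth_norm[..., None].astype(np.float32)], axis=-1
    )

# Example: a 4x4 scene with 3 semantic classes
seg = np.array([[0, 0, 1, 1],
                [0, 2, 2, 1],
                [0, 2, 2, 1],
                [0, 0, 1, 1]])
depth = np.linspace(0.5, 5.0, 16).reshape(4, 4)
cond = build_conditioning(seg, depth, num_classes=3)
print(cond.shape)  # (4, 4, 4)
```

In a real pipeline, arrays like these would come per frame from a renderer or annotation tool before being handed to the generation model.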

Cosmos Predict 2.5 models future world states as video sequences, producing up to 30-second clips with multiview support. With post-training, NVIDIA reports up to 10x higher accuracy in predicting how physical environments will change over time, making it a core tool for robotics and autonomous system development.

Cosmos Reason 2 handles spatiotemporal reasoning across video with a 256K token context window. It can perform object detection with bounding boxes and analyze how objects move and interact across extended sequences, giving AI systems a deeper understanding of physical environments.
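The article doesn't describe how Reason 2 tokenizes video, but even a 256K-token window forces a budgeting decision for long sequences. The sketch below plans frame chunks under a purely hypothetical per-frame token cost; `TOKENS_PER_FRAME` and `RESERVED` are assumptions for illustration, not the model's real tokenizer figures:

```python
# Hypothetical planner for fitting a long video into a 256K-token context.
CONTEXT_WINDOW = 256_000
TOKENS_PER_FRAME = 576   # assumption: e.g. a 24x24 patch grid per frame
RESERVED = 4_096         # assumption: budget for prompt and answer

def plan_chunks(total_frames, sample_every=1):
    """Split a frame sequence into chunks that each fit the window."""
    frames = list(range(0, total_frames, sample_every))
    # How many sampled frames fit alongside the reserved text budget
    budget = (CONTEXT_WINDOW - RESERVED) // TOKENS_PER_FRAME
    return [frames[i:i + budget] for i in range(0, len(frames), budget)]

# A 30-minute clip at 30 fps, sampled at 1 frame per second
chunks = plan_chunks(total_frames=30 * 60 * 30, sample_every=30)
print(len(chunks), len(chunks[0]))  # 5 437
```

Under these assumptions a half-hour clip needs five passes; with the real tokenizer costs, the same arithmetic tells you whether a sequence fits in one shot.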

Why It Matters

Training physical AI systems, from robots to autonomous vehicles, requires massive amounts of real-world data that is expensive and time-consuming to collect. Cosmos 2.5 addresses this by enabling high-quality synthetic data generation that can supplement or replace real-world data collection in many scenarios. The combination of generation, prediction, and reasoning in a single model family means developers can build end-to-end pipelines without stitching together disparate tools.

This release also signals NVIDIA's continued push beyond traditional GPU hardware into AI software infrastructure. As the company outlined at GTC 2026, world foundation models are central to its strategy for physical AI, bridging the gap between simulation and real-world deployment.

Key Details

  • Three models: Transfer 2.5 (video generation from structural data), Predict 2.5 (future state modeling), Reason 2 (spatiotemporal understanding)
  • Architecture: Transfer uses ControlNet; Reason supports 256K context for long video analysis
  • Performance: Predict achieves up to 10x accuracy improvement with post-training and generates 30-second multiview sequences
  • Availability: Open access on GitHub, Hugging Face, and build.nvidia.com
  • Implementation: The Cosmos Cookbook provides guides for developers integrating these models
  • Primary use cases: Synthetic training data, physical AI development, robotics simulation, autonomous systems

What to Do Next

Developers working on physical AI, robotics, or autonomous systems should explore the Cosmos 2.5 model family through the Cosmos Cookbook on GitHub. Transfer 2.5 is the starting point for anyone needing synthetic video data from existing 3D scene layouts or depth maps. For teams already generating simulation data, Predict 2.5 can extend their pipelines with future-state prediction capabilities.

Creative AI developers should pay attention to Transfer 2.5 in particular. Its ability to convert structural scene descriptions into photorealistic video has applications beyond robotics, including architectural visualization, game development, and visual effects previsualization. The ControlNet-based approach means integration with existing 3D workflows is straightforward for teams already working with depth and segmentation data.