NVIDIA Cosmos 3 Tops Open-Source Image and Video Benchmarks

NVIDIA released Cosmos 3 on May 31, 2026, and it immediately ranked first among open-source models on the Artificial Analysis Text-to-Image leaderboard. The 64-billion-parameter model also posts top scores on image-to-video benchmarks, making it the highest-ranked open model for both tasks.

The weights are available now on Hugging Face under the OpenMDW 1.1 license, which allows commercial use of outputs with attribution.

What Happened

At Computex 2026, NVIDIA open-sourced Cosmos 3 as the first open omni-model for physical AI. The release includes weights, training code, and datasets for two base model sizes. Both integrate with Hugging Face Diffusers, vLLM, and PyTorch, so they slot into existing local inference pipelines without custom tooling.

Alongside the base models, NVIDIA published task-specific variants optimized for single use cases: Cosmos3-Super-Text2Image and Cosmos3-Super-Image2Video. These are fine-tuned sub-models built on the Super base, stripped down to one output modality each for faster inference and cleaner results on creative tasks.

Why It Matters for Creators

The T2I leaderboard ranking is the headline number. Artificial Analysis measures image quality, prompt adherence, and aesthetic consistency across both open and closed models. Cosmos 3 Super reaching the top spot among open models means it outperforms existing options including FLUX and SDXL variants on that benchmark.

The image-to-video performance matters for anyone working on product visualization, architectural walkthroughs, or animation references. Cosmos 3 takes a static image and generates physically plausible motion video. The I2V output contains no audio, which keeps the model focused and the file sizes manageable.

The licensing terms are creator-friendly. The OpenMDW 1.1 license NVIDIA uses for Cosmos permits commercial use of generated outputs, meaning studios and freelancers can incorporate results into client work.

Model Comparison

Model	Parameters	Hardware Target	Primary Task	License
Cosmos3-Nano	8B	RTX PRO 6000, workstation	Multi-modal omni-model	OpenMDW 1.1
Cosmos3-Super	64B	Hopper / Blackwell GPU	Multi-modal omni-model	OpenMDW 1.1
Cosmos3-Super-Text2Image	64B	Hopper / Blackwell GPU	Image generation only	OpenMDW 1.1
Cosmos3-Super-Image2Video	64B	Hopper / Blackwell GPU	Image-to-video only	OpenMDW 1.1
FLUX.1-dev	12B	24GB VRAM consumer GPU	Image generation	FLUX.1 [dev] Non-Commercial License
SDXL	3.5B	8GB+ VRAM consumer GPU	Image generation	CreativeML Open RAIL++-M

How to Run Cosmos 3 Locally

The Nano model targets workstation hardware at 8 billion total parameters in BF16. Expect roughly 16GB of VRAM as a starting requirement, though community testing will surface the actual floor. The Cosmos3-Nano page on Hugging Face includes the recommended launch configuration.
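The 16GB figure follows from simple arithmetic: BF16 stores each parameter in 2 bytes, so the weights alone occupy the parameter count times 2 bytes. The sketch below reproduces that estimate. The function name is illustrative, and the result covers weights only, not activations, latents, or framework overhead, which is why it is best read as a floor.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Parameter counts are the ones quoted in the comparison table.
    for name, params in [("Cosmos3-Nano", 8e9), ("Cosmos3-Super", 64e9)]:
        print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in BF16")
```

By the same arithmetic, the 64 billion parameters of the Super model come to about 128 GB of weights in BF16, which is consistent with the data-center hardware target listed in the table.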

A minimal Text2Image run using Diffusers:

import torch
from diffusers import Cosmos3OmniPipeline

# Load the Text2Image variant in BF16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super-Text2Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Generate one image from a text prompt and save it to disk.
result = pipe("A studio product photograph, soft lighting, white background")
result.images[0].save("output.jpg")

For the Nano variant on workstation hardware, replace the model ID with nvidia/Cosmos3-Nano. The API is identical between models.

Production deployment options include vLLM-Omni and NVIDIA NIM microservices. The NVIDIA Developer blog covers the full deployment path for each framework.

Architecture Notes

Cosmos 3 uses a Mixture-of-Transformers (MoT) design with two independent towers. The reasoning tower processes inputs across text, image, video, and action data. The generation tower produces outputs in the same modality range. Because the towers are separate, the model can switch between tasks in a single session without reloading weights.
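The "load once, switch tasks" behavior can be pictured with a toy dispatcher. This is a schematic sketch only: the class, method, and task names are hypothetical stand-ins, not the Cosmos 3 API, and the "towers" here are plain Python functions rather than transformer stacks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToyOmniSession:
    """Holds two stand-in towers that are loaded once and reused across tasks."""

    load_count: int = 0
    towers: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def load(self) -> None:
        # Stand-in for reading weights from disk; counted so reuse is visible.
        self.towers["reasoning"] = lambda prompt: f"plan({prompt})"
        self.towers["generation"] = lambda plan: f"output[{plan}]"
        self.load_count += 1

    def run(self, task: str, prompt: str) -> str:
        if not self.towers:
            self.load()
        # In this toy, every task passes through both towers; only the task tag changes.
        plan = self.towers["reasoning"](f"{task}: {prompt}")
        return self.towers["generation"](plan)


if __name__ == "__main__":
    session = ToyOmniSession()
    for task in ["text2image", "image2video"]:
        print(session.run(task, "a red cube on a table"))
    print("loads:", session.load_count)
```

The point of the toy is the counter: two different tasks run in one session, and the loading step happens only once.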

The Super model was trained on 1.3 billion data points spanning robotics demonstrations, autonomous driving footage, and open web visual data. That breadth is what gives the image quality its edge: the model has seen more edge cases in physical environments than models trained purely on aesthetic datasets.

Frequently Asked Questions

Does Cosmos 3 outperform FLUX.1 for creative image generation?

On the Artificial Analysis T2I leaderboard, Cosmos 3 Super ranks above current open models including FLUX variants. However, benchmark rankings measure specific quality dimensions. FLUX.1-dev produces strong artistic outputs and runs on consumer 24GB cards. Test both on your specific prompts before switching production pipelines.

Can I use Cosmos 3 outputs commercially?

Yes. The OpenMDW 1.1 license permits commercial use of model outputs with attribution. Redistribution of the model weights or fine-tuned derivatives requires maintaining the same license. Check the full license on the GitHub repository for the complete terms.

What GPU do I need to run Cosmos 3 Nano locally?

NVIDIA documents Cosmos 3 Nano running on the RTX PRO 6000 workstation GPU. In BF16, the 8B-parameter Nano model needs roughly 16GB of VRAM for the weights alone (8 billion parameters at 2 bytes each), so treat that figure as a floor. Consumer GPU compatibility will depend on community testing in the weeks following launch.

How does the Image2Video variant work?

Cosmos3-Super-Image2Video takes a still image as input alongside a text prompt describing desired motion, and outputs an MP4 video. The generation is physics-aware, meaning the model attempts to produce motion that matches real-world movement patterns rather than arbitrary animation. Output does not include audio.

Is Cosmos 3 available through a hosted API?

NVIDIA is rolling out Cosmos 3 through NIM microservices on its cloud AI platform. Teams that need on-premise inference can self-host using vLLM-Omni. The Hugging Face Inference API also provides access to the model for testing without local setup.
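For a quick hosted test, a sketch along these lines may work with the huggingface_hub client. InferenceClient.text_to_image is an existing method of that library, but whether the Cosmos 3 checkpoints are actually served through it, and under which model ID, is an assumption taken from the answer above. The helper names build_request and generate are illustrative.

```python
from typing import Any, Dict

# Assumed to be served by the hosted endpoint; verify on the model page first.
MODEL_ID = "nvidia/Cosmos3-Super-Text2Image"


def build_request(prompt: str, model: str = MODEL_ID) -> Dict[str, Any]:
    """Collect the keyword arguments passed to InferenceClient.text_to_image."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {"prompt": prompt, "model": model}


def generate(prompt: str, out_path: str = "hosted_output.png") -> str:
    """Request one image from the hosted endpoint and save it locally."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import InferenceClient

    client = InferenceClient()  # picks up a saved or environment HF token if present
    image = client.text_to_image(**build_request(prompt))  # returns a PIL image
    image.save(out_path)
    return out_path


# Example call (needs network access and, typically, a Hugging Face token):
# generate("A studio product photograph, soft lighting, white background")
```

Keeping the request construction separate from the network call makes it easy to check the prompt and model ID before spending any hosted quota.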

Why is the model called an omni-model?

Omni-model refers to its ability to accept and produce multiple modalities in a single architecture: text, images, video, audio, and action trajectories. Most open models handle one or two modalities. Cosmos 3 handles all five through its dual-tower MoT design, with specialized sub-models for tasks that need focused performance.

What to Do Next

Download the Nano model for local testing: nvidia/Cosmos3-Nano on Hugging Face
Try the Text2Image variant via the Hugging Face Inference API before committing to local hardware
Read the full release post on the Hugging Face blog for benchmark methodology details
Check the Artificial Analysis leaderboard to see how Cosmos 3 ranks against commercial models