NVIDIA's SANA-WM is a 2.6 billion-parameter open-source world model that generates full-minute, 720p AI video with 6-degree-of-freedom camera control on a single GPU, a capability that previously required cloud-scale industrial infrastructure. The model is released under Apache 2.0 and was trained on roughly 213K public video clips over 15 days on 64 H100 GPUs. Its distilled variant generates a 60-second 720p clip on an RTX 5090 in 34 seconds, a throughput 36 times higher than prior open-source baselines.

What Happened

NVIDIA Labs published SANA-WM on May 16, 2026, releasing the paper alongside a project page and the source repository on GitHub; model weights and inference code are listed as coming soon. The full title of the research is "SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer."

This is not an incremental update. Open-source video models before SANA-WM have topped out at roughly 5-10 seconds of coherent output. SANA-WM pushes that ceiling to 60 seconds (six to twelve times longer) at 720p resolution, with controllable camera movement throughout the entire sequence.

What SANA-WM Is

SANA-WM is a world model: an AI system trained to predict what a scene looks like from any camera angle at any point in time, given a starting frame and a movement trajectory. Unlike standard video generators that hallucinate motion frame-by-frame, a world model maintains 3D spatial understanding across the full sequence. This is why it can hold 60 seconds of consistent scene geometry rather than drifting apart after a few seconds.

The "WM" suffix distinguishes this from SANA's earlier image generation work. SANA-WM is purpose-built for video, embodied AI, and simulation use cases where long temporal coherence and precise camera control are non-negotiable.

The Four Core Innovations


The paper details four technical designs that together enable minute-scale video on consumer hardware.

1. Hybrid linear attention. Standard transformer attention scales quadratically with sequence length, making long video computationally prohibitive. SANA-WM combines frame-wise linear attention with softmax attention in a hybrid design that handles long-context sequences without the memory explosion. This is the primary reason the model fits on a single GPU.

2. Dual-branch camera control. Camera trajectory is injected through a two-branch conditioning mechanism: one branch for global camera pose (where you are), one for local movement (how you're moving). The result is 6-DoF (degrees of freedom) camera control: pan, tilt, roll, plus X/Y/Z translation. You can specify a precise dolly-in, orbit, or crane shot and the model follows it throughout the full 60-second clip.

3. Two-stage generation pipeline. Rather than generating all frames at once, SANA-WM uses a keyframe stage followed by an interpolation stage. The first stage establishes spatial anchors for scene consistency; the second fills in temporal detail between them. This mirrors how professional animators think about shot design and significantly reduces temporal drift in longer clips.

4. Robust annotation pipeline. Training data quality is often the ceiling for video models. The team built an automated pipeline to extract camera poses from publicly available video, enabling training on 213K clips without proprietary datasets. This annotation approach is part of the open release and is immediately reusable by the community for fine-tuning.
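The linear-attention idea from item 1 can be illustrated with a minimal sketch. This is generic kernelized linear attention with an `elu(x) + 1` feature map, chosen here as an assumption for illustration; it is not SANA-WM's actual layer, which the paper describes as frame-wise and combined with softmax attention. The function names are hypothetical. The point is that summarizing keys and values once gives the same result as building an explicit N x N weight matrix, at a cost that grows linearly with the number of tokens.

```python
import math


def phi(x):
    # elu(x) + 1: a positive feature map (an assumption for this sketch)
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention_naive(Q, K, V):
    """Kernelized attention with explicit N x N weights (quadratic in N)."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        # One weight per key: this inner loop is what makes the cost quadratic.
        weights = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same math, but K and V are summarized once (linear in N)."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over tokens of phi(k)^T v
    k_sum = [0.0] * d                    # sum over tokens of phi(k)
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for i in range(d):
            k_sum[i] += fk[i]
            for j in range(dv):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        norm = sum(a * b for a, b in zip(fq, k_sum))
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(dv)])
    return out


if __name__ == "__main__":
    Q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
    K = [[0.3, 0.1], [-0.6, 0.2], [0.0, -0.7]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(linear_attention(Q, K, V))
    print(linear_attention_naive(Q, K, V))  # matches up to rounding
```

The `kv` and `k_sum` summaries have a fixed size regardless of how many tokens were folded into them, which is why memory stays bounded as the clip gets longer.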

How SANA-WM Compares to Existing Open-Source Video Models

Model         | Max Duration | Resolution | Parameters | License      | Camera Control | Single GPU
--------------|--------------|------------|------------|--------------|----------------|-------------------
SANA-WM       | 60 seconds   | 720p       | 2.6B       | Apache 2.0   | 6-DoF          | Yes (distilled)
HunyuanVideo  | ~5 seconds   | 720p       | 13B        | Open weights | Limited        | High VRAM required
CogVideoX     | 6 seconds    | 480p/720p  | 5B         | Apache 2.0   | None           | 24GB VRAM
LTX Video 2.3 | ~10 seconds  | 480p-720p  | 2.1B       | Open weights | Limited        | Yes
Wan-2.1       | ~8 seconds   | 480p       | 14B        | Apache 2.0   | None           | High VRAM required

The gap is stark. SANA-WM generates clips six to twelve times longer than its nearest open-source peers, at equal or higher resolution, with explicit camera control and a reported 36 times the throughput, while using fewer parameters than most of the competition. The architectural bet on hybrid linear attention rather than brute-force scale is paying off.

Commercial models like Kling 3.0 (covered in detail in our Kling 3 workflow tutorial) and Seedance 2.0 still lead on visual fidelity and production polish, but they are closed-source and priced per clip. SANA-WM changes the calculus for creators who need long-form video at scale or who want to fine-tune on their own footage.

What Creators Get

The practical implications break down by workflow type.

Scene filmmakers and animators. A 60-second clip is a scene, not a fragment. Previous open-source video was limited to establishing shots and short cuts; you stitched together a longer sequence from disconnected generations. SANA-WM can hold a full scene from wide to close, following a camera path, without the seams. The 6-DoF camera control means you can specify a dolly-in that starts at 3 meters and closes to 0.5 meters over 45 seconds; the model follows it.
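As a rough illustration of what such a dolly-in looks like as data, here is a minimal sketch that builds the 3-meter-to-0.5-meter move as a list of poses. The `CameraPose` class, its field names, the 16 fps rate, and the convention that the subject sits at the origin with the camera on the +Z axis are all assumptions made for this example; SANA-WM's actual trajectory format is defined in the paper and repository and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical 6-DoF pose: translation in meters, rotation in degrees."""
    x: float
    y: float
    z: float
    pan: float
    tilt: float
    roll: float


def dolly_in(start_dist, end_dist, duration_s, fps):
    """Move the camera in a straight line along Z toward a subject at the origin."""
    n = int(round(duration_s * fps))  # number of frame intervals
    poses = []
    for i in range(n + 1):
        t = i / n  # normalized time, 0.0 at the start and 1.0 at the end
        z = start_dist + t * (end_dist - start_dist)
        poses.append(CameraPose(0.0, 0.0, z, 0.0, 0.0, 0.0))
    return poses


if __name__ == "__main__":
    # 3 m down to 0.5 m over 45 seconds; 16 fps is an assumed frame rate.
    path = dolly_in(3.0, 0.5, 45, 16)
    print(len(path), path[0].z, path[-1].z)
```

Linear interpolation gives a constant-speed move; an ease-in or ease-out curve would only change how `t` is mapped before computing `z`.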

Game and world builders. The paper frames SANA-WM explicitly as an embodied AI and simulation tool. For creators building virtual worlds or game cinematics, consistent 3D spatial understanding across 60 seconds means scene walk-throughs, fly-throughs, and exterior reveals are now possible without hand-keyframing every camera position in a 3D package.

Creators who need local generation. The distilled variant, which renders a 60-second 720p clip on a single RTX 5090 in 34 seconds, changes the economics. Generating on your own machine carries no per-clip fee, only hardware and electricity, which removes the per-generation pricing that makes iterating on commercial video tools expensive. For creators already running LTX Director in ComfyUI, SANA-WM is a natural upgrade path for longer content once community nodes are available.

Fine-tuners and researchers. Apache 2.0 licensing means you can fine-tune on proprietary footage, use in commercial products, and modify the architecture. The annotation pipeline for camera pose extraction is included; apply it to your own video dataset.

Creator Outcome: How to Integrate

SANA-WM's documentation is marked "coming soon," meaning inference code and fine-tuning guides are not yet finalized. Here is what you can do now and what to watch for when the full release lands.

Now: Clone the NVlabs/Sana repository, review the architecture code and camera conditioning format, and read the paper to understand the 6-DoF trajectory specification (Section 3.2 covers the dual-branch conditioning API in detail). If you are building a ComfyUI integration, the dual-branch conditioning mechanism is designed to be modular, which is where community custom nodes will hook in.

When inference drops: The priority workflow is camera-controlled scene generation. Specify a text prompt (scene description), a starting frame, and a camera trajectory as a sequence of 6-DoF poses. The model generates the full clip. For creators currently using ControlNet-based camera control in image generation, the mental model is similar but extended into time.

For fine-tuning: The annotation pipeline generates camera pose labels from unlabeled video using structure-from-motion. Run your footage through it to build a labeled dataset, then fine-tune SANA-WM on your visual style. The Apache 2.0 license covers commercial use of fine-tuned derivatives.

ComfyUI integration is the most likely first community port given existing nodes for video model loading and inference. Watch the ComfyUI-Manager registry for an NVlabs/Sana node package in the weeks following the inference release.

Availability and Hardware Requirements


The model weights and inference code are on a "coming soon" timeline as of May 16, 2026. Based on the paper, expect:

  • Full 2.6B model: estimated 24-32GB VRAM for inference (similar to HunyuanVideo at comparable resolution)
  • Distilled variant: runs on RTX 5090 (32GB VRAM) in 34 seconds per 60-second clip
  • Training: 64 H100 GPUs for 15 days (not consumer-feasible, but fine-tuning on smaller datasets should be accessible)
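To put the figures above in perspective, the short calculation below converts them into GPU-hours and a real-time factor. It uses only the numbers quoted in this article; the function names are made up for the example.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for a run that keeps every GPU busy for the full period."""
    return num_gpus * days * 24


def realtime_factor(clip_seconds, wall_seconds):
    """Seconds of video produced per second of wall-clock time."""
    return clip_seconds / wall_seconds


if __name__ == "__main__":
    print(gpu_hours(64, 15))                  # 23040
    print(round(realtime_factor(60, 34), 2))  # 1.76
```

In other words, the reported training run is about 23,000 H100-hours, and the distilled variant produces video roughly 1.76 times faster than real time on the RTX 5090.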

Apache 2.0 means the weights, once published, will be freely downloadable and usable commercially, subject to the license's notice and attribution terms.

Frequently Asked Questions

Does SANA-WM replace existing short-form video models?

Not immediately. Short-form models like LTX and CogVideoX are mature, have active community node ecosystems, and produce strong results in their native duration range. SANA-WM fills a different slot: scenes, not clips. The two will likely coexist in creator pipelines with SANA-WM handling establishing shots and long sequences while specialized models handle detail-heavy short cuts.

What does 6-DoF camera control actually mean in practice?

6-DoF stands for 6 degrees of freedom: pan left/right, tilt up/down, roll (rotate the horizon), plus X/Y/Z translation (move the camera physically through space). In video terms: a dolly-in moves along the Z axis. A crane shot combines Y translation with tilt. An orbit combines X/Z translation with pan. SANA-WM accepts these as a sequence of poses over time, giving you per-frame camera precision throughout the 60-second generation.
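The orbit case is the least obvious of the three, so here is a minimal sketch of how X/Z translation and pan combine. The pose dictionary keys, the axis convention (Y up, camera looking along -Z at pan 0, positive pan as a right-handed rotation about +Y), and the function names are assumptions for illustration, not SANA-WM's input format.

```python
import math


def forward_xz(pan_deg):
    """Horizontal viewing direction for a given pan under the assumed convention."""
    a = math.radians(pan_deg)
    return (-math.sin(a), -math.cos(a))


def orbit(radius, height, sweep_deg, num_poses):
    """Circle the vertical axis through the origin, panning to keep facing it."""
    poses = []
    for i in range(num_poses):
        a = math.radians(sweep_deg * i / (num_poses - 1))
        poses.append({
            "x": radius * math.sin(a),  # X and Z trace the circle
            "y": height,                # constant camera height
            "z": radius * math.cos(a),
            "pan": math.degrees(a),     # pan tracks the orbit angle
            "tilt": 0.0,
            "roll": 0.0,
        })
    return poses


if __name__ == "__main__":
    for pose in orbit(radius=2.0, height=1.5, sweep_deg=90.0, num_poses=5):
        print(pose)
```

Because the pan angle equals the orbit angle, the viewing direction from `forward_xz` always points from the camera back toward the center of the circle. Without the pan term the camera would slide sideways past the subject instead of circling it.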

Can I use SANA-WM for commercial projects?

Yes. Apache 2.0 allows commercial use, modification, and distribution of both the model and derivative works, including fine-tuned versions trained on your own data. The main obligations when you redistribute are to include a copy of the license, retain existing copyright and attribution notices, and state significant changes to modified files.

How does the 36x throughput claim hold up?

The 36x figure is relative to prior open-source baselines at the same resolution and duration. The gain comes from hybrid linear attention, which scales linearly with sequence length instead of quadratically. At 720p for 60 seconds, the frame count is large enough that quadratic scaling becomes the dominant cost, which is where linear attention's advantage is most pronounced.
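The scaling argument can be written out as a rough cost model. The symbols are introduced here for illustration and are not taken from the paper: N is the number of tokens in the sequence and d is the per-head feature dimension, with constant factors ignored.

```latex
\begin{align*}
C_{\text{softmax}}(N) &\propto N^{2} d, \\
C_{\text{linear}}(N)  &\propto N d^{2}, \\
\frac{C_{\text{softmax}}(N)}{C_{\text{linear}}(N)} &\propto \frac{N}{d}.
\end{align*}
```

Under this model, doubling the clip length doubles N, which quadruples the softmax attention cost but only doubles the linear attention cost. The 36x figure itself is an end-to-end measurement and cannot be derived from this ratio alone, particularly since SANA-WM keeps some softmax attention in its hybrid design.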

Is SANA-WM suitable for audio-synchronized video?

The current release focuses on visual generation and camera control. Audio synchronization is not a stated feature. For music-synchronized video generation, existing tools with dedicated audio conditioning remain the better option until SANA-WM adds audio input support.

When will ComfyUI support SANA-WM?

Community ports typically follow within 1-2 weeks of a model's inference code being published. The SANA-WM paper shows a modular dual-branch conditioning design that maps cleanly to ComfyUI's node graph model. Expect custom nodes shortly after the inference release drops.

What to Do Next

  • Star the NVlabs/Sana repository and watch for releases; inference code and weight downloads will appear there
  • Read Section 3 of the SANA-WM paper to understand the camera pose format before the inference release; knowing the input spec will speed up your first experiments
  • If you are on ComfyUI, check the ComfyUI-Manager registry weekly; community node ports will land there before any official guide
  • For immediate long-form video needs, Kling 3.0 remains the production-ready option while SANA-WM's inference tooling matures