Shengshu AI on May 28 released minWM, an Apache 2.0 framework that converts open-source text-to-video diffusion models into real-time interactive world models. The release ships with end-to-end recipes for two backbones, trained checkpoints on Hugging Face, and a technical report dated May 29.

Try it: spin up an interactive Wan2.1 world model

Clone the repo, create a Python 3.10 conda environment, run pip install -r requirements.txt, then install flash-attn separately. Set PYTHONPATH to include the HY15, Wan21, and shared directories, then run the included streaming inference script against the published Wan2.1-T2V-1.3B checkpoint. Camera-control inputs feed the autoregressive generator step by step. The 4-step Asymmetric DMD distillation replaces the 50-step bidirectional sampling the base model needs, which is what delivers interactive frame rates on a single GPU.
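To make the step-by-step generation concrete, here is a toy Python sketch of the control flow: each new chunk is denoised in a few steps, conditioned on the previous chunk and the current camera action. Everything in it (toy_denoiser, rollout, the list-of-floats "latent") is a hypothetical stand-in, not minWM's API, and it does not implement DMD distillation itself. It only shows how the per-chunk step budget drives the number of denoiser calls.

```python
import random
from typing import List, Sequence, Tuple

Latent = List[float]  # toy stand-in for one latent video chunk


def toy_denoiser(noisy: Latent, context: Latent, action: float,
                 step: int, num_steps: int) -> Latent:
    """Hypothetical stand-in for the DiT forward pass.

    Pulls the noisy sample toward a target built from the previous
    chunk (context) shifted by the camera action.
    """
    target = [c + action for c in context]
    blend = (step + 1) / num_steps  # reaches 1.0 on the final step
    return [(1 - blend) * n + blend * t for n, t in zip(noisy, target)]


def rollout(actions: Sequence[float], num_steps: int,
            dim: int = 4, seed: int = 0) -> Tuple[List[Latent], int]:
    """Generate one chunk per action, autoregressively.

    Returns the generated chunks and the total number of denoiser calls.
    """
    rng = random.Random(seed)
    context: Latent = [0.0] * dim  # assumed empty starting context
    chunks: List[Latent] = []
    calls = 0
    for action in actions:
        # Every chunk starts from fresh noise.
        sample = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for step in range(num_steps):
            sample = toy_denoiser(sample, context, action, step, num_steps)
            calls += 1
        chunks.append(sample)
        context = sample  # the next chunk conditions on this one
    return chunks, calls


if __name__ == "__main__":
    camera_actions = [1.0, 0.0, -1.0]  # hypothetical per-chunk camera signal
    _, few_step_calls = rollout(camera_actions, num_steps=4)
    _, many_step_calls = rollout(camera_actions, num_steps=50)
    print(f"4-step calls: {few_step_calls}, 50-step calls: {many_step_calls}")
```

The comparison is deliberately simplified: a bidirectional base model denoises a whole clip at once rather than chunk by chunk, so the call counts illustrate the step budget only, not measured speedups.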

Why this matters

Real-time interactive video has been gated behind closed labs: Genie 3, World Labs, and the Hunyuan-GameCraft line all ship as demos or hosted endpoints. minWM is the first open-source full-stack pipeline that turns the two most popular community video bases, Wan2.1 and HunyuanVideo 1.5, into controllable autoregressive generators you can fine-tune locally. For indie game devs, VR prototypers, and AI animators, that is the difference between renting capacity and owning the stack.

Key details

The framework documents data construction, autoregressive fine-tuning, and distribution-matching distillation in a single tutorial-style codebase. Supported backbones are Wan2.1-T2V-1.3B (cross-attention DiT) and HunyuanVideo 1.5 (MMDiT). Each stage exposes checkpoints so creators can stop, swap, or fork at any point. The repo also includes two Claude skills, integrate-new-backbone and debug-world-model, that guide adding a new DiT and diagnosing training failures. Weights are hosted under the MIN-Lab organization on Hugging Face. The work joins a recent wave of efficient open video research, including PARE's half-compute trick for Wan2.1-14B.

What to do next

If you train video models, mirror the repo and run the 1.3B recipe end to end before committing to the 8B path; the smaller backbone fits on a 24 GB card and the recipe surfaces the gotchas. If you build creator tools, watch for community forks that wrap the streaming inference in a ComfyUI node, which is where this will reach most artists first.
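A rough back-of-envelope estimate shows why the smaller backbone is the sensible first run. The sketch below assumes 2 bytes per parameter for half-precision weights and about 16 bytes per parameter for full fine-tuning with mixed-precision Adam (weights, gradients, and fp32 optimizer state). Those byte counts are a common rule of thumb, not figures from the minWM report. The estimate ignores activations, the text encoder, and the VAE, and the 8B parameter count is taken from the "8B path" wording above.

```python
# Rule-of-thumb byte counts (assumptions, not measurements from minWM).
BYTES_WEIGHTS_ONLY = 2    # bf16/fp16 weights for inference
BYTES_FULL_FINETUNE = 16  # 2 weights + 2 grads + 12 fp32 master/Adam state


def model_state_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory for model state in decimal gigabytes, excluding activations."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    backbones = {
        "1.3B": 1_300_000_000,
        "8B": 8_000_000_000,
    }
    for name, params in backbones.items():
        infer = model_state_gb(params, BYTES_WEIGHTS_ONLY)
        train = model_state_gb(params, BYTES_FULL_FINETUNE)
        print(f"{name}: weights ~{infer:.1f} GB, fine-tune state ~{train:.1f} GB")
```

Under these assumptions the 1.3B model's training state lands just under 24 GB before activations, while the 8B model's is several times larger than any single consumer card. Actual footprints depend on the recipe's precision, offloading, and batch settings.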