Researchers from Shanghai Jiao Tong University and Sand.ai released daVinci-MagiHuman, a 15-billion-parameter open-source model that generates realistic human video with synchronized speech. The model beats commercial alternatives in human evaluation and generates a 256p preview in two seconds on a single H100 GPU.

For the broader landscape, see our complete guide to AI video generation in 2026.

What Happened

MagiHuman is a single-stream transformer that generates both video and audio in one pass. The 15B-parameter model handles seven languages (English, Mandarin, Cantonese, Japanese, Korean, German, and French) and automatically coordinates facial expressions with speech timing. In human evaluation across 2,000 pairwise comparisons, MagiHuman won 80% of the time against Ovi 1.1 and 60.9% against LTX 2.3.

The architecture uses a "sandwich" design: the first four and last four layers handle modality-specific processing, while the middle 32 layers share parameters across video and audio. This unified approach eliminates the separate models and post-processing alignment steps that most talking-head generators require. A timestep-free denoising method lets the model infer its own progress through generation without external scheduling.
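
To make the layout concrete, here is a minimal PyTorch sketch of a sandwich transformer matching that description. The class, layer names, and dimensions are illustrative stand-ins, not MagiHuman's released code.

```python
import torch
import torch.nn as nn

class SandwichTransformer(nn.Module):
    """Illustrative sketch: modality-specific entry/exit layers around a
    shared trunk. Layer counts mirror the article's description; all
    other details are hypothetical."""

    def __init__(self, dim=1024, heads=16, shared_layers=32, edge_layers=4):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # First four layers: one stack per modality.
        self.video_in = nn.ModuleList(make_layer() for _ in range(edge_layers))
        self.audio_in = nn.ModuleList(make_layer() for _ in range(edge_layers))
        # Middle 32 layers: parameters shared across video and audio.
        self.shared = nn.ModuleList(make_layer() for _ in range(shared_layers))
        # Last four layers: modality-specific again.
        self.video_out = nn.ModuleList(make_layer() for _ in range(edge_layers))
        self.audio_out = nn.ModuleList(make_layer() for _ in range(edge_layers))

    def forward(self, video_tokens, audio_tokens):
        for layer in self.video_in:
            video_tokens = layer(video_tokens)
        for layer in self.audio_in:
            audio_tokens = layer(audio_tokens)
        # Concatenate along the sequence axis so the shared trunk attends
        # across both modalities in a single stream.
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        for layer in self.shared:
            x = layer(x)
        video_tokens, audio_tokens = x.split(
            [video_tokens.shape[1], audio_tokens.shape[1]], dim=1)
        for layer in self.video_out:
            video_tokens = layer(video_tokens)
        for layer in self.audio_out:
            audio_tokens = layer(audio_tokens)
        return video_tokens, audio_tokens
```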

Generation time scales with resolution: 2 seconds for a 256p preview, 8 seconds for 540p, and 38.4 seconds for full 1080p, all on a single H100. The team also released distilled variants that generate in 8 steps without classifier-free guidance, plus a turbo VAE decoder and MagiCompiler for additional speedups. Everything ships under Apache 2.0 with full code, weights, and documentation.
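
For readers curious what few-step, guidance-free sampling means in practice, the sketch below shows a generic 8-step denoising loop: one forward pass per step (classifier-free guidance would double that), with no external timestep schedule, in the spirit of the timestep-free claim. The `model` callable and update rule are hypothetical, not the released sampler.

```python
import torch

@torch.no_grad()
def sample_distilled(model, latents, num_steps=8):
    # One denoising pass per step: no classifier-free guidance, so the
    # batch is never doubled for an unconditional branch.
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        # Timestep-free: the model receives only the current latents and
        # infers its own progress rather than an external schedule value.
        velocity = model(latents)
        latents = latents - dt * velocity  # simple Euler-style update
    return latents
```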

Why It Matters for Creators

Open-source human video generation has lagged behind commercial services like HeyGen and Synthesia, which charge per minute and lock creators into their platforms. MagiHuman runs locally on a single GPU with no API costs and no usage limits. For creators producing talking-head content, training videos, or multilingual presentations, the economics shift from per-video pricing to a one-time hardware investment.

The synchronized speech generation is particularly relevant. Most open-source video models generate visuals only, requiring separate TTS and lip-sync pipelines. MagiHuman handles the entire pipeline in one model, reducing both complexity and the uncanny valley artifacts that come from stitching separate systems together.

What to Do Next

The model, code, and documentation are available on HuggingFace. Docker is the recommended installation path. You will need an H100 or equivalent GPU for reasonable generation speeds. The distilled variant is the fastest option for iterating on outputs before committing to full-resolution renders.
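
As a back-of-the-envelope illustration of that iterate-then-render workflow, the snippet below budgets GPU time using the single-H100 timings quoted above (assuming they are per clip). The helper itself is illustrative Python, not part of the release.

```python
# Per-clip generation times quoted in the article for a single H100.
RENDER_TIERS = {
    "preview": {"resolution": "256p",  "seconds": 2.0},
    "draft":   {"resolution": "540p",  "seconds": 8.0},
    "final":   {"resolution": "1080p", "seconds": 38.4},
}

def estimate_gpu_seconds(tier: str, num_clips: int = 1) -> float:
    """Rough wall-clock budget for generating clips at a given tier."""
    return RENDER_TIERS[tier]["seconds"] * num_clips

if __name__ == "__main__":
    # Thirty preview iterations plus one full-resolution render:
    total = estimate_gpu_seconds("preview", 30) + estimate_gpu_seconds("final")
    print(f"~{total:.0f}s of GPU time")  # prints "~98s"
```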



Subscribe for free to get the weekly digest every Tuesday.