Tencent has publicly released OmniWeaving, a unified video generation model built by the HunyuanVideo team that handles seven distinct tasks within a single architecture. The model weights, inference code, and a new benchmark are all available on HuggingFace and GitHub as of April 3, 2026.

What Happened

OmniWeaving combines a multimodal large language model (8.3B parameters) with a diffusion transformer (7B parameters) and a visual tokenizer to process interleaved text, image, and video inputs. The system supports text-to-video, image-to-video, key-frame interpolation, reference-driven generation, video editing, multi-image composition, and reasoning-augmented generation.
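A single-model, multi-task design like this implies one request interface that routes all seven capabilities. The sketch below is purely illustrative: the class, field, and task names are assumptions for exposition, not OmniWeaving's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical unified request covering the seven task types described
# in the article. Names are illustrative, not the real OmniWeaving interface.
TASKS = {
    "text2video", "image2video", "keyframe_interpolation",
    "reference_generation", "video_editing",
    "multi_image_composition", "reasoning_generation",
}

@dataclass
class GenerationRequest:
    task: str
    prompt: str
    images: List[str] = field(default_factory=list)  # image conditions, if any
    video: Optional[str] = None                      # input clip for editing/interpolation

    def __post_init__(self):
        # One validator because one model handles every task.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if self.task in {"video_editing", "keyframe_interpolation"} and self.video is None:
            raise ValueError(f"{self.task} requires an input video")

req = GenerationRequest(task="image2video",
                        prompt="slow pan across the skyline",
                        images=["frame0.png"])
print(req.task)  # image2video
```

The point of the sketch is the workflow simplification the article describes: one entry point replaces a chain of per-task tools.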

The reasoning mode is the standout feature. Before generating video, the language model produces intermediate reasoning steps to interpret ambiguous or complex prompts. Tencent calls this "thinking mode," and it lets the system disambiguate user intent before committing to a generation path.
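The two-stage flow can be sketched as a reasoning pass followed by a generation pass. All function bodies below are stand-ins for exposition; OmniWeaving's real interfaces are not shown here.

```python
# Minimal sketch of a two-stage "thinking mode" pipeline: the language model
# first expands an ambiguous prompt into explicit reasoning steps, and only
# the resolved plan conditions the video generator. The decomposition here
# is faked; a real MLLM would produce it.

def reason(prompt: str) -> list:
    """Stand-in for the MLLM's intermediate reasoning pass."""
    return [
        f"identify the subject in: {prompt}",
        "choose a camera motion consistent with the intent",
        "fix scene lighting and pacing before generation",
    ]

def generate_video(plan: list) -> str:
    """Stand-in for the diffusion transformer's generation pass."""
    return f"<video conditioned on {len(plan)} reasoning steps>"

def thinking_mode(prompt: str) -> str:
    plan = reason(prompt)        # stage 1: disambiguate user intent
    return generate_video(plan)  # stage 2: commit to a generation path

print(thinking_mode("a cat, but make it cinematic"))
# <video conditioned on 3 reasoning steps>
```

The design choice this illustrates: ambiguity is resolved in text space, where the language model is strong, before any expensive diffusion steps run.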

Why It Matters

Most video generation models specialize in one or two tasks. Text-to-video and image-to-video are common, but editing, interpolation, and compositional generation typically require separate tools or pipelines. OmniWeaving collapses all of these into a single model, which simplifies workflows for creators who currently chain multiple tools together.

The reasoning layer adds a capability that few video models offer. Where tools like Google Veo 3.1 and Wan2.7 (typically run through ComfyUI) excel at single-task generation, OmniWeaving can interpret complex multi-step instructions by reasoning through them first. The team reports state-of-the-art performance among open unified models on their IntelligentVBench benchmark.

Key Details

  • Architecture: MLLM (8.3B) + MMDiT (7B) + VAE, built on HunyuanVideo-1.5
  • Tasks: 7 unified capabilities from text-to-video to reasoning-augmented generation
  • Thinking mode: MLLM generates reasoning steps before video generation begins
  • Hidden States DeepStacking: Extracts multi-layer features for finer compositional control
  • Benchmark: IntelligentVBench, a new evaluation suite for unified video generation, released alongside the model
  • Resources: Project page with demos, arXiv paper, model weights on HuggingFace
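The "Hidden States DeepStacking" bullet describes conditioning generation on features drawn from several MLLM layers rather than the final layer alone. The sketch below shows the general idea with toy data; the layer selection and concatenation scheme are assumptions, not Tencent's published recipe.

```python
# Illustrative multi-layer feature stacking: concatenating hidden states
# from several depths gives the generator both low-level and high-level
# representations, which is the stated aim of finer compositional control.
# Layer indices and the stacking scheme here are assumptions.

def deep_stack(hidden_states, layers=(4, 8, 12)):
    """hidden_states: list indexed by layer; each entry is a feature vector."""
    stacked = []
    for layer in layers:
        stacked.extend(hidden_states[layer])  # concatenate selected layers
    return stacked

# Toy example: 13 layers, 2-dimensional features per layer.
states = [[float(l), float(l) + 0.5] for l in range(13)]
features = deep_stack(states)
print(len(features))  # 6  (3 layers x 2 dims)
```

In a real model the vectors would be high-dimensional tensors and the stack would feed the diffusion transformer's conditioning pathway; the toy shows only the shape of the operation.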

What to Do Next

Video creators working with local generation pipelines should evaluate OmniWeaving for multi-task workflows. The model requires multi-GPU inference (the repo recommends 8 GPUs via torchrun), so it is best suited for teams or cloud setups rather than single-GPU workstations. For lighter use cases, pairing a single-task model like Wan2.7 with Netflix VOID for object removal may be more practical. The IntelligentVBench benchmark is worth watching as a new standard for evaluating unified video generation systems.
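For teams evaluating the multi-GPU path, an 8-GPU torchrun launch generally looks like the fragment below. `torchrun` and `--nproc_per_node` are standard PyTorch CLI; the entry-point script name and its flags are hypothetical placeholders, so check the repo's README for the actual invocation.

```shell
# Hypothetical launch line for 8-GPU inference. Script name and flags
# are assumptions; only torchrun and --nproc_per_node are standard.
torchrun --nproc_per_node=8 \
    generate.py \
    --task text2video \
    --prompt "a sailboat crossing a storm front" \
    --output out.mp4
```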