Kiwi-Edit, a new open-source video editing framework from NUS Show Lab, launched on March 5, 2026. It combines text-instruction guidance with reference-image control to handle both global and local video edits at 720p. Built on Qwen2.5-VL-3B and Wan2.2-TI2V-5B, the MIT-licensed model scores 3.02 on OpenVE-Bench, the highest among open-source video editing methods.

What Happened

Researchers at the National University of Singapore's Show Lab released Kiwi-Edit, a unified framework for instruction-guided and reference-guided video editing. Unlike models that rely solely on text prompts, Kiwi-Edit lets users supply a reference image alongside natural language instructions to guide the visual output. The full release includes all datasets, model weights, training code, and a HuggingFace demo for immediate testing.

Why It Matters

Text-only video editing hits a wall when you need a specific visual style or object appearance that words cannot precisely describe. Kiwi-Edit addresses this by accepting reference images as a second input channel. Need a character wearing a specific outfit, or a scene in a particular art style? Provide a reference image and the model handles the translation. This dual-guidance approach is a meaningful step forward for creative workflows where precision matters more than convenience, and the MIT license means developers can integrate it into commercial products without restriction.

Key Details

  • Architecture: Qwen2.5-VL-3B vision-language model for semantic understanding paired with Wan2.2-TI2V-5B video diffusion transformer for generation
  • Training data: 477,000 high-quality quadruplets (source video, instruction, reference image, edited video)
  • Benchmark: 3.02 overall on OpenVE-Bench (evaluated by Gemini-2.5-Pro), highest among open-source methods across five editing categories
  • Global edits: Style transfers including cartoon, sketch, watercolor, and other visual aesthetics
  • Local edits: Object removal, object addition, object replacement, and background swaps
  • Resolution: 720p output
  • License: MIT (fully permissive for commercial use)
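The training data above is organized as quadruplets of source video, instruction, reference image, and edited video. As a minimal sketch of how such a sample might be represented in a data pipeline (all field names here are hypothetical illustrations, not taken from the Kiwi-Edit codebase):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditSample:
    """One training quadruplet (hypothetical schema, not Kiwi-Edit's actual format)."""
    source_video: str                # path to the unedited input clip
    instruction: str                 # natural-language edit instruction
    reference_image: Optional[str]   # optional style/appearance reference
    edited_video: str                # path to the ground-truth edited clip

    def is_reference_guided(self) -> bool:
        # Distinguishes reference-guided edits from instruction-only ones
        return self.reference_image is not None


# Example: a local edit (object replacement) guided by a reference image
sample = EditSample(
    source_video="clips/street.mp4",
    instruction="Replace the car with the one in the reference image",
    reference_image="refs/red_coupe.png",
    edited_video="clips/street_edited.mp4",
)
```

A global edit such as a watercolor style transfer could use the same schema with `reference_image=None`, relying on the instruction alone; the optional field is what lets one format cover both guidance modes.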

What to Do Next

The complete codebase is available on GitHub with setup instructions and a demo script, and the project page includes video examples of each editing category in action. The full research paper on arXiv covers the three-stage training strategy and ablation studies for anyone who wants a deeper look at the architecture and training pipeline. Video creators working with AI editing tools should evaluate Kiwi-Edit against their current pipeline, particularly for tasks where reference-image guidance could replace lengthy prompt engineering. The MIT license and available training code also make it a strong foundation for fine-tuning on domain-specific editing tasks.