On April 22, Inclusion AI, the research arm of Ant Group, released LLaDA2.0-Uni, a 16-billion-parameter Mixture-of-Experts (MoE) diffusion model that handles text-to-image generation, instruction-based image editing, and visual understanding in a single checkpoint. The model ships under the Apache 2.0 license with full weights on HuggingFace and code on GitHub.

For the broader landscape, see our complete guide to AI image generation in 2026.

What Happened

LLaDA2.0-Uni uses a discrete diffusion architecture rather than autoregressive token prediction. A SigLIP-VQ tokenizer converts images to discrete semantic tokens, while a block-wise mask prediction paradigm handles generation across modalities. Despite 16B total parameters, only roughly 1B activate per token during inference thanks to the MoE routing, keeping compute costs manageable.
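
To make the decoding style concrete, here is a toy sketch of the block-wise mask-prediction paradigm the model is built on: a block of discrete tokens starts fully masked and is refined over a fixed number of steps, with the most confident predictions committed first. This illustrates the general technique, not the released LLaDA2.0-Uni code; the codebook size, mask id, block size, and schedule below are placeholder assumptions.

```python
import math
import torch

VOCAB = 16384    # placeholder codebook size for image tokens (assumption)
MASK_ID = VOCAB  # reserved id outside the codebook marks a masked position
BLOCK = 64       # tokens decoded per block (assumption)

def dummy_predictor(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the diffusion LLM: per-position logits over the codebook."""
    return torch.randn(tokens.shape[0], VOCAB)

def decode_block(num_steps: int = 8) -> torch.Tensor:
    # Every position in the block starts out masked.
    tokens = torch.full((BLOCK,), MASK_ID)
    for step in range(num_steps):
        masked = tokens == MASK_ID
        remaining = int(masked.sum())
        if remaining == 0:
            break
        conf, pred = dummy_predictor(tokens).softmax(dim=-1).max(dim=-1)
        # Commit enough of the most confident predictions each step so the
        # whole block is unmasked by the final step.
        n_keep = math.ceil(remaining / (num_steps - step))
        candidates = torch.where(masked, conf, torch.full_like(conf, -1.0))
        keep = candidates.topk(n_keep).indices
        tokens[keep] = pred[keep]
    return tokens

print(decode_block(num_steps=8))  # 8 steps mirrors the fast distilled setting
```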

The model unifies three capabilities that typically require separate specialized models: generating images from text prompts, editing existing images via natural language instructions, and answering questions about visual content. A SPRINT acceleration system with KV-cache reuse enables inference in as few as 8 steps through a distilled decoder.
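
What a single checkpoint serving three tasks looks like in practice is sketched below. The wrapper class and method names are hypothetical; the real entry points are defined by the inference code on GitHub.

```python
from typing import Optional

class UnifiedModel:
    """Hypothetical wrapper around one LLaDA2.0-Uni checkpoint (illustrative only)."""

    def text_to_image(self, prompt: str): ...
    def edit_image(self, image_path: str, instruction: str): ...
    def answer(self, image_path: str, question: str): ...

def run_task(model: UnifiedModel, task: str, prompt: str,
             image_path: Optional[str] = None):
    # All three stages of a creative pipeline route to the same weights.
    if task == "generate":
        return model.text_to_image(prompt)
    if task == "edit":
        return model.edit_image(image_path, prompt)
    if task == "understand":
        return model.answer(image_path, prompt)
    raise ValueError(f"unknown task: {task}")
```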

Why It Matters

Most open-source image models force creators to juggle separate tools for generation, editing, and captioning. LLaDA2.0-Uni collapses that stack into one model, reducing pipeline complexity for anyone building creative workflows. The Apache 2.0 license permits commercial use, and the full model can run locally without API dependencies.

For creators already using node-based pipelines like ComfyUI, a single unified model simplifies node graphs considerably. Instead of routing between a generation model, an editing model, and a vision-language model, one checkpoint handles everything.

Key Details

  • Architecture: 16B MoE diffusion LLM, ~1B active parameters per token
  • Capabilities: Text-to-image, instruction-based editing, image understanding
  • Inference: 8-50 step generation via distilled decoder with SPRINT acceleration
  • VRAM: ~47 GB for full generation, ~35 GB for understanding only (a rough breakdown follows this list)
  • License: Apache 2.0 (fully open, commercial use allowed)
  • Benchmark: 73.18 average score, competitive with Qwen3-30B; 94.51 on HumanEval
  • Paper: arXiv 2604.20796
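
The VRAM figures are consistent with a simple estimate: 16B parameters held in bf16 take about 32 GB on their own, with the remainder going to the image tokenizer, activations, and caches. This is a back-of-envelope sketch assuming bf16 weights, not an official breakdown.

```python
params = 16e9          # total parameters; MoE experts all stay resident in memory
bytes_per_param = 2    # assuming bf16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone:          ~{weights_gb:.0f} GB")       # ~32 GB
print(f"generation headroom:    ~{47 - weights_gb:.0f} GB")  # tokenizer, caches, activations
print(f"understanding headroom: ~{35 - weights_gb:.0f} GB")
```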

What to Do Next

The model is available now on HuggingFace with inference code on GitHub. You will need a GPU with at least 47 GB VRAM for full generation capabilities, or 35 GB for understanding-only mode. The Nucleus-Image MoE diffusion model released earlier this month offers a lighter alternative if VRAM is a constraint.
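
A typical first step is loading the checkpoint through the transformers trust_remote_code path. The snippet below is a sketch under that assumption: the repo id is a placeholder, and the actual class and the generation, editing, and understanding entry points are defined by the repository's own code, so check the model card before running anything.

```python
import torch
from transformers import AutoModel, AutoTokenizer

REPO_ID = "your-org/LLaDA2.0-Uni"  # placeholder; take the real id from the model card

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    REPO_ID,
    torch_dtype=torch.bfloat16,  # bf16 weights; budget ~47 GB VRAM for generation
    device_map="auto",           # requires accelerate; shards across available GPUs
    trust_remote_code=True,      # the custom diffusion-MoE code lives in the repo
)
model.eval()
```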