NVIDIA Toronto AI Lab open-sourced PiD (Pixel Diffusion) on May 21, 2026. It is a plug-and-play decoder that replaces the VAE in Stable Diffusion, FLUX, and Z-Image pipelines and outputs 2K or 4K images in a single distilled decoding stage. The accompanying arXiv paper reports a 512-to-2048 decode in under one second on an RTX 5090 and in 210 ms on a GB200, roughly six times faster than cascaded latent-then-upscale workflows.
How to Try It This Week
If you already run a diffusion pipeline locally, PiD is the easiest decoder swap you will see this year. Clone the repo, install dependencies, and download a backbone-matched checkpoint from huggingface.co/nvidia/PiD. The repo ships two inference modes: from_ldm_* for text-to-image with FLUX, FLUX2, SD3, or Z-Image, and from_clean_* for upscaling existing images through DINOv2 or SigLIP encoders. Pick the 2K checkpoint for a drop-in VAE replacement at 4x upscale, or the 2k-to-4k variant for a second pass that pushes output to 3840 pixels on the long edge. A multi-GPU launcher (torchrun) is included for batch jobs. Peak VRAM stays around 13 GB, so a single RTX 4090 or 5090 is enough for single-GPU sampling.
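The download step can be scripted. The following is a minimal sketch, assuming the checkpoints are hosted as an ordinary Hugging Face model repo under the nvidia/PiD id given above and that the huggingface_hub package is installed. The function name fetch_pid_checkpoints, the target directory, and the patterns argument are hypothetical choices for this example; the actual file layout of the repo is not described here, so no file names are assumed.

```python
from pathlib import Path

# Repo id taken from the URL in the article: huggingface.co/nvidia/PiD
REPO_ID = "nvidia/PiD"


def fetch_pid_checkpoints(target_dir="checkpoints/pid", patterns=None):
    """Download files from the PiD Hugging Face repo into target_dir.

    patterns is an optional list of glob patterns used to restrict the
    download. The repo's real file names are not known here, so the
    default (None) fetches everything in the repo.
    """
    # Imported lazily so this module loads even without the dependency.
    from huggingface_hub import snapshot_download

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    local_path = snapshot_download(
        repo_id=REPO_ID,
        local_dir=str(target),
        allow_patterns=patterns,
    )
    return Path(local_path)


if __name__ == "__main__":
    # Requires network access and may download several gigabytes.
    print(fetch_pid_checkpoints())
```

Calling the function without patterns pulls the whole repo, which may be large. Once you have seen the file listing on the model page, passing a pattern for the backbone you use keeps the download to the one checkpoint you need.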
What Happened
The Toronto AI Lab team (Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, and Xuanchi Ren) submitted the PiD paper to arXiv on May 22, one day after the public GitHub release. The technique reformulates the latent-to-pixel stage as a conditional pixel-space diffusion model rather than a reconstruction-trained VAE decoder, then distills the process down to four denoising steps. Crucially, the same decoder upsamples while decoding, collapsing two stages of the usual high-resolution pipeline into one. The code is released under Apache 2.0, which permits commercial use.
Why It Matters for Creators
VAE artifacts (soft eyes, flat textures, color shifts) are among the most common complaints in 2026 image-generator comparisons, and most fixes today require chaining a separate ESRGAN, SUPIR, or Topaz pass after generation. PiD is designed to remove that second pass entirely. For ComfyUI builders running Z-Image Turbo workflows or FLUX-based portrait pipelines, swapping in the matching PiD checkpoint should yield sharper hair, finer fabric weave, and cleaner skin micro-texture without changing the prompt or sampler. The 8x Scale-RAE variant in particular means the latent from a single 512-pixel generation can fan out to a 4K hero asset suitable for print or large-format display.
Key Details
Backbones officially supported on day one are FLUX, FLUX2, SD3, and Z-Image, plus encoder-side adapters for DINOv2 and SigLIP. The resolution ladder has two rungs: the 2K checkpoint at 4x upscale (so a 512-pixel generation decodes to 2048 pixels) and the 2k-to-4k checkpoint at 8x, which uses the Scale-RAE variant. Quality benchmarks in the paper compare PiD against the standard FLUX VAE, the SD3 VAE, and cascaded latent-upscale baselines on FID and CLIP score, with PiD reported as matching or beating each baseline. The project page hosts sample grids and a side-by-side video.
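The resolution ladder is simple multiplication, and a few lines make it easy to check what a given base size turns into. This is a small sketch using only the 4x and 8x factors quoted above; the dictionary keys are labels chosen for this example, not checkpoint file names, and the megapixel figure assumes a square image.

```python
# Upscale factors as stated in the article. The keys are labels for this
# example only and are not meant to match checkpoint file names.
UPSCALE_FACTORS = {"2K": 4, "2k-to-4k": 8}


def decoded_edge(base_px: int, checkpoint: str) -> int:
    """Return the decoded edge length in pixels for a given base size."""
    return base_px * UPSCALE_FACTORS[checkpoint]


def megapixels(edge_px: int) -> float:
    """Pixel count of a square image with the given edge, in megapixels."""
    return edge_px * edge_px / 1_000_000


if __name__ == "__main__":
    for name in UPSCALE_FACTORS:
        edge = decoded_edge(512, name)
        print(f"{name}: 512 -> {edge} px per side, "
              f"{megapixels(edge):.1f} MP if square")
```

Note that 8x from a 512-pixel base gives 4096 pixels, slightly more than the 3840-pixel long edge quoted in the how-to section. Which figure applies to your output depends on the base resolution and checkpoint, so check the model card before sizing assets for print.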
What to Do Next
Test PiD against your current decoder on a fixed seed and prompt set before swapping any production workflow. If you are on ComfyUI, watch for a community node within the next week (the API mirrors the standard vae_decode call closely). If you publish stills to clients, the Apache 2.0 license clears commercial work without further negotiation, which is unusual for an NVIDIA research drop. Keep an eye on the GitHub issues tab for the first wave of community-trained checkpoints on alternative backbones.
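A fixed-seed comparison is easier to trust when the prompt and seed grid is generated once and reused for both decoders. The sketch below is one way to organize that, not part of the PiD repo. The decoder callables are hypothetical placeholders: each is assumed to take a prompt, a seed, and an output path, and to wrap whatever pipeline you already run.

```python
from dataclasses import dataclass
from itertools import product
from pathlib import Path


@dataclass(frozen=True)
class Case:
    prompt: str
    seed: int


def build_cases(prompts, seeds):
    """Every prompt paired with every seed, in a stable order."""
    return [Case(p, s) for p, s in product(prompts, seeds)]


def output_path(root, decoder_name, index, case):
    """One file per decoder and case, so images line up side by side."""
    return Path(root) / decoder_name / f"case{index:03d}_seed{case.seed}.png"


def run_comparison(cases, decoders, root="ab_test"):
    """Run each case through each decoder and return a manifest.

    decoders maps a label to a callable(prompt, seed, out_path).
    The callables are placeholders for your own pipelines; each one is
    responsible for creating directories and writing its image.
    """
    manifest = []
    for index, case in enumerate(cases):
        row = {"prompt": case.prompt, "seed": case.seed}
        for name, decode in decoders.items():
            path = output_path(root, name, index, case)
            decode(case.prompt, case.seed, path)
            row[name] = str(path)
        manifest.append(row)
    return manifest
```

Because both decoders receive identical prompts and seeds, any difference in the saved images comes from the decode stage alone. The returned manifest can be written to CSV or JSON and reviewed alongside the image pairs.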