Microsoft open-sourced Lens, a 3.8-billion-parameter text-to-image diffusion model, on May 25, 2026. The model generates images competitive with FLUX and Stable Diffusion 3 while requiring significantly less training compute. It is released under the MIT license, with weights available on HuggingFace.

What Happened

Lens is available on HuggingFace in three variants: the default 20-step model; a distilled Lens-Turbo that generates 1024x1024 images in 0.84 seconds using just 4 steps; and Lens-Base, a supervised checkpoint intended for benchmarking. The architecture pairs a 48-block Multimodal Diffusion Transformer (MMDiT) denoiser with a FLUX.2 semantic VAE and a GPT-OSS text encoder.

Training used the Lens-800M dataset: 800 million image-text pairs whose captions were generated by GPT-4.1 and average 109 words each. Microsoft describes this approach as "maximizing information density per training batch," and presents it as the reason a 3.8B model can compete with models twice its size.

Why It Matters

For image creators running local generation pipelines, the size-to-quality ratio is the headline. Where running FLUX locally at full quality requires substantial GPU memory, Lens targets competitive output at a fraction of the parameter count. At 0.84 seconds per image, Lens-Turbo shortens the iteration loop for concept batching and style exploration.
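As a rough back-of-the-envelope sketch of what that speed means for batching, the snippet below converts the quoted 0.84 seconds per image into batch times. It assumes the figure holds on your hardware and ignores model loading and file I/O; the function name and batch sizes are illustrative and not from the release.

```python
# Quoted Lens-Turbo latency for one 1024x1024 image at 4 steps.
SECONDS_PER_IMAGE = 0.84

def batch_seconds(num_images: int, seconds_per_image: float = SECONDS_PER_IMAGE) -> float:
    """Estimate wall-clock time for a sequential batch, ignoring load and I/O overhead."""
    return num_images * seconds_per_image

# Illustrative batch sizes (assumptions, not figures from Microsoft).
for n in (16, 64, 256):
    print(f"{n:>3} images: ~{batch_seconds(n):.1f} s")
```

Real throughput will depend on the GPU, the resolution, and whether images are generated one at a time or in parallel.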

The MIT license removes the licensing ambiguity attached to some competing open-weight releases, so creators can integrate Lens into pipelines and tools under clear terms. There is one caveat: Microsoft designates this release as research-only and does not clear it for production deployment.

Key Details

  • Parameters: 3.8B (MMDiT architecture with FLUX.2 VAE)
  • Default model: 20 steps, guidance scale 5.0, RL-tuned for visual quality
  • Lens-Turbo: 4 steps, 0.84 seconds per 1024x1024 image
  • Resolution range: Up to 1440x1440, aspect ratios 1:2 through 2:1
  • Training data: 800M image-text pairs with GPT-4.1 long-form captions
  • License: MIT (research use; not cleared for commercial deployment)
  • Paper: arXiv 2605.21573
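
The resolution limits listed above can be turned into a quick pre-flight check before queuing a generation. This is a minimal sketch, not part of the Lens release: the function name is_supported_size is hypothetical, and it assumes that "up to 1440x1440" means each side is at most 1440 pixels and that aspect ratio is width divided by height.

```python
MAX_SIDE = 1440    # assumed per-side limit, read from "up to 1440x1440"
MIN_RATIO = 1 / 2  # 1:2 (width:height)
MAX_RATIO = 2 / 1  # 2:1 (width:height)

def is_supported_size(width: int, height: int) -> bool:
    """Return True if the size fits the listed limits under the assumptions above."""
    if width <= 0 or height <= 0:
        return False
    if width > MAX_SIDE or height > MAX_SIDE:
        return False
    return MIN_RATIO <= width / height <= MAX_RATIO

print(is_supported_size(1024, 1024))  # square, inside both limits
print(is_supported_size(1440, 720))   # exactly 2:1
print(is_supported_size(1440, 600))   # wider than 2:1
```

The model may impose further constraints, such as side lengths divisible by a fixed factor; check the model card before relying on this.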

What to Do Next

The fastest entry point is the live demo on HuggingFace Spaces, which needs no local setup. For local use with diffusers:

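import torch
from diffusers import DiffusionPipeline

# Load the default 20-step checkpoint in bfloat16; torch_dtype is the long-standing diffusers keyword.
pipe = DiffusionPipeline.from_pretrained("microsoft/Lens", torch_dtype=torch.bfloat16, device_map="cuda")

# Generate one image and save it (assumes the pipeline returns PIL images by default).
image = pipe("A cinematic mountain lake at sunrise, soft mist, detailed reflections").images[0]
image.save("lens_sample.png")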

Switch to microsoft/Lens-Turbo with num_inference_steps=4 and guidance_scale=1.0 for the fast path. ComfyUI integration is in progress via an open pull request, with early testers reporting consistent output across portrait, landscape, and square formats. If you are new to ComfyUI workflows, the MooshieUI guide covers the basics.
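
To keep the per-variant settings in one place, a small lookup like the one below can feed the pipeline call. It is a sketch under assumptions: VARIANTS, generation_kwargs, and load_and_generate are hypothetical names, the repository IDs and step and guidance values are the ones quoted in this article, and the loading call follows the general diffusers pattern rather than a documented Lens API.

```python
# Settings quoted in the article; the dictionary layout itself is illustrative.
VARIANTS = {
    "default": {"repo": "microsoft/Lens", "num_inference_steps": 20, "guidance_scale": 5.0},
    "turbo": {"repo": "microsoft/Lens-Turbo", "num_inference_steps": 4, "guidance_scale": 1.0},
}

def generation_kwargs(variant: str) -> dict:
    """Return only the sampling arguments for the chosen variant."""
    settings = VARIANTS[variant]
    return {key: value for key, value in settings.items() if key != "repo"}

def load_and_generate(prompt: str, variant: str = "turbo"):
    """Load the chosen checkpoint and generate one image (needs a CUDA GPU)."""
    # Imported lazily so the settings helpers work without torch or diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        VARIANTS[variant]["repo"], torch_dtype=torch.bfloat16
    ).to("cuda")
    return pipe(prompt, **generation_kwargs(variant)).images[0]
```

Calling load_and_generate with variant="default" would switch back to the 20-step, guidance 5.0 settings without touching the rest of the script.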