On May 10, 2026, HiDream-ai released the technical report and interactive demos for HiDream-O1-Image, an open-source 8-billion-parameter image model that beats FLUX.2 Dev across every published benchmark despite being 7x smaller. The model runs at native 2,048 x 2,048 resolution, carries an MIT license, and is available today via HuggingFace, a ComfyUI node, and a fal.ai API for creators without the GPU headroom to run it locally.

What Happened

HiDream-ai published the HiDream-O1-Image weights on GitHub and HuggingFace on May 8, followed by the full technical report and live demos on May 10. Two checkpoints shipped at launch: the full model (50 inference steps, CFG 5.0) and a distilled Dev variant (HiDream-O1-Image-Dev, 28 steps, CFG 0.0). Both run at 8B parameters. A larger Pro variant is referenced in benchmark tables but was not part of the open-source release.

The model was built by HiDream-ai, the same team behind HiDream-I1 and the earlier HiDream image generation series. The O1 suffix signals a shift toward a reasoning-first approach: a built-in prompt agent reasons through layout, subject attributes, and physical logic before generating.

The Architecture Shift: No VAE, No Separate Text Encoder

Most diffusion-based image models follow the same structural pattern: a variational autoencoder compresses pixels into a latent space, a separate text encoder maps prompts to embeddings, and the diffusion model operates in that compressed latent space before a decoder reconstructs the image. HiDream-O1-Image removes all of that.

VAE and Encoder eliminated, replaced by unified 8B block

Instead, it uses a Pixel-level Unified Transformer (UiT) that processes pixel patches, text tokens, and task-condition tokens in a single shared token space. There is no VAE, no external text encoder, and no latent-space bottleneck. The model generates directly in raw pixel space from the first forward pass.

For creators this has two practical consequences. First, deployment is simpler: fewer components means fewer version-mismatch issues and less disk space. Second, the model has native understanding of the full image resolution from token one, which appears to explain its strong performance on long-text rendering and precise spatial composition.

Benchmark Results: 8B vs 56B

The technical report compares HiDream-O1-Image against FLUX.2 Dev (56B parameters) and Qwen-Image on four standard benchmarks:

Small 8B cube with trophy next to large 56B cube
Benchmark HiDream-O1 8B FLUX.2 Dev 56B Qwen-Image
GenEval 0.90 0.87 0.87
DPG-Bench 89.83 87.57 88.32
HPSv3 10.37 9.28 N/A
CVTG-2K 0.9128 0.8926 N/A

It also ranked 8th on the Artificial Analysis Text-to-Image Arena leaderboard as of May 5, 2026, which is a human preference ranking based on side-by-side comparisons, not just automated metrics.

For context on what these benchmarks measure: GenEval tests compositional accuracy (can the model generate exactly what the prompt specifies: two objects, left-right positioning, color attribution); DPG-Bench tests dense prompt adherence; HPSv3 tests overall aesthetic quality against human rater preference. Winning all three at 8B is a meaningful result.

Six Things It Can Do

HiDream-O1-Image supports six distinct generation modes out of the box, which is broader than most open-source releases at this size:

  1. Text-to-image: standard prompt-to-image at up to 2K resolution
  2. Instruction-based editing: edit an existing image using a natural-language instruction (no mask required)
  3. Subject-driven personalization: provide multiple reference images of a subject, generate consistent variations
  4. Storyboard generation: create sequences of images with character and scene consistency
  5. Long-text rendering: accurate rendering of long strings of text in both English and Chinese (LongText EN: 0.979, LongText ZH: 0.978)
  6. Native 2K resolution synthesis: generates at 2048x2048 without upscaling

The editing and personalization modes make this more useful for production workflows than a standard text-to-image release. Most open-source models at this scale require separate ControlNet or IP-Adapter extensions for those capabilities.

The Prompt Agent

HiDream-O1-Image ships with a prompt_agent.py that runs before image generation. The agent uses an LLM backend, either a local Gemma-4-31B-it or any OpenAI-compatible API endpoint, to reason through four dimensions of the user's prompt: spatial layout, subject attributes, physical logic, and text rendering requirements. It then rewrites the raw prompt into a structured, optimized version before passing it to the image model.

This is the "O1" in the name. Rather than treating prompt quality as the user's responsibility, the model includes a reasoning step that catches ambiguity and fills gaps before generation begins. Early community testing suggests it particularly helps with complex multi-subject compositions where a naive prompt tends to produce attribute binding errors.

Hardware Requirements and How to Access It

Running HiDream-O1-Image locally requires approximately 35GB of VRAM. On an RTX 4090 with FP8 quantization, the Dev variant generates in roughly 20 seconds per image. That puts local inference within reach of high-end consumer workstations but out of range for most mid-range setups.

If you don't have the hardware, there are two practical alternatives:

  • fal.ai API: The HiDream-O1-Image endpoint on fal.ai lets you call the model via API. Pay per generation, no GPU required. Good for prototyping or low-volume production use.
  • HuggingFace Spaces demo: The official Spaces demo runs the full model and the Dev variant for free with a queue. Useful for testing before committing to a local setup.

ComfyUI Integration

A community-built ComfyUI node for HiDream-O1-Image is already available on GitHub. If you're running ComfyUI workflows for image generation, you can add HiDream-O1 as a model node alongside FLUX or SD3 without restructuring your pipeline.

ComfyUI node graph workflow card

For deeper context on building production ComfyUI setups, the Best ComfyUI Workflows 2026 guide covers node structure, batching, and model switching patterns that apply directly here.

How This Compares to FLUX and Other Open-Source Models

The most direct comparison is FLUX.2 Dev, which has been the dominant open-source image model since Black Forest Labs released it. FLUX.2 Dev's 56B parameter count and strong prompt adherence made it the default for serious production workflows. HiDream-O1-Image's benchmark wins at 8B challenge that assumption: smaller, faster, and cheaper to run at scale.

The architecture difference (pixel-space vs latent-space) also matters for specific use cases. FLUX's VAE compression introduces a small fidelity ceiling on fine detail and text rendering that HiDream's direct pixel generation avoids. The long-text rendering scores (0.979 English, 0.978 Chinese vs FLUX's unlisted performance on this metric) reflect that advantage directly.

What to Do Next

Three paths depending on your setup:

  1. Try it now (no GPU required): Test the model at the HiDream-O1-Image-Dev Spaces demo. The Dev variant runs faster with no quality CFG, good for quick evaluation.
  2. Run it locally: Clone the GitHub repo (linked in "What Happened" above) and follow the inference guide. If you have an RTX 4090 or equivalent, start with the Dev variant at FP8 quantization.
  3. Add it to ComfyUI: Install the community ComfyUI node and drop it into an existing workflow. The architecture is different from FLUX, so you'll want to start with a fresh pipeline rather than adapting a FLUX graph directly.

Frequently Asked Questions

What license is HiDream-O1-Image released under?

MIT License. Commercial use is permitted with no royalty requirements and no usage restrictions beyond standard MIT terms.

How does HiDream-O1-Image differ from the earlier HiDream-I1 model?

HiDream-I1 used a more conventional latent-space diffusion architecture. HiDream-O1-Image is a complete architectural rebuild: pixel-space unified transformer, no VAE, no external text encoder, and a built-in reasoning-based prompt agent. The O1 designation reflects the shift to a reasoning-first approach.

Does it require flash-attn?

Flash-attn is recommended for performance but not required. The GitHub repo includes a fallback in pipeline.py that runs without it, at a cost of slower inference.

Can I use the fal.ai API for commercial production?

Yes, fal.ai's HiDream-O1-Image endpoint supports commercial use. The model's MIT license permits commercial generation, and fal.ai's terms allow production API usage. Check fal.ai's current pricing for volume rates.

Does the prompt agent require an external API key?

The prompt agent supports two backends: a locally-hosted Gemma-4-31B-it model (no external API needed) or any OpenAI-compatible API endpoint. If you're running a local Ollama or vLLM server, you can point the agent there.

Is there a video generation companion model?

No video generation model was announced alongside HiDream-O1-Image. The release focuses entirely on image synthesis, editing, and personalization. HiDream-ai has not announced video generation work publicly as of May 10, 2026.