On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open model that generates text up to 4x faster than equivalent autoregressive models by denoising whole blocks of tokens in parallel instead of writing one word at a time. It ships under an Apache 2.0 license, runs on a single consumer RTX GPU, and is downloadable today.

For creators who run local assistants, draft scripts, or chain agent steps, the headline is speed without a cloud bill. DiffusionGemma is built on the same Gemma 4 family that already powers on-device creative tools, but it swaps the generation method underneath. This is the first time Google has put a diffusion text model in the open-weights tier, and it lands the same week several labs pushed 1,000-token-per-second milestones.

What Happened

DiffusionGemma is a 26-billion-parameter mixture-of-experts model with 3.8 billion parameters active during inference. According to Google's announcement, it generates 256 tokens per forward pass using bi-directional attention, reaching more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on a GeForce RTX 5090. Quantized, it fits in roughly 18GB of VRAM, which puts it within reach of a high-end desktop card.
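As a rough sanity check on the 18GB figure, the sketch below estimates the memory needed for the weights alone at a few precisions. The bit widths are assumptions for illustration: the announcement as summarized here does not say which quantization format produces the roughly 18GB footprint, and the estimate ignores activations, caches, and runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # from the article: 26 billion total parameters

# Assumed precisions for illustration; not confirmed by the source.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
# 16-bit weights: 52.0 GB
#  8-bit weights: 26.0 GB
#  4-bit weights: 13.0 GB
```

Under these assumptions, a 4-bit format leaves about 5GB of the quoted 18GB for everything other than weights, which is at least consistent with the article's number. In a typical mixture-of-experts setup all experts stay resident in memory, so the footprint follows the 26 billion total parameters, while per-token compute follows the 3.8 billion active ones.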

NVIDIA published two companion guides the same day: an RTX AI Garage post covering local use on GeForce hardware and a developer guide with throughput benchmarks. The weights are available through Hugging Face. Serving support covers vLLM, MLX, and Hugging Face Transformers, with llama.cpp support coming soon, and there is a hosted endpoint on NVIDIA's build platform.

How Diffusion Text Generation Differs From Autoregressive Models

A standard language model is autoregressive: it predicts one token, appends it, then predicts the next, conditioned on everything so far. That sequential dependency is why long outputs feel like they arrive in a slow stream. A diffusion model instead starts from a noisy block of placeholder tokens and refines the entire block over several steps, the same way image diffusion sculpts a picture out of static.
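To make the difference concrete, here is a minimal sketch that counts forward passes under each scheme. The block size of 256 comes from the figures quoted above; the number of denoising steps per block is a made-up value for illustration, since the article does not state how many steps DiffusionGemma uses.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int = 256,
                           steps_per_block: int = 8) -> int:
    """Each block is refined for a fixed number of denoising steps.

    steps_per_block=8 is a hypothetical value, not a published setting.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


n = 1024
print(autoregressive_passes(n))   # 1024
print(block_diffusion_passes(n))  # 4 blocks * 8 steps = 32
```

The pass counts are not a speed ratio. Each diffusion pass processes a whole block and costs more than a single-token decode step, so the real-world gain is smaller than the raw count suggests.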

The practical payoff is parallelism. Because the model writes many tokens at once and attends to context in both directions, it is well suited to code infilling and inline editing, where the surrounding text on both sides matters. Google previewed this approach last year with the cloud-only Gemini Diffusion research demo. DiffusionGemma is the production-minded, open-weights descendant of that work, which means the technique is no longer locked behind an experimental waitlist.
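The point about attending in both directions can be illustrated with attention masks. The sketch below is a generic illustration in plain Python, not DiffusionGemma's actual implementation; the token strings and the `<MASK>` placeholder are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Autoregressive decoding only lets a position see itself and earlier ones.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True for _ in range(n)] for _ in range(n)]


# Hypothetical infilling task: the middle slot is the gap to fill.
tokens = ["def add(a, b):", "<MASK>", "return total"]
gap = tokens.index("<MASK>")  # position 1
suffix = len(tokens) - 1      # position 2

print(causal_mask(len(tokens))[gap][suffix])         # False
print(bidirectional_mask(len(tokens))[gap][suffix])  # True
```

With the causal mask, the gap cannot see `return total`, so the model would have to guess the body without knowing which variable is returned. With the bidirectional mask, the text after the gap is visible while the gap is being filled.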

Figure: DiffusionGemma denoises blocks of tokens in parallel instead of one word at a time.

DiffusionGemma vs Fast Autoregressive Models

Speed alone is not new. The race to 1,000 tokens per second has been running for months, and an autoregressive model like Xiaomi's MiMo-v2.5-Pro-UltraSpeed already hit that mark. What sets DiffusionGemma apart is the combination: open weights, local hardware, bi-directional editing, and diffusion-style parallel decoding in one package.

How DiffusionGemma compares to other fast text models

| Model | Method | Reported speed | Open weights | Runs locally |
|---|---|---|---|---|
| DiffusionGemma 26B | Diffusion (parallel) | 1,000+ tok/s (H100), 700+ tok/s (RTX 5090) | Yes (Apache 2.0) | Yes (~18GB VRAM) |
| Gemini Diffusion | Diffusion (parallel) | ~1,400 tok/s | No | No (cloud demo) |
| Xiaomi MiMo-v2.5-Pro-UltraSpeed | Autoregressive | ~1,000 tok/s | Yes | Yes |
| Standard Gemma 4 | Autoregressive | Baseline | Yes (Apache 2.0) | Yes |

The asterisk worth knowing: diffusion models trade some control for speed. Quality at very low step counts can wobble on long reasoning chains, and tooling for samplers and step schedules is younger than the mature autoregressive stack. For chat, drafting, and agent loops the speed wins; for delicate multi-step reasoning, test before you swap a production model.

Figure: DiffusionGemma generates text up to 4x faster than equivalent autoregressive models.

How to Run DiffusionGemma Locally

You can have it generating on a single GPU in a few steps:

1. Check your hardware. The quantized model needs about 18GB of VRAM, so an RTX 4090, RTX 5090, or RTX PRO 6000 workstation card will run it. NVIDIA's DGX Spark and DGX Station are also supported.

2. Pull the weights. Download the model from Hugging Face and accept the Gemma license terms. The instruction-tuned checkpoint is the one to grab for chat and agent use.

3. Serve it. Load it through Hugging Face Transformers or vLLM for throughput. If you only want to test prompts first, the hosted endpoint at build.nvidia.com requires no local setup.

4. Fine-tune if needed. For a custom voice or a domain dataset, Unsloth and NVIDIA NeMo both support DiffusionGemma fine-tuning, so you can adapt it to a brand style without renting cloud GPUs by the hour.
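The hardware check in step 1 can be scripted. The sketch below parses the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one total-memory value in MiB per GPU. The 18 GiB threshold is an assumption based on the article's "about 18GB" figure, and the helper names are hypothetical.

```python
import subprocess

# Assumption: treat the article's "about 18GB" as 18 GiB, expressed in MiB.
REQUIRED_MIB = 18 * 1024


def query_nvidia_smi() -> str:
    """Return total memory per GPU in MiB, one value per line."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_vram_mib(nvidia_smi_output: str) -> list:
    """Turn the command output into a list of integers, skipping blank lines."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_with_enough_vram(nvidia_smi_output: str,
                          required_mib: int = REQUIRED_MIB) -> list:
    """Return the indices of GPUs whose total memory meets the threshold."""
    return [index
            for index, mib in enumerate(parse_vram_mib(nvidia_smi_output))
            if mib >= required_mib]


# Sample output for a machine with a 24GB card and an 8GB card.
sample = "24564\n8188\n"
print(gpus_with_enough_vram(sample))  # [0]
```

On a real machine, pass `query_nvidia_smi()` instead of the sample string. Total memory is an upper bound: other applications holding VRAM reduce what is actually free, so a card that barely clears the threshold may still fail to load the model.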

Figure: DiffusionGemma runs locally on a personal device.

Why It Matters for Creators

Three creator workflows get cheaper and faster immediately. Local writing assistants stop waiting on a streaming cursor, so brainstorming and rewriting feel interactive. Agent pipelines that fire many short generations, such as tagging a media library or expanding a shot list, finish in a fraction of the wall-clock time because each step returns a block at once. And privacy-sensitive work, like drafting scripts or client copy you do not want sent to a third-party API, stays on your own machine.
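A back-of-the-envelope estimate shows what this means for an agent loop. The workload below (500 tagging calls of 120 output tokens each) is hypothetical, and the baseline of 175 tokens per second is simply the RTX 5090 figure divided by four, derived from the "up to 4x" claim rather than measured.

```python
def batch_seconds(num_calls: int, tokens_per_call: int,
                  tokens_per_second: float) -> float:
    """Generation time for a batch of calls run back to back.

    Ignores prompt processing, model loading, and per-call overhead.
    """
    return num_calls * tokens_per_call / tokens_per_second


CALLS = 500             # hypothetical: clips to tag in a media library
TOKENS_PER_CALL = 120   # hypothetical: output length per tag request

fast = batch_seconds(CALLS, TOKENS_PER_CALL, 700)  # article's RTX 5090 figure
slow = batch_seconds(CALLS, TOKENS_PER_CALL, 175)  # assumed baseline: 700 / 4

print(f"diffusion: {fast:.0f} s")  # diffusion: 86 s
print(f"baseline:  {slow:.0f} s")  # baseline:  343 s
```

The absolute numbers matter less than the shape: for many short generations, total time scales directly with decode throughput, so a job that takes several minutes at the assumed baseline finishes in about a minute and a half.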

Because the license is Apache 2.0, there is no per-token meter and no usage cap. That is the same economic shift that made open image models a staple of creator pipelines, now arriving for fast text. The open weights also mean ComfyUI nodes, desktop apps, and editing plugins can embed it directly rather than calling out to a paid endpoint.

The timing matters too. DiffusionGemma lands alongside a wave of on-device creative releases built on the Gemma 4 base, and a shared architecture makes it easier to run several specialized models from one tooling stack. A studio that already keeps a local image or audio model warm on an RTX card can add fast local text without learning a new runtime, which lowers the bar for fully offline creative pipelines that never touch a metered API.

Frequently Asked Questions

What is DiffusionGemma?

DiffusionGemma is an open-weights text generation model from Google DeepMind, released June 10, 2026. It uses a diffusion process to write blocks of tokens in parallel, reaching up to 4x the speed of comparable autoregressive models while running on a single consumer GPU.

Is DiffusionGemma free to use?

Yes. It is published under an Apache 2.0 license, so you can download the weights, run them locally, fine-tune the model, and use outputs commercially without per-token fees, subject to the Gemma usage terms.

What hardware do I need to run DiffusionGemma?

The quantized model needs roughly 18GB of VRAM. A GeForce RTX 4090 or RTX 5090, an RTX PRO 6000 workstation card, or NVIDIA's DGX Spark and DGX Station will all run it. NVIDIA reports more than 700 tokens per second on an RTX 5090.

How is it different from a normal Gemma 4 model?

Standard Gemma 4 is autoregressive and generates one token at a time. DiffusionGemma is built on the same Gemma 4 architecture but generates 256 tokens per forward pass through diffusion-style refinement with bi-directional attention, which makes it faster and better at infilling and inline editing.

How does DiffusionGemma compare to Gemini Diffusion?

Both use diffusion for text, but Gemini Diffusion is a closed, cloud-only research demo. DiffusionGemma is the open-weights version you can download and run on your own hardware, which is the key difference for creators who want local, offline control.

Should I replace my current text model with it?

For interactive chat, drafting, and high-volume agent loops, the speed and zero-cost local execution are compelling. For long, delicate multi-step reasoning, test it against your current model first, since diffusion text tooling is newer and quality can vary at very low step counts.