NVIDIA Nemotron Diffusion: 3x Faster LLM Decoding

NVIDIA published the Nemotron-Labs-Diffusion family on Hugging Face, a set of open-weights large language models that switch at inference time between autoregressive decoding, diffusion-based parallel decoding, and a linear self-speculation hybrid that uses the diffusion path to draft tokens and the autoregressive path to verify them. The release spans 3B, 8B, and 14B variants in Base and Instruct, plus an 8B vision-language model.

What This Enables

The headline win is throughput. On an NVIDIA DGX Spark workstation running 4-bit weight quantization, the 8B model reaches 112 tokens per second in diffusion mode versus 41.8 tokens per second in pure autoregressive mode, a 2.7x speedup. On server-class GB200 hardware the gap widens to 850 tokens per second versus 253 (about 3.4x), and custom CUDA kernels push throughput to 1015 tokens per second (about 4x). For a creator running a local LLM through vLLM or SGLang, the practical effect is that long-response workloads (code generation, batch summarization, agent loops) finish in roughly a third of the time without changing the model class or quality. Workflow: swap your current local 8B model for Nemotron-Labs-Diffusion-8B, run inference in linear-spec mode, and measure tokens per second on your highest-volume prompt.
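The speedup figures follow directly from the quoted throughput numbers. The short sketch below redoes that arithmetic; the dictionary layout and names are illustrative only, and the values are the ones reported above, not new measurements.

```python
# Throughput figures quoted in this article, in tokens per second.
# The dictionary keys are illustrative labels, not official mode names.
FIGURES = {
    "DGX Spark (4-bit)": {"autoregressive": 41.8, "diffusion": 112.0},
    "GB200": {
        "autoregressive": 253.0,
        "diffusion": 850.0,
        "custom_kernels": 1015.0,
    },
}


def speedup(fast_tps: float, baseline_tps: float) -> float:
    """Return how many times faster `fast_tps` is than `baseline_tps`."""
    return fast_tps / baseline_tps


def time_fraction(fast_tps: float, baseline_tps: float) -> float:
    """Fraction of the baseline wall-clock time needed for the same output."""
    return baseline_tps / fast_tps


if __name__ == "__main__":
    for platform, modes in FIGURES.items():
        base = modes["autoregressive"]
        for mode, tps in modes.items():
            if mode == "autoregressive":
                continue
            print(
                f"{platform} {mode}: {speedup(tps, base):.2f}x, "
                f"{time_fraction(tps, base):.0%} of baseline time"
            )
```

A 2.7x speedup corresponds to a little over a third of the original generation time, which is where the "roughly a third" estimate comes from.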

Why It Matters

Diffusion-based LLMs have been a research thread for two years and a product category for less than one. The pattern treats text generation as a denoising step over multiple tokens at once rather than one token at a time, trading a bit of accuracy for a large speed gain. Nemotron-Labs-Diffusion is the first time NVIDIA has shipped an open-weights diffusion LLM family at multiple parameter sizes with the same model supporting all three decoding modes. That matters because the deployment story for a local creator stack is usually "pick one inference runtime per model." A tri-mode model lets the same checkpoint serve interactive chat (autoregressive), bulk generation (diffusion), and agent loops (self-speculation) without swapping artifacts. It also lands a week after the DeepSeek V4 Flash drop, putting open-weights efficiency back in the news cycle alongside frontier-tier models like Qwen 3.7.
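To make the draft-and-verify idea behind self-speculation concrete, here is a toy sketch of one greedy speculative round. It is not NVIDIA's implementation: `toy_draft` and `toy_verify_next` are hypothetical stand-ins for the diffusion and autoregressive paths, and a real system would verify all drafted tokens in a single forward pass rather than one call per token.

```python
from typing import Callable, List

Token = int


def speculative_step(
    prefix: List[Token],
    draft_fn: Callable[[List[Token], int], List[Token]],
    verify_next_fn: Callable[[List[Token]], Token],
    k: int,
) -> List[Token]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    draft = draft_fn(prefix, k)  # k tokens proposed at once by the drafter
    accepted: List[Token] = []
    for tok in draft:
        expected = verify_next_fn(prefix + accepted)
        if tok == expected:
            accepted.append(tok)  # drafter agreed with the verifier
        else:
            accepted.append(expected)  # keep the verifier's token and stop
            break
    return accepted


def generate(
    prompt: List[Token],
    draft_fn: Callable[[List[Token], int], List[Token]],
    verify_next_fn: Callable[[List[Token]], Token],
    n_tokens: int,
    k: int = 4,
) -> List[Token]:
    """Repeat speculative rounds until n_tokens new tokens are produced."""
    out: List[Token] = []
    while len(out) < n_tokens:
        out += speculative_step(prompt + out, draft_fn, verify_next_fn, k)
    return out[:n_tokens]


# Hypothetical toy models: the verifier counts upward modulo 10.
def toy_verify_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


# The toy drafter matches the verifier except for a wrong third token.
def toy_draft(seq: List[Token], k: int) -> List[Token]:
    out = [(seq[-1] + i + 1) % 10 for i in range(k)]
    if k >= 3:
        out[2] = (out[2] + 5) % 10
    return out


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_verify_next, 6))
```

Because every accepted token is one the verifier would have produced itself, the output of this toy loop matches plain greedy autoregressive decoding; the gain is that each round can yield several tokens.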

Key Details

Sizes: 3B, 8B, 14B in Base and Instruct, plus 8B VLM
License: NVIDIA Nemotron Open Model License (commercial use under standard terms)
Deployment: vLLM, SGLang, or Docker, with an optional LoRA-enhanced drafter
Speedup: 2.7x on DGX Spark, about 3.4x on GB200, up to 4x with custom CUDA kernels
Higher quality: 14B Base for stronger output; VLM-8B for vision-language workflows

What to Do Next

Creators running a local LLM stack on a GPU with 24 GB or more of memory should benchmark the 8B Base variant against their current model on a representative workload. Teams already on vLLM can load the model with trust_remote_code enabled and switch to linear-spec mode without changing the API surface. Hosted-model users get nothing direct yet, but downstream inference providers will likely add Nemotron-Labs-Diffusion within 30 days given the throughput economics.
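A benchmark along these lines can be as simple as timing generation over a handful of representative prompts. The sketch below is runtime-agnostic: `generate_fn` is a hypothetical callable you supply that runs one prompt through your stack (vLLM, SGLang, or anything else) and returns the number of tokens it generated. Nothing here is tied to a specific runtime API.

```python
import time
from statistics import median
from typing import Callable, Sequence


def measure_tps(
    generate_fn: Callable[[str], int],
    prompts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median tokens per second across prompts.

    generate_fn is an assumed wrapper around your inference stack: it takes
    a prompt and returns the count of generated tokens. clock is injectable
    so the function can be tested without real timing.
    """
    rates = []
    for prompt in prompts:
        start = clock()
        n_tokens = generate_fn(prompt)
        elapsed = clock() - start
        if elapsed > 0 and n_tokens > 0:
            rates.append(n_tokens / elapsed)
    if not rates:
        raise ValueError("no prompt produced a measurable result")
    return median(rates)


def compare(baseline_tps: float, candidate_tps: float) -> float:
    """Speedup of the candidate model or mode over the baseline."""
    return candidate_tps / baseline_tps
```

Run `measure_tps` once with your current model and once with the new checkpoint on the same prompts, then pass both results to `compare`. Using the median keeps one unusually slow or fast prompt from skewing the result.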
