NVIDIA has published the Nemotron-Labs-Diffusion family on Hugging Face, a set of open-weights large language models that can switch at inference time between three decoding modes: autoregressive decoding, diffusion-based parallel decoding, and a linear self-speculation hybrid (linear-spec mode) that uses the diffusion path to draft tokens and the autoregressive path to verify them. The release spans 3B, 8B, and 14B variants in Base and Instruct versions, plus an 8B vision-language model.
What This Enables
The headline win is throughput. On an NVIDIA DGX Spark workstation running 4-bit weight quantization, the 8B model reaches 112 tokens per second in diffusion mode versus 41.8 tokens per second in pure autoregressive mode, a 2.7x speedup. On server-class GB200 hardware the gap widens to 850 tokens per second versus 253, and custom CUDA kernels push throughput to 1015 tokens per second. For a creator running a local LLM through vLLM or SGLang, the practical effect is that long-response prompts (code generation, batch summarization, agent loops) finish in roughly a third of the time on the same checkpoint, although diffusion decoding can trade a little accuracy for that speed. Workflow: swap your current local 8B model for Nemotron-Labs-Diffusion-8B, run inference in diffusion or linear-spec mode, and measure tokens per second on your highest-volume prompt.
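The speedup figures follow directly from the quoted throughput numbers. The short Python sketch below recomputes them and converts each ratio into the fraction of baseline wall-clock time it implies. The function names are illustrative, and the time estimate assumes decode throughput dominates total latency (prefill and fixed overheads are ignored).

```python
# Throughput figures quoted in the article, in tokens per second:
# (faster mode, autoregressive baseline).
FIGURES = {
    "DGX Spark, diffusion": (112.0, 41.8),
    "GB200, diffusion": (850.0, 253.0),
    "GB200, custom CUDA kernels": (1015.0, 253.0),
}


def speedup(fast_tps: float, base_tps: float) -> float:
    """Ratio of the faster throughput to the baseline throughput."""
    return fast_tps / base_tps


def time_fraction(fast_tps: float, base_tps: float) -> float:
    """Fraction of baseline time needed to emit the same number of tokens."""
    return base_tps / fast_tps


if __name__ == "__main__":
    for label, (fast, base) in FIGURES.items():
        print(
            f"{label}: {speedup(fast, base):.2f}x speedup, "
            f"{time_fraction(fast, base):.0%} of baseline time"
        )
```

On the DGX Spark numbers the time fraction comes out a little above one third, which is where the "roughly a third of the time" estimate comes from.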
Why It Matters
Diffusion-based LLMs have been a research thread for two years and a product category for less than one. The pattern treats text generation as a denoising step over multiple tokens at once rather than one token at a time, trading a bit of accuracy for a large speed gain. Nemotron-Labs-Diffusion is the first time NVIDIA has shipped an open-weights diffusion LLM family at multiple parameter sizes with the same model supporting all three decoding modes. That matters because the deployment story for a local creator stack is usually "pick one inference runtime per model." A tri-mode model lets the same checkpoint serve interactive chat (autoregressive), bulk generation (diffusion), and agent loops (self-speculation) without swapping artifacts. It also lands a week after the DeepSeek V4 Flash drop, putting open-weights efficiency back in the news cycle alongside frontier-tier models like Qwen 3.7.
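To make the draft-and-verify idea concrete, here is a toy Python sketch of the generic greedy speculative-decoding accept rule. It is not NVIDIA's implementation: `target_next` and `draft_block` are hypothetical stand-ins for the autoregressive and diffusion paths, built from simple arithmetic so the example runs without any model.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the autoregressive path: a deterministic rule."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for the diffusion path: proposes k tokens at once.

    It follows the target rule but deliberately corrupts some positions,
    mimicking a drafter that is fast but not always right.
    """
    ctx = list(prefix)
    block: List[int] = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # deliberate drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per new token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, keep the verified prefix, fix the first mismatch."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        for tok in draft_block(seq, k):
            expected = target_next(seq)
            seq.append(expected)  # equals tok when the draft is accepted
            if tok != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            seq.append(target_next(seq))  # whole block accepted: bonus token
    return seq[:goal], rounds


if __name__ == "__main__":
    out, rounds = speculative_decode([1, 2, 3], n_new=20, k=4)
    print(out == greedy_decode([1, 2, 3], 20), rounds)
```

Because every kept token is checked against the target rule, the output matches plain greedy decoding exactly. The toy verifies tokens one at a time for clarity; in a real system the verifier would typically score all drafted positions in one batched forward pass, so the number of rounds, not the number of tokens, approximates the expensive work.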
Key Details
- Sizes: 3B, 8B, 14B in Base and Instruct, plus 8B VLM
- License: NVIDIA Nemotron Open Model License (commercial use under standard terms)
- Inference runtimes: vLLM and SGLang, with Docker deployment and an optional LoRA-enhanced drafter
- Speedup: 2.7x on DGX Spark, about 3.4x on GB200 (850 vs. 253 tokens per second), up to 4x with custom CUDA kernels
- Other variants: 14B for higher output quality; 8B VLM for vision-language workflows
What to Do Next
Creators running a local LLM stack on a GPU with 24 GB or more of memory should benchmark the 8B Base variant against their current model on a representative workload. Teams already on vLLM can load the model with trust_remote_code enabled and switch to linear-spec mode without changing the API surface. Hosted-model users get nothing direct yet, but downstream inference providers will likely add Nemotron-Labs-Diffusion within 30 days given the throughput economics.
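A benchmark of this kind can be as simple as timing a fixed set of prompts and dividing generated tokens by elapsed time. The Python sketch below shows one way to do that. The `generate` callable is a hypothetical wrapper you would write around your own runtime (it must return the generated token IDs for a prompt); no vLLM or SGLang calls are assumed here.

```python
import time
from typing import Callable, Iterable, Sequence


def measure_tps(
    generate: Callable[[str], Sequence[int]],
    prompts: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second over all prompts.

    `generate` is a hypothetical user-supplied function that runs one prompt
    through the model under test and returns the generated token IDs.
    `clock` is injectable so the function can be tested deterministically.
    """
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


if __name__ == "__main__":
    # Dummy generator standing in for a real model call.
    def fake_generate(prompt: str) -> Sequence[int]:
        time.sleep(0.01)
        return list(range(32))

    print(f"{measure_tps(fake_generate, ['a', 'b', 'c']):.1f} tokens/s")
```

Run the same prompt set through the current model and through each decoding mode, with identical maximum output lengths, and compare the resulting numbers. A warm-up pass before timing helps avoid counting one-time load or compilation costs.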