NVIDIA Nemotron TwoTower: 2.4x Faster Diffusion LLM

NVIDIA has released Nemotron-Labs-TwoTower-30B, an open-weight diffusion language model that generates text at 2.42 times the wall-clock throughput of a standard autoregressive baseline while keeping 98.7 percent of its aggregate benchmark quality. What makes the release worth a closer look is not the speed number alone but how it gets there: instead of training a diffusion model from scratch, TwoTower bolts a second network onto a frozen autoregressive model and reuses everything the base model already learned. The weights are on Hugging Face under the NVIDIA Nemotron Open Model License, which permits commercial use, and the full method is documented in the TwoTower paper.

Background

For the last two years, diffusion language models have carried a simple promise: generate many tokens in parallel instead of one at a time, and throughput goes up. The catch has almost always been the price of entry. Most diffusion LLMs required training a fresh model end to end, which meant a second full pretraining run and a second set of quality risks. That kept parallel generation in the research column for most teams rather than the production column.

Nemotron-Labs-TwoTower is a direct answer to that cost. It is built on the existing Nemotron-3-Nano-30B-A3B backbone, and that backbone stays frozen. The speed gain comes from an added denoiser, not from a new base model, so a team already running the Nemotron line can adopt the throughput win without a fresh pretraining budget. It follows NVIDIA's earlier tri-mode Nemotron-Labs-Diffusion family, but takes a different architectural route to the same goal.

Figure: TwoTower pairs a frozen context tower with a trained denoiser tower.

Deep Analysis

Two Towers Instead of One Model

The architecture is exactly what the name suggests. A context tower holds the frozen autoregressive backbone and supplies representations. A denoiser tower sits on top and is trained to refine tokens in parallel. Each tower runs 52 layers built from the same interleaved recipe as the backbone: 23 Mamba-2 layers, 6 self-attention layers, and 23 mixture-of-experts layers. The mixture-of-experts layers use 128 routable experts, of which 6 activate per token, plus 2 shared experts. The released checkpoint ships both towers at roughly 60 billion total parameters, with about 3 billion active per token per tower. The two towers communicate through layer-aligned cross-attention, which gives the denoiser multi-scale access to the backbone's representations rather than a single pooled summary.
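The per-tower recipe described above can be written down as a small Python sketch to sanity-check the arithmetic. The class and field names (TowerConfig, routed_fraction, and so on) are illustrative assumptions, not identifiers from the released checkpoint; only the numbers come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerConfig:
    """Hypothetical container for the per-tower figures quoted in the article."""
    mamba2_layers: int = 23
    attention_layers: int = 6
    moe_layers: int = 23
    routable_experts: int = 128
    active_routed_experts: int = 6
    shared_experts: int = 2

    @property
    def total_layers(self) -> int:
        # 23 Mamba-2 + 6 self-attention + 23 MoE
        return self.mamba2_layers + self.attention_layers + self.moe_layers

    @property
    def experts_used_per_token(self) -> int:
        # Routed experts that fire plus the always-on shared experts
        return self.active_routed_experts + self.shared_experts

    @property
    def routed_fraction(self) -> float:
        # Share of the routable expert pool that is active for one token
        return self.active_routed_experts / self.routable_experts


tower = TowerConfig()
print(f"layers per tower: {tower.total_layers}")
print(f"experts used per token: {tower.experts_used_per_token}")
print(f"routed pool active per token: {tower.routed_fraction:.1%}")
```

The small active fraction is why a checkpoint near 60 billion total parameters can run with only about 3 billion active per token per tower.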

Why a Frozen Backbone Changes the Cost Equation

The economic story is in the training split. The frozen backbone was pretrained on roughly 25 trillion tokens; the added denoiser was trained on about 2.1 trillion. That is under a tenth of the base model's data budget (about 8 percent) to unlock parallel generation. Because the base weights never move, the model retains the behavior teams already validated, and the risk of adopting it is confined to a bounded add-on rather than a new pretraining gamble. This is the practical difference between a research result and something a platform team will actually deploy: the quality baseline is inherited, not re-earned.
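The training-budget ratio can be reproduced in a few lines of Python. This is only arithmetic on the two token counts quoted above; the constant and function names are illustrative, and the token counts themselves are approximate.

```python
BACKBONE_TOKENS = 25e12   # roughly 25 trillion tokens for the frozen backbone
DENOISER_TOKENS = 2.1e12  # about 2.1 trillion tokens for the added denoiser


def budget_ratio(added_tokens: float, base_tokens: float) -> float:
    """Fraction of the base pretraining budget spent on the add-on."""
    return added_tokens / base_tokens


ratio = budget_ratio(DENOISER_TOKENS, BACKBONE_TOKENS)
print(f"denoiser budget as a share of backbone budget: {ratio:.1%}")
```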

Diffusion, Speculation, and the Race for Parallel Tokens

TwoTower is one of several 2026 approaches chasing the same target from different directions. Speculative decoding, as in DeepSeek DSpark, drafts tokens with a small model and verifies them with a large one. Tri-mode diffusion, in NVIDIA's earlier Nemotron-Labs-Diffusion, switches a single checkpoint between autoregressive, diffusion, and self-speculative modes. TwoTower's contribution is to separate representation from denoising into distinct networks so the base model can stay untouched. The shared thesis across all three is that one-token-at-a-time generation is the bottleneck, and 2026 is the year the open-weights world stopped treating it as fixed.
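To make the draft-and-verify idea concrete, here is a minimal Python sketch of greedy speculative decoding. It is a generic textbook version, not the method of any specific system named above, and toy_target and toy_draft are hypothetical stand-ins for a large and a small model. A real implementation would verify all drafted positions in one batched forward pass instead of the loop used here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    # Hypothetical "large" model: counts upward modulo 10.
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after a 3.
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # 1. Draft k tokens cheaply with the small model.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify each drafted token against the target's greedy choice.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


print(generate([0], toy_draft, toy_target, k=4, max_new=12))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
```

With greedy verification the output is identical to what the target model would produce alone; the gain is that each round can commit several tokens. Diffusion-style generation targets the same bottleneck differently, by refining a block of tokens in parallel rather than checking a draft.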

Figure: Parallel generation is the common target; the architectures differ.

Impact on Creators

For anyone serving an open model in production, throughput is cost. A model that writes tokens in parallel blocks serves more requests on the same hardware, which directly lowers the price of agent loops, batch content generation, and code completion. TwoTower's numbers put a real figure on that: 2.42x wall-clock generation throughput at 98.7 percent of the baseline's aggregate quality. The tradeoff is memory. Running the full diffusion path takes 2 GPUs at about 59GB per GPU in BF16, so this is a workstation-and-up deployment, not a laptop model. Teams already serving the Nemotron line on NVIDIA GPUs, for example through an open-source runtime like vLLM, are the natural first adopters, since the backbone and tooling are already familiar. Benchmark against your own workload rather than relying on the published aggregate, because the speedup depends on your batch sizes and sequence lengths.
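A benchmark of that kind can start from something as small as the Python sketch below. It assumes you can wrap each serving path in a function that takes a prompt and returns the generated token ids; fake_generate is a hypothetical stand-in for that wrapper, and the GPU price and throughput figures in the usage lines are made-up placeholders, not measurements. The loop is sequential, so it does not capture batching effects.

```python
import time
from typing import Callable, Iterable, List


def measure_throughput(generate: Callable[[str], List[int]],
                       prompts: Iterable[str]) -> float:
    """Return generated tokens per wall-clock second over a set of prompts."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def cost_per_million_tokens(gpu_hour_usd: float, gpus: int,
                            tokens_per_second: float) -> float:
    """Convert a measured throughput into dollars per million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * gpus / tokens_per_hour * 1_000_000


def fake_generate(prompt: str) -> List[int]:
    # Hypothetical stand-in for a real model call: always "generates" 64 tokens.
    return list(range(64))


measured = measure_throughput(fake_generate, ["prompt one", "prompt two"])
print(f"measured tokens/s (fake model): {measured:.0f}")

# Placeholder inputs: $2 per GPU-hour, 2 GPUs, 1000 tokens/s baseline.
baseline = cost_per_million_tokens(2.0, gpus=2, tokens_per_second=1000.0)
parallel = cost_per_million_tokens(2.0, gpus=2, tokens_per_second=1000.0 * 2.42)
print(f"baseline: ${baseline:.2f} per 1M tokens")
print(f"at 2.42x: ${parallel:.2f} per 1M tokens")
```

The gpus argument is kept separate on purpose: if the two paths you compare do not need the same number of GPUs, the cost ratio will differ from the raw throughput ratio.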

Figure: Throughput gains translate directly into lower per-request inference cost.

Key Takeaways

TwoTower reaches 2.42x generation throughput at 98.7 percent quality by adding a trained denoiser to a frozen autoregressive backbone, not by training a new model.
The denoiser was trained on about 2.1 trillion tokens versus the backbone's 25 trillion, under a tenth of the data budget to unlock parallel generation.
The architecture is two 52-layer towers of Mamba-2, attention, and MoE blocks, near 60B total parameters with about 3B active per token per tower.
It needs 2 GPUs at roughly 59GB each in BF16, positioning it as a workstation-class deployment.
The license permits commercial use, and the weights are downloadable now from the NVIDIA Hugging Face collection.

What to Watch

The obvious next step is quantization. A 2-GPU BF16 requirement is the current memory bar, and quantized variants typically follow open-weight releases within days, lowering that bar and widening who can run the model locally. Beyond that, the frozen-backbone pattern is the interesting thread to track: if a denoiser can be trained on a tenth of the data to unlock parallel generation on top of any existing Nemotron checkpoint, the same approach could be applied across a model family without re-pretraining each one. Watch whether NVIDIA extends TwoTower to larger backbones and whether other labs adopt the separate-tower design over the single-checkpoint tri-mode route. The question that matters for builders is no longer whether parallel generation works, but which of these competing architectures becomes the default way to serve open models cheaply.
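A rough sense of how far quantization could lower the memory bar comes from a weights-only estimate, sketched below in Python. The bytes-per-parameter table and the function name are assumptions for illustration, and the estimate ignores activations, caches, and the per-format overhead of real quantization schemes, so actual requirements will be higher. No quantized variant is confirmed by the source.

```python
# Assumed storage cost per parameter for each format (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params: float, dtype: str, gpus: int = 1) -> float:
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9 / gpus


TOTAL_PARAMS = 60e9  # roughly 60 billion parameters across both towers

for dtype in ("bf16", "int8", "int4"):
    per_gpu = weight_memory_gb(TOTAL_PARAMS, dtype, gpus=2)
    print(f"{dtype}: about {per_gpu:.0f} GB of weights per GPU on 2 GPUs")
```

The BF16 line lands close to the roughly 59GB per GPU quoted for the current release, which suggests the weights dominate the footprint.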

Frequently Asked Questions

What is NVIDIA Nemotron-Labs-TwoTower?

It is an open-weight diffusion language model from NVIDIA that generates text at 2.42 times the throughput of a standard autoregressive baseline while retaining 98.7 percent of its aggregate benchmark quality. It is available on Hugging Face under a license permitting commercial use.

How does TwoTower differ from a normal diffusion LLM?

Most diffusion LLMs are trained from scratch. TwoTower keeps an existing autoregressive model frozen as a context tower and adds a separate trained denoiser tower on top, so the speedup comes without re-pretraining the base model.

What hardware does it need?

Running the full diffusion path requires 2 GPUs at roughly 59GB each in BF16. That makes it a workstation-class or server-class deployment rather than a laptop model, though quantized variants will likely lower the memory requirement.

Can I use TwoTower commercially?

Yes. The weights are released under the NVIDIA Nemotron Open Model License, which permits commercial use under its standard terms. The model is downloadable from NVIDIA's Hugging Face page.

How does it compare to speculative decoding?

Both aim to break the one-token-at-a-time bottleneck. Speculative decoding drafts with a small model and verifies with a large one, while TwoTower generates tokens in parallel through a denoiser tower attached to a frozen backbone. They are different routes to the same throughput goal.
