NVIDIA released Nemotron 3 Ultra on June 4, 2026, a 550 billion parameter Mixture-of-Experts model that activates only 55 billion parameters per token. On agent benchmarks, the model delivers 5x faster throughput than comparable open-weight models in its class and reduces task completion costs by 30%. Weights are available on HuggingFace under the OpenMDW-1.1 open license, with API access through OpenRouter, AWS JumpStart, Google Cloud, Microsoft Foundry, and 15 additional platforms as of launch.

What Shipped: Three Models, Not One

The Nemotron 3 family launched with a full stack rather than a single model. Three components shipped together, each targeting a different layer of a production agent pipeline:

NVIDIA Nemotron 3 Ultra three model lineup
  • Nemotron 3 Ultra 550B: The flagship reasoning model built for long-running AI agents with a 1 million token context window
  • Nemotron 3.5 Content Safety 4B: A guardrail model covering 23 safety categories across 12 languages
  • Nemotron 3.5 ASR: Streaming multilingual speech recognition for 40+ languages

According to the official NVIDIA developer blog, the release targets specifically the "long-running agent" use case: multi-step pipelines that need to maintain state, reason over large document sets, or coordinate sub-tasks over extended sessions.

The Architecture Behind the Throughput Numbers

Most large reasoning models are dense: all parameters fire on every token. Nemotron 3 Ultra is a Mixture-of-Experts model, so the 550 billion total parameters describe the available pool, not the computation per forward pass. Only 55 billion parameters activate per token. That 10x parameter-to-active-parameter ratio is what makes the 5x throughput claim credible: you get reasoning depth from a 500B+ parameter pool at compute closer to a 55B dense model.

Two architectural choices compound this further. Hybrid Mamba-Transformer layers handle long sequences more efficiently than pure attention, especially above 100,000 tokens, where standard attention quadratic scaling becomes expensive. LatentMoE handles expert routing with lower overhead than standard sparse routing implementations. Together these two choices underpin the 95% score on the Ruler long-context benchmark at 1 million tokens.

NVIDIA trained the model using Multi-Teacher On-Policy Distillation (MOPD), a technique where feedback from over ten domain-specific teacher models drives continuous improvement. The quantization format is NVFP4, a 4-bit floating point precision that NVIDIA supports across its Hopper, Blackwell, and Ampere GPU families. The technical report covers MOPD methodology and the full benchmark suite in detail.

Benchmark Results for Agent Work

Three numbers from the technical report define where this model fits in the agent landscape:

Nemotron 3 Ultra 5x throughput benchmark
  • Agent Productivity (PinchBench): 91%. Tests completion of realistic productivity tasks through an agent interface.
  • Long-Horizon Planning (EnterpriseOps-Gym): 33%. Tests multi-step coordination of full enterprise processes.
  • Long Context (Ruler at 1M tokens): 95%. Tests information retrieval and reasoning across very large context windows.

The PinchBench 91% is the figure that matters most for day-to-day creative agent pipelines: it measures how reliably the model completes tasks through a tool-augmented interface. EnterpriseOps-Gym at 33% looks lower, but that benchmark tests full multi-department process orchestration, a category where all current models score under 50%. The 95% on Ruler at 1M tokens is a genuine capability milestone for any workflow requiring document-scale context.

As a comparison point, the recently released Gemma 4 12B is an efficient multimodal model at 12 billion parameters, suited for on-device or inference-constrained use cases. Nemotron 3 Ultra occupies the opposite end of the spectrum: maximum reasoning capability for cloud-hosted agent tasks where throughput cost matters more than model size.

How to Build a Long-Running Creative Agent With Nemotron 3 Ultra

The 5x throughput advantage is most visible in pipelines that make many sequential model calls. Here is a practical workflow for a creative research and generation agent using Nemotron 3 Ultra as the reasoning backbone:

Step 1: Access the model through OpenRouter or build.nvidia.com. OpenRouter provides a standard OpenAI-compatible API endpoint, so any existing agent code using the OpenAI SDK can switch models by changing the model string. The NVIDIA build platform offers a browser-based playground for testing prompts before committing to API integration.

Step 2: Load your full creative brief into the context window. With 1 million tokens available, you can include full brand guidelines, a complete content history, reference articles, style guides, and a production spec in a single prompt rather than chunking across multiple calls. This eliminates retrieval errors that occur when a model only sees partial context.

Step 3: Structure the agent loop for multi-step tasks. Nemotron 3 Ultra is specifically trained for multi-turn tool use. Define your tools (image search, text generation, asset retrieval, metadata lookup) in the system prompt, then let the model reason through a task queue. The multi-token prediction capability speeds up generation in loops that produce structured outputs like JSON or markdown.

Step 4: Add Content Safety guardrails with Nemotron 3.5 Content Safety. Run both the input prompt and the generated output through the 4B guardrail model before passing results downstream. The 23 safety categories cover brand safety concerns relevant to creative work, including misinformation, inappropriate content, and off-topic generation. The 12-language support means this holds for multilingual creative pipelines.

Step 5: Wire in Nemotron 3.5 ASR if your workflow takes voice input. For pipelines triggered by voice notes, recorded briefs, or meeting transcriptions, Nemotron 3.5 ASR provides streaming multilingual recognition across 40+ languages. The streaming mode means transcription starts before the recording ends, which reduces latency in real-time production workflows.

Where You Can Run Nemotron 3 Ultra Today

The model is available through a broad set of platforms at launch, which reduces the barrier compared to models that require direct NVIDIA infrastructure access:

Deploying Nemotron 3 Ultra
  • HuggingFace: Weights available as NVFP4 for self-hosted inference on supported NVIDIA hardware
  • OpenRouter: API access compatible with existing OpenAI SDK integrations
  • AWS JumpStart: Managed deployment with SageMaker integration
  • Google Cloud: Vertex AI deployment
  • Microsoft Foundry: Azure-native access
  • CoreWeave, Together AI, Baseten: GPU cloud providers for high-throughput use cases
  • NVIDIA build.nvidia.com: Browser-based testing playground

The NeMo GitHub repository includes tooling for fine-tuning and local evaluation. The OpenMDW-1.1 license permits commercial use with attribution, similar to the terms on recent Llama and Mistral releases.

For context on where this sits in the current agent tool landscape: OpenAI recently deprecated its no-code Agent Builder, pushing developers toward API-first agent construction. Nemotron 3 Ultra is designed for exactly that: code-driven agent pipelines rather than visual builders.

What to Do Next

The fastest way to evaluate Nemotron 3 Ultra for your workflow is to prototype a single-agent task on build.nvidia.com using a prompt you already run on another model. Compare first-call latency and output quality before moving to API integration. If you run a pipeline that makes more than 20 model calls per task, the 5x throughput difference translates to a concrete cost comparison: run the same task on Nemotron 3 Ultra and your current model, price both at their API rates, and the 30% cost savings claim either holds or does not for your specific workload.


Frequently Asked Questions

What is the difference between 550B total and 55B active parameters?

Mixture-of-Experts models store a large pool of "expert" networks but only route each token through a small subset of them. Nemotron 3 Ultra has 550 billion parameters across all experts, but any single token activates only 55 billion of them. This gives the model access to a large knowledge pool while keeping the compute per token much lower than a dense 550B model would require.

Can I run Nemotron 3 Ultra on consumer hardware?

Not currently. The NVFP4 weights require NVIDIA Hopper, Blackwell, or Ampere datacenter GPUs. The quantized format reduces memory requirements significantly compared to FP16 or BF16, but 55 billion active parameters still require multiple high-end server GPUs. For consumer-grade inference, the Nemotron 3.5 Content Safety 4B model is runnable on a single RTX-class GPU.

What is NVFP4 quantization and why does it matter?

NVFP4 is NVIDIA's 4-bit floating point precision format, supported natively on Hopper (H100/H200) and Blackwell (B100/B200) tensor cores. Quantizing to 4 bits roughly halves memory bandwidth requirements compared to INT8 and quarters it compared to BF16, which is the primary mechanism enabling the 5x throughput improvement. Quality degradation at 4-bit is minimal for reasoning tasks when the quantization is done during training rather than post-hoc.

How does the 1M token context window help in creative workflows?

A 1 million token context window holds roughly 750,000 words, equivalent to a full feature film script with all revision history, a complete brand style guide, and several hundred reference articles, simultaneously. For creative pipelines, this means you can run consistency checks, style matching, and cross-reference validation in a single call rather than chunking documents and aggregating results. The 95% Ruler benchmark score at 1M tokens indicates the model actually uses the distant context rather than ignoring it, which is the practical failure mode of many models that claim long-context support.

What is the OpenMDW-1.1 license?

OpenMDW-1.1 (Open Model Distribution and Weights License) is a permissive open license that allows commercial use, fine-tuning, and redistribution with attribution. It requires you to retain the license notice and restricts use for training competing foundation models without additional permission. For most commercial creative applications and SaaS products built on top of the model, the license is permissive enough to use without legal review.

How does Nemotron 3 Ultra compare to Claude Sonnet and GPT-4 class models?

NVIDIA's benchmark suite uses different evaluation sets than Anthropic and OpenAI publish, so direct benchmark comparisons are not available from the launch materials. The practical differentiation is deployment model: Nemotron 3 Ultra is an open-weight model with self-hosted and multi-cloud options, while Claude and GPT-4 are closed API products. The 5x throughput figure is relative to other open-weight models in the same performance class, not against proprietary APIs.


Related Deep Dives