On March 11, 2026, NVIDIA released Nemotron 3 Super, an open-source 120B-parameter hybrid Mamba-Transformer model that delivers 5x the throughput of its predecessor on agentic AI workloads. Only 12B parameters are active per inference step, making enterprise-scale AI agents practical on standard infrastructure.
What Happened
NVIDIA announced Nemotron 3 Super as a fully open model with open weights, datasets, and training recipes. The architecture uses a hybrid Latent Mixture-of-Experts (LatentMoE) design that interleaves Mamba-2 and MoE layers with a small number of attention layers. Of the model's 120B total parameters, only 12B are active during any given inference pass.
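The sparse-activation figures above can be put in perspective with some back-of-envelope arithmetic. The 120B/12B numbers come from the announcement; the "2 FLOPs per active parameter per token" rule of thumb is a standard rough estimate for a forward pass, not an NVIDIA figure.

```python
# Back-of-envelope sketch of the sparse-activation ratio described above.
# Parameter counts are from the announcement; the per-token FLOP figure
# uses the common ~2 FLOPs per active parameter heuristic (an assumption).

TOTAL_PARAMS = 120e9   # total parameters in the model
ACTIVE_PARAMS = 12e9   # parameters activated per inference step

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS  # rough forward-pass estimate

print(f"Active fraction: {active_fraction:.0%}")   # 10%
print(f"~FLOPs per token: {flops_per_token:.1e}")  # 2.4e+10
```

In other words, each token pays roughly the compute cost of a dense 12B model while drawing on 120B parameters of capacity, which is where the throughput advantage comes from.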
The model ships with a native 1-million-token context window, designed to give AI agents long-term memory for multi-step reasoning tasks. NVIDIA positions this as a solution to the "context explosion" problem that limits current agent architectures. On benchmarks, Super matches or exceeds models with several times its effective compute cost.
Availability is broad from day one. The model is accessible through build.nvidia.com, Hugging Face, OpenRouter, and Perplexity. Cloud partners include Google Cloud Vertex AI, Oracle Cloud, CoreWeave, Together AI, Baseten, Cloudflare, DeepInfra, Fireworks AI, and Modal.
Why It Matters for Creators
The 5x throughput improvement directly affects anyone building AI-powered creative workflows. If you run multi-agent pipelines where one model handles text, another handles image prompts, and a third handles code, Nemotron 3 Super can potentially replace all three with a single model at lower cost. The 1M context window means agents can hold entire project briefs, style guides, and revision histories in memory without losing context.
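One way to consolidate that kind of multi-agent pipeline is to keep a single model and switch system prompts per role. The sketch below is illustrative only: the role prompts are invented, and `call_model` is a placeholder for whatever chat client you actually use (OpenRouter, build.nvidia.com, or a local server), not an NVIDIA API.

```python
# Minimal sketch: three single-purpose "agents" collapsed into one model
# by varying the system prompt. All names here are illustrative.

ROLE_PROMPTS = {
    "copy":   "You write and revise marketing copy against the style guide.",
    "visual": "You turn creative briefs into detailed image-generation prompts.",
    "code":   "You write and review pipeline automation scripts.",
}

def call_model(system_prompt: str, user_msg: str) -> str:
    # Placeholder: swap in a real chat-completions call here.
    return f"[{system_prompt.split('.')[0]}] -> {user_msg}"

def dispatch(role: str, task: str) -> str:
    """Route a task to the shared model with a role-specific system prompt."""
    return call_model(ROLE_PROMPTS[role], task)

print(dispatch("visual", "Storyboard frame 3: neon rooftop at dusk"))
```

With a 1M-token window, the same shared context (brief, style guide, revision history) can sit in every role's prompt rather than being re-summarized per agent.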
The open-source release with full training recipes means studios and independent developers can fine-tune Nemotron 3 Super for domain-specific creative tasks. A visual effects studio could train it on their pipeline documentation. A game developer could customize it for their engine's API. No licensing fees, no vendor lock-in.
What to Do Next
Download the model weights from Hugging Face or try it through NVIDIA's API. If you are building agentic workflows, the FP8-quantized version runs efficiently on a single A100 or H100 GPU. Read the technical report for architecture details and fine-tuning recipes, and watch NVIDIA's GTC 2026 keynote on March 16 for more announcements building on this model family.
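For trying the hosted API, a request would look roughly like the sketch below. The endpoint URL follows the OpenAI-compatible convention NVIDIA uses for build.nvidia.com models; the model id "nvidia/nemotron-3-super" is a guess on my part, so check the model card for the exact name before using it.

```python
# Sketch of a chat-completions request body for NVIDIA's hosted endpoint.
# The model id below is an assumption -- verify it on build.nvidia.com.
import json

payload = {
    "model": "nvidia/nemotron-3-super",  # hypothetical id, check the model card
    "messages": [{"role": "user", "content": "Summarize this project brief."}],
    "max_tokens": 512,
}

# POST this (with your API key as a Bearer token) to:
#   https://integrate.api.nvidia.com/v1/chat/completions
print(json.dumps(payload, indent=2))
```

The same payload works against any OpenAI-compatible server, so you can point it at a local deployment of the open weights without changing the request shape.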
This story was covered by Creative AI News.
Subscribe for free to get the weekly digest every Tuesday.