NVIDIA released Nemotron 3 Ultra on June 1, 2026, a 550B mixture-of-experts model with 55B active parameters, open weights on Hugging Face. NVIDIA claims it is the "most intelligent US open weights model" with 5x faster inference and 30% lower cost than the previous Nemotron generation.
What Happened
Jensen Huang announced Nemotron 3 Ultra during the GTC Taipei keynote, framing it as NVIDIA's direct answer to DeepSeek V4 and Qwen 3.7 Max in the open-weights race. The model is available immediately on the NVIDIA Hugging Face org under a commercial-friendly license.
The MoE architecture activates 55B parameters per token from a 550B parameter pool, giving frontier-quality reasoning at the inference cost of a 55B dense model. NVIDIA cites 5x throughput gains over Nemotron 2 70B and 30% lower per-token cost on Blackwell-class hardware.
The release ships with NIM container support, vLLM compatibility, and NVFP4 quantized checkpoints that fit on a single B200 GPU.
Why It Matters for Creators
For creators using LLMs in their pipelines (script generation, asset tagging, captioning, agentic workflows), Nemotron 3 Ultra slots in as a self-hostable alternative to GPT-5 and Claude Opus 4.8 at frontier quality. The 30% cost reduction matters most for high-volume work: video transcription, batch caption generation, and asset library tagging that currently routes through paid APIs.
The NVFP4 checkpoint is the practical headline. NVFP4 fits the full 550B MoE on a single Blackwell B200 with room for context, which means small studios with one workstation card (or rented Lambda Labs hourly) can run frontier inference without multi-GPU orchestration.
Key Details
Architecture: 550B MoE, 55B active per token
License: Commercial-friendly open weights
Download: huggingface.co/nvidia
Performance: 5x faster inference vs Nemotron 2 70B
Cost: 30% lower per-token on Blackwell
Quantization: NVFP4 checkpoint fits single B200
Serving: NIM containers, vLLM, TensorRT-LLM
How to Try It
Three paths to test it this week. The fastest is the NVIDIA NIM-hosted endpoint on build.nvidia.com for free in-browser inference. Second is a Lambda Labs B200 instance ($3-4/hr) with the NVFP4 checkpoint and vLLM. Third, if you already run an Ollama or LM Studio setup, the FP8 community quants will appear on Hugging Face within days; expect a 4-bit GGUF that runs on dual RTX 6000 Ada (96GB total) for a self-hosted dev workstation.
For pipeline integrators, the vLLM OpenAI-compatible endpoint means dropping Nemotron 3 Ultra in as a replacement for any GPT-4 class API client is a config change, not a code change.
This story was covered by Creative AI News.
Subscribe for free to get the weekly digest every Tuesday.