NVIDIA released Nemotron 3 Nano Omni on April 28: a 30B-parameter open-weight model that takes text, images, video, and audio as native inputs and reasons across all four modalities at once. For creators working with long-form video, narrated screen recordings, or document-heavy workflows, this is the most capable open omni-model shipped to date.
What Happened
The model is built on the Nemotron 3 Nano 30B-A3B language backbone, a hybrid Mamba-Transformer-MoE architecture with 128 experts and top-6 routing. NVIDIA pairs it with the C-RADIOv4-H vision encoder and the Parakeet-TDT-0.6B-v2 audio encoder, then trains the system to handle audio-visual joint reasoning natively rather than collapsing audio into transcripts first. Checkpoints ship in three precision variants on Hugging Face: BF16, FP8, and an 18B NVFP4 build for tighter deployment budgets.
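The top-6-of-128 routing can be sketched in a few lines. This is an illustrative top-k router for a single token, not NVIDIA's implementation; only the expert count and k come from the release, everything else is a generic MoE pattern:

```python
import numpy as np

def top_k_route(router_logits: np.ndarray, k: int = 6):
    """Illustrative top-k expert routing for one token.

    router_logits: shape (num_experts,), one score per expert.
    Returns the indices of the k selected experts and their
    softmax-normalized mixing weights (descending order).
    """
    # Pick the k highest-scoring experts (argpartition is O(n)).
    top_idx = np.argpartition(router_logits, -k)[-k:]
    top_idx = top_idx[np.argsort(router_logits[top_idx])[::-1]]
    # Softmax over the selected experts only, so weights sum to 1.
    logits = router_logits[top_idx]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return top_idx, weights

# 128 experts, top-6 routing, matching the Nemotron backbone's config.
rng = np.random.default_rng(0)
idx, w = top_k_route(rng.normal(size=128), k=6)
```

Each token's hidden state would then be the weighted sum of the six selected experts' outputs, which is what keeps active parameters far below the 30B total.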
Why It Matters
Open multimodal models that handle video and audio together at frontier quality have been the missing piece for creator agents. Most open vision-language models stop at image and video frames; audio gets bolted on as a separate ASR step. Nemotron 3 Nano Omni does joint reasoning, which means it can answer questions like "what is the speaker pointing at when they mention pricing" without losing the temporal alignment between voice and visuals. Benchmarks back the pitch: 72.2 on Video-MME, 74.1 on DailyOmni, 89.4 on VoiceBench, and best-in-class scores on document-heavy benchmarks like OCRBenchV2-En (65.8) and MMLongBench-Doc (57.5).
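To see why keeping that alignment matters, here is a toy sketch of matching a spoken word to the nearest video frame by timestamp. All names and timestamps are invented; the point is that a transcript-first pipeline discards exactly the timing information a joint model retains:

```python
def frame_at_mention(word_times, frames, word):
    """Find the video frame closest in time to when a word is spoken.

    word_times: list of (word, timestamp_seconds) from ASR.
    frames: list of (frame_id, timestamp_seconds).
    A joint audio-visual model learns this alignment internally;
    once audio is collapsed to plain text, it is gone.
    """
    t = next(ts for w, ts in word_times if w == word)
    return min(frames, key=lambda f: abs(f[1] - t))

# Invented example: the speaker says "pricing" at 12.4s.
words = [("our", 12.0), ("pricing", 12.4), ("tiers", 12.9)]
frames = [("frame_300", 10.0), ("frame_372", 12.4), ("frame_450", 15.0)]
frame_id, ts = frame_at_mention(words, frames, "pricing")
# frame_id == "frame_372"
```

This external lookup is the crude workaround today's transcript-based stacks rely on; the joint model folds the same correspondence into its reasoning directly.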
Key Details
The efficiency story is as important as the benchmarks. NVIDIA reports 7.4x higher system efficiency on multi-document workloads versus alternatives, 9.2x higher on video, and 9x higher throughput overall on multimodal tasks. The model handles audio inputs up to 1,200 seconds (20 minutes) and documents in the 100-plus page range. For grounding work, dynamic resolution lets it process anywhere from 1,024 to 13,312 visual patches per image while keeping native aspect ratios, which matters for OCR on tables, charts, and screenshots.
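A rough sketch of how a dynamic-resolution patch budget like that might work: scale the image until its patch count lands inside the budget, preserving aspect ratio. The 16-pixel patch edge and the uniform-scaling rule are assumptions for illustration; only the 1,024–13,312 budget comes from the release:

```python
import math

PATCH = 16            # assumed patch edge in pixels (illustrative)
MIN_PATCHES = 1_024   # budget bounds from the release notes
MAX_PATCHES = 13_312

def patch_grid(width: int, height: int):
    """Scale an image so its patch count fits the budget while
    keeping its native aspect ratio; return the (cols, rows) grid."""
    patches = (width / PATCH) * (height / PATCH)
    # A uniform scale factor s changes the patch count by s**2,
    # so take a square root to hit the budget boundary.
    if patches > MAX_PATCHES:
        scale = math.sqrt(MAX_PATCHES / patches)
    elif patches < MIN_PATCHES:
        scale = math.sqrt(MIN_PATCHES / patches)
    else:
        scale = 1.0
    cols = max(1, math.floor(width * scale / PATCH))
    rows = max(1, math.floor(height * scale / PATCH))
    return cols, rows

# A 4K screenshot (3840x2160) exceeds the budget and gets downscaled.
cols, rows = patch_grid(3840, 2160)
```

Because the scale factor is applied to both axes, a wide table or a tall receipt keeps its shape instead of being squashed to a fixed square, which is the property that matters for OCR.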
Everything ships open: model weights, the technical report, training datasets, and a full implementation guide via Megatron-Bridge. NVIDIA also released the data generation recipes used to build the 11.4M synthetic QA pairs that drove the document reasoning gains, so the training pipeline itself is reproducible by other labs.
What to Do Next
If you build video or audio agent workflows, pull the BF16 checkpoint and test joint audio-visual queries on your own footage before committing to a closed-weight provider. If you work with document-heavy pipelines, the FP8 build is the sweet spot for quality and inference cost. Nemotron 3 Nano Omni continues NVIDIA's open-weights push from earlier this year; for context on where the line started, see the Nemotron 3 Super 120B release and the broader Nemotron coalition coverage.