NVIDIA shipped Nemotron 3.5 ASR on June 4, 2026: a 600-million-parameter streaming speech recognition model covering 40 language-locales from a single checkpoint, with configurable latency as low as 80 milliseconds. The weights are open under the OpenMDW 1.1 license, and the model is already live on Baseten, DeepInfra, Eigen AI, fal, ModelScope, and Together AI.
Try It: Drop Nemotron 3.5 ASR Into a Voice Pipeline Today
If you already use the NeMo framework, you can have multilingual transcription running locally in about ten minutes. Clone the NVIDIA-NeMo repo, point the cache-aware streaming inference script at your audio manifest, and set target_lang=auto to let the model detect the language per clip, or pin a locale like es-ES for cleaner output on known content. The att_context_size parameter trades latency for accuracy at inference time without retraining, so the same checkpoint serves an 80ms voice agent and a 1.12-second batch transcription job. For hosted inference without managing GPUs, the model is one click away on build.nvidia.com.
Why It Matters
Most open ASR models still ship as English-only checkpoints or force you to swap weights per language. A single 600M checkpoint that handles English, Spanish, Mandarin, Arabic, Japanese, Korean, Hindi, Thai, and 32 other locales removes a real production headache for creators building multilingual captioning, podcast indexing, or voice agents. The cache-aware streaming design means each new audio chunk reuses prior context instead of recomputing the overlap, which is what makes the sub-100ms latency target realistic on a single consumer GPU rather than a server cluster.
Key Details
Architecture is FastConformer-RNNT with cache-aware streaming, and native punctuation plus capitalization are baked into the output so transcripts arrive readable instead of as a wall of lowercase tokens. Latency scales from 80 milliseconds at att_context_size=[56,3] up to 1.12 seconds for higher accuracy. The license, OpenMDW 1.1, permits commercial use and redistribution. Fine-tuning recipes for adapting the model to a specific accent, language, or domain ship in the nvidia-riva tutorials repo. NVIDIA also flagged gRPC streaming via NIM as coming later in June.
What to Do Next
If you ship voice agents, podcast tools, or live-captioning features, swap your current English-only Whisper or wav2vec checkpoint for Nemotron 3.5 ASR in a staging branch and measure word error rate on your own clips before committing. Creators who maintain content in more than one language get the biggest jump: one checkpoint replaces a per-language router and cuts deployment cost. For everyone else, the 80ms streaming mode opens the door to real-time UX patterns that were previously priced out at the open-weights tier.