A new open-source project called parakeet.cpp ports NVIDIA's Parakeet automatic speech recognition models to the ggml inference library, removing the Python runtime requirement entirely. The project appeared on GitHub on May 28, 2026. It supports every published Parakeet checkpoint, produces transcripts byte-identical to NVIDIA's official NeMo implementation at f32 precision, and runs up to 1.86x faster than NeMo on CPU when quantized to q8_0.

NVIDIA's Parakeet models rank among the most accurate open-source speech-to-text models available for English audio. Running them has historically meant setting up a full NeMo and PyTorch environment. parakeet.cpp changes that by delivering a compiled binary that handles transcription, streaming, and GGUF-quantized inference with no Python dependency at runtime.

What Is NVIDIA Parakeet

NVIDIA Parakeet is a family of automatic speech recognition models trained using the NeMo framework. NVIDIA distributes the checkpoints through HuggingFace and the NVIDIA NGC catalog.

The family includes several decoder architectures optimized for different use cases:

  • CTC (connectionist temporal classification): fast, streaming-friendly, models at 110M, 0.6B, and 1.1B
  • RNNT (recurrent neural network transducer): high accuracy with online decoding, 0.6B and 1.1B
  • TDT (token-and-duration transducer): NVIDIA's flagship ASR architecture, 0.6B and 1.1B, including the English 0.6B v2 and multilingual 0.6B v3 variants
  • Hybrid TDT+CTC: combined decoder for flexibility, 110M and 1.1B
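
To make the CTC entry more concrete, here is a minimal sketch of greedy CTC decoding: the model emits one label per audio frame, consecutive repeats are collapsed, and the special blank label is dropped. The toy vocabulary and the blank index of 0 are assumptions for illustration only; real Parakeet checkpoints use a subword tokenizer, and this is not code taken from parakeet.cpp or NeMo.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse consecutive repeated labels, then remove blank labels."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank_id]


# Toy vocabulary (hypothetical); index 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "h", 2: "i"}

# One predicted label per audio frame.
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
tokens = ctc_greedy_decode(frames)
print("".join(vocab[t] for t in tokens))  # prints "hi"
```

The blank label is what lets CTC emit a genuinely doubled token: the frame sequence 1, 0, 1 decodes to two tokens, while 1, 1 decodes to one.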

On standard benchmarks, the 1.1B TDT model matches or exceeds much larger general-purpose models on English transcription tasks. Until now, running any of these models required the full NeMo Python stack: dozens of dependencies and a significant environment setup.

What parakeet.cpp Delivers

parakeet.cpp is built in C++17 on top of ggml, the same tensor library that powers llama.cpp and whisper.cpp. The project is authored by mudler, who also created LocalAI, the self-hosted server that provides OpenAI-compatible endpoints for local models.

Key capabilities at launch:

  • No Python or PyTorch required at inference time
  • CPU inference with optional GPU acceleration via CUDA, Metal (Apple Silicon), Vulkan, or ROCm/HIP
  • GGUF quantization support: f32, f16, q8_0, q4_0, q5_0, q4_k, q5_k, q6_k
  • Byte-identical transcripts to NeMo across all published checkpoints (word error rate of 0 against NeMo)
  • Per-word timestamps and confidence scores in JSON output
  • Cache-aware streaming with end-of-utterance detection
  • Flat C API for embedding in other applications or language bindings

Performance: parakeet.cpp vs NeMo vs Whisper

The parakeet.cpp README includes benchmarks against NeMo's PyTorch implementation on a 20-core x86 CPU using 8 threads:

Quantization   Model size vs f32   CPU speedup vs NeMo             Accuracy
f32            100%                1.11x to 1.69x (median 1.40x)   Byte-identical
f16            57%                 up to 1.70x                     Near-lossless
q8_0           37%                 up to 1.86x                     Near-lossless

On GPU (tested on NVIDIA GB10), the median speedup is 1.25x over NeMo, with peaks up to 4.3x on the large TDT models. RAM usage is roughly half of NeMo's across all configurations.
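
The size ratios quoted from the README can be turned into a rough disk-space estimate. The sketch below assumes an f32 file takes about 4 bytes per parameter and ignores GGUF metadata and tokenizer data, so treat the results as ballpark figures rather than measured file sizes. The function name is hypothetical.

```python
# Size relative to f32, as quoted from the parakeet.cpp README benchmarks.
SIZE_RATIO_VS_F32 = {"f32": 1.00, "f16": 0.57, "q8_0": 0.37}


def estimate_gguf_gib(param_count, quant):
    """Ballpark GGUF size in GiB for a given parameter count and format."""
    f32_bytes = param_count * 4  # assumption: 4 bytes per f32 parameter
    return f32_bytes * SIZE_RATIO_VS_F32[quant] / 2**30


for name, params in [("110M", 110_000_000), ("0.6B", 600_000_000), ("1.1B", 1_100_000_000)]:
    sizes = ", ".join(
        f"{quant} ~{estimate_gguf_gib(params, quant):.2f} GiB"
        for quant in SIZE_RATIO_VS_F32
    )
    print(f"{name}: {sizes}")
```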

For comparison with Whisper: the 110M parakeet.cpp model outperforms whisper.cpp at the base.en scale. Larger Parakeet variants match or exceed Whisper's transcription accuracy on English audio while running entirely through the ggml stack.

How to Install and Run parakeet.cpp

parakeet.cpp builds from source with CMake. The following steps cover a basic CPU setup for transcription:

  1. Clone with submodules:
    git clone --recursive https://github.com/mudler/parakeet.cpp
  2. Build the project:
    cmake -B build && cmake --build build -j
    Add -DPARAKEET_GGML_CUDA=ON for NVIDIA GPU, or -DPARAKEET_GGML_METAL=ON for Apple Silicon. Vulkan and ROCm flags are also available.
  3. Download model weights:
    Get checkpoint files from NVIDIA's HuggingFace org or NGC catalog. The 0.6B TDT model is a strong starting point for most transcription tasks.
  4. Convert to GGUF:
    Run the provided Python conversion script once. This is the only step that requires Python. Inference itself runs without it. The script outputs a .gguf file.
  5. Quantize (optional):
    Use the CLI quantization tool to reduce file size. q8_0 is the recommended balance: 37% of the f32 size, up to 1.86x faster than NeMo on CPU, and near-lossless accuracy.
  6. Transcribe audio:
    ./build/parakeet-cli -m model.gguf audio.wav
    Add --timestamps for per-word timing, or --json for structured output with confidence scores.
  7. Streaming (optional):
    For real-time transcription, use the streaming CLI variant with the parakeet_realtime_eou_120m-v1 model, which is specifically designed for cache-aware processing with end-of-utterance detection.
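
For batch jobs, the transcription step above can be wrapped in a small script. This sketch only uses the binary path and the -m, --timestamps, and --json flags mentioned in the steps; placing the flags after the audio path and reading the transcript from standard output are assumptions, so check the CLI's own help text before relying on it. The helper names are hypothetical.

```python
import subprocess


def build_command(model_path, audio_path, timestamps=False, json_output=False,
                  binary="./build/parakeet-cli"):
    """Assemble the argument list for one transcription run."""
    cmd = [binary, "-m", model_path, audio_path]
    if timestamps:
        cmd.append("--timestamps")  # per-word timing
    if json_output:
        cmd.append("--json")  # structured output with confidence scores
    return cmd


def transcribe(model_path, audio_path, **options):
    """Run the CLI and return whatever it writes to stdout (assumed)."""
    result = subprocess.run(
        build_command(model_path, audio_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even before the binary has been built.
    print(" ".join(build_command("model.gguf", "audio.wav", json_output=True)))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.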

Local inference speed also depends on the rest of the hardware and driver stack. CUDA 13.3 delivered a 15% LLM inference speedup, which may carry over to ggml-based tools such as parakeet.cpp when running on NVIDIA hardware.

Creator Use Cases

For audio and video creators, parakeet.cpp unlocks accurate local transcription without sending audio to an external API. Practical applications include:

  • Podcast transcription: High-accuracy transcripts from recorded interviews processed locally, faster than NeMo on the same hardware, with no usage limits or per-minute fees
  • Video captioning: Per-word timestamps with confidence scores enable frame-accurate subtitle generation without round-tripping audio through a cloud service
  • Real-time captioning: The 120M streaming model with end-of-utterance detection can drive live captioning for sessions or voice-note workflows, embedded via the C API
  • Confidential audio processing: Client calls, financial discussions, or any audio that cannot leave your infrastructure can be transcribed entirely on local hardware
  • Multilingual workflows: The TDT v3 multilingual model extends coverage beyond English for teams working across languages
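
As an illustration of the video captioning case, the sketch below turns per-word timestamps into SubRip (.srt) subtitle cues. The input shape, a list of records with "word", "start", and "end" in seconds, is a hypothetical stand-in; the actual JSON schema emitted by parakeet.cpp is not described here and may differ. The grouping rule (at most seven words per cue, new cue after a pause longer than 0.8 seconds) is an arbitrary choice.

```python
def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp SubRip expects."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7, max_gap=0.8):
    """Group timed words into cues and render them as SRT text."""
    cues, current = [], []
    for word in words:
        pause = word["start"] - current[-1]["end"] if current else 0.0
        if current and (len(current) >= max_words or pause > max_gap):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w["word"] for w in cue)
        span = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{index}\n{span}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical word records for demonstration.
sample = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.5, "end": 2.9},
]
print(words_to_srt(sample))
```

With the sample input, the 1.6-second pause before "again" starts a second cue.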

Supported Model Lineup

parakeet.cpp covers the full NVIDIA Parakeet family at launch, with all checkpoints validated at word error rate 0 against NeMo:

  • parakeet-ctc-110m, parakeet-ctc-0.6b, parakeet-ctc-1.1b
  • parakeet-rnnt-0.6b, parakeet-rnnt-1.1b
  • parakeet-tdt-0.6b-v2 (English), parakeet-tdt-0.6b-v3 (multilingual), parakeet-tdt-1.1b
  • parakeet-tdt_ctc-110m, parakeet-tdt_ctc-1.1b (hybrid decoders)
  • parakeet_realtime_eou_120m-v1 (dedicated streaming model)
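
The "word error rate 0 against NeMo" claim can be spot-checked on your own audio by comparing the two transcripts. Below is a minimal sketch of the standard WER calculation, word-level edit distance divided by the number of reference words, treating the NeMo transcript as the reference. It is a generic implementation, not the project's validation script.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    if not ref:
        return float(bool(hyp))
    return prev[-1] / len(ref)


nemo_text = "the quick brown fox"
print(word_error_rate(nemo_text, "the quick brown fox"))  # 0.0
print(word_error_rate(nemo_text, "the quick brown box"))  # 0.25
```

Note that WER 0 is a slightly weaker statement than byte-identical output: this function ignores whitespace differences, while a byte-for-byte comparison would not.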

Frequently Asked Questions

What makes Parakeet different from Whisper?

NVIDIA Parakeet uses CTC, RNNT, and TDT decoder architectures tuned for production ASR workloads. In parakeet.cpp benchmarks, the 110M model outpaces whisper.cpp at the base.en scale, and the 1.1B variant matches or exceeds Whisper's accuracy on English audio. Parakeet is English-focused with limited multilingual coverage, whereas Whisper was trained across 99 languages. If you work primarily with English audio and want the best accuracy-per-compute tradeoff, Parakeet is worth evaluating alongside Whisper.

Do I need Python to use parakeet.cpp?

Python is only needed once: to convert NeMo checkpoint files to GGUF format. After that, all inference runs through the compiled C++ binary with no Python or PyTorch dependency. If pre-converted GGUF files become available through the community (similar to the GGUF model ecosystem around llama.cpp), even that step would be optional.

Which model size should I start with?

The 0.6B TDT model is a strong general-purpose starting point. It offers accuracy close to the 1.1B model at lower compute and memory cost. For real-time and streaming use cases, use the dedicated 120M streaming model. The 1.1B models are worth the overhead for batch transcription where accuracy matters more than speed.

Is parakeet.cpp production-ready?

The project launched May 28, 2026 with 5 commits. It is early-stage open source software. The WER 0 validation shows that the core inference matches NeMo's output, but the C API surface may change before a stable release. Treat it as beta software: suitable for local experimentation and internal tooling, with caution warranted before embedding in customer-facing pipelines.

Can I use it on Apple Silicon?

Yes. Build with -DPARAKEET_GGML_METAL=ON to enable Metal GPU acceleration on Apple Silicon Macs. CPU-only builds work on any ARM or x86 machine without additional configuration.

Where do I download model weights?

NVIDIA distributes Parakeet checkpoints through HuggingFace and the NVIDIA NGC catalog. Model weights are governed by NVIDIA's Parakeet license terms, separate from parakeet.cpp's MIT license.