The llama.cpp project shipped audio multimodal support in release b8769 on April 12, adding Qwen3-Omni and Qwen3-ASR model compatibility. This marks the first time llama.cpp handles audio input alongside text and vision, enabling local speech recognition and audio understanding on consumer hardware without cloud APIs.

What Happened

Release b8769 extends llama.cpp's multimodal toolkit (mtmd) from vision-only to audio-capable. The update adds inference support for Qwen3-Omni, Alibaba's end-to-end multimodal model that processes text, images, audio, and video in a single architecture, and for Qwen3-ASR, a dedicated automatic speech recognition model. The implementation involved audio model conversion, dependency cleanup, and testing across multiple build targets.

Why It Matters

Audio has been the missing modality in local AI workflows. Until now, running speech recognition or audio analysis locally meant separate tools and separate models. With Qwen3-Omni in llama.cpp, creators get a single model that understands speech, music, and sound effects alongside text and images. Everything runs offline on personal hardware.

This builds on the multi-GPU tensor parallelism shipped four days earlier. Combined, the two features let creators with multi-GPU setups run large multimodal models that process audio, vision, and text simultaneously at practical inference speeds.

Key Details

Supported models. Qwen3-Omni-30B-A3B-Instruct uses a mixture-of-experts architecture with 30B total and 3B active parameters per token, handling text, image, audio, and video input. Qwen3-ASR handles dedicated speech-to-text. Both run quantized on consumer GPUs with GGUF-format weights.
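For a back-of-envelope sense of what "runs quantized on consumer GPUs" means, a 30B-parameter model at roughly 4.5 bits per weight (a Q4_K_M-class quant; the exact bits-per-weight figure is an assumption, and KV cache plus the multimodal projector add overhead on top):

```shell
# Rough weight footprint: 30e9 params * ~4.5 bits / 8 bits-per-byte, in GB.
awk 'BEGIN { printf "%.1f\n", 30e9 * 4.5 / 8 / 1e9 }'   # ~16.9 GB
```

That puts the weights within reach of a single 24 GB card, and since only 3B parameters are active per token, generation is far faster than a dense 30B model.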

Audio capabilities. The Qwen3-Omni model supports multilingual speech recognition across 100+ languages, audio captioning, music analysis including style and genre identification, sound effect description, and speech-to-text translation. When Qwen3-Omni launched in March, its speech recognition covered 113 languages and dialects.

Platform support. Binaries ship for macOS (Apple Silicon and Intel), Linux (CPU, Vulkan, ROCm, OpenVINO), Windows (CPU, CUDA 12/13, Vulkan, SYCL), and iOS via XCFramework. Run llama-mtmd-cli for multimodal inference, including audio input.
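A sketch of the invocation, with hypothetical file names; check llama-mtmd-cli --help on your build for the exact audio flag, as the flag names below are assumptions:

```shell
# Transcribe a local audio file with Qwen3-Omni.
#   -m       : the language-model GGUF (hypothetical filename)
#   --mmproj : the multimodal projector GGUF that encodes the audio
#   --audio  : the input clip to attach to the prompt
./llama-mtmd-cli \
  -m qwen3-omni-30b-a3b-instruct-q4_k_m.gguf \
  --mmproj mmproj-qwen3-omni.gguf \
  --audio recording.wav \
  -p "Transcribe the speech in this clip."
```

The same pattern covers the other audio tasks: swap the prompt for "Describe the music style" or "What sound effects are present?".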

What to Do Next

Download llama.cpp b8769 from the GitHub releases page. Grab the Qwen3-Omni GGUF weights from Hugging Face. For multi-GPU acceleration, combine it with the tensor parallelism feature from release b8750.
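As one possible end-to-end setup, under stated assumptions: the Hugging Face repo path is a placeholder, and the split-mode and tensor-split flags shown are llama.cpp's generic multi-GPU options, not confirmed by this release's notes:

```shell
# Fetch GGUF weights (replace <org>/<repo> with the actual model repo).
# Requires huggingface-cli: pip install huggingface_hub
huggingface-cli download <org>/<repo> --local-dir models/

# Run across two GPUs with llama.cpp's split options:
#   -sm row : row-wise split mode (tensor-parallel style)
#   -ts 1,1 : split tensors evenly across GPU 0 and GPU 1
./llama-mtmd-cli -m models/model.gguf --mmproj models/mmproj.gguf \
  --audio clip.wav -p "Describe this audio." \
  -sm row -ts 1,1
```

Single-GPU users can drop the -sm and -ts flags; llama.cpp will offload as many layers as fit with -ngl.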