OpenMOSS published the MOSS-Audio technical report on June 1, 2026, documenting four open-source audio-language models that achieve benchmark scores rivaling systems three to four times their size. The 8B variant outperforms Qwen3-Omni-30B across all measured general audio tasks and beats Gemini-3.1-Pro on timestamped speech recognition by a factor of five on English and twenty-three on Chinese.
What Happened

The OpenMOSS team at MOSI.AI and Shanghai Innovation Institute released the MOSS-Audio model family on GitHub in April 2026 with an initial set of weights. The June 1 technical report formalizes benchmarks across eight evaluation suites, making this the first comprehensive public comparison of MOSS-Audio against production-grade audio models from Alibaba, Google, and Step-AI.
Four models are now available for download:
- MOSS-Audio-4B-Instruct: Direct instruction following at 4.6B parameters
- MOSS-Audio-4B-Thinking: Chain-of-thought reasoning at 4.6B parameters
- MOSS-Audio-8B-Instruct: Direct instruction following at 8.6B parameters
- MOSS-Audio-8B-Thinking: Chain-of-thought reasoning at 8.6B parameters
Why It Matters for Creators
Audio-language models have largely been locked behind proprietary APIs or required 30B-plus parameter counts to achieve usable accuracy. MOSS-Audio changes that calculus. At 8B parameters it runs on a single 24GB consumer GPU, while delivering results competitive with closed models that cost money to access and offer no fine-tuning access.
The practical impact spans several creator workflows. Podcast editors can generate time-stamped transcripts with word-level accuracy that outperforms commercial tools. Music producers can ask the model to identify specific instruments, describe tonal qualities, or locate sections by time. Sound designers working with field recordings can caption environmental audio without manual listening sessions. Any creator working with voice content gets a local, private, customizable alternative to cloud speech APIs.
This also matters for the broader audio AI toolchain: as audio understanding models improve, pipelines that combine understanding with generation become viable. MOSS-Audio gives that pipeline an open-source anchor at the comprehension end.
Architecture: How It Works

MOSS-Audio uses a three-stage modular architecture. A dedicated audio encoder converts raw waveforms into continuous temporal representations at 12.5 Hz. A modality adapter projects those representations into the embedding space of the language model backbone. The language model then generates text autoregressively.
Two technical choices separate MOSS-Audio from prior open-source audio-language models:
DeepStack cross-layer feature injection exposes the language model decoder to acoustic features from multiple encoder depths simultaneously, rather than just the final encoder layer. This preserves both low-level acoustic detail and high-level semantic structure, which explains the model's strength on tasks requiring both acoustic precision (singing transcription, dialect recognition) and semantic understanding (audio captioning, question answering).
Time markers insert explicit timestamp tokens into the audio token stream before the language model processes them. Rather than learning temporal alignment as a post-processing step, the model treats timestamps as first-class tokens, which is why its timestamped ASR accuracy is dramatically higher than competitors that add timestamps via heuristics after generation.
Training used an event-preserving audio annotation pipeline that segments recordings at natural event boundaries and generates branch-specific captions covering speech content, music characteristics, and environmental sounds. Multi-stage post-training then optimized instruction following and chain-of-thought reasoning separately, producing the Instruct and Thinking variant pairs.
Benchmark Results
The technical report covers four benchmark categories. The numbers below compare the top MOSS-Audio variant against the best-performing competitor in each category.
| Benchmark | MOSS-Audio-8B-Thinking | Qwen3-Omni-30B | Gemini-3.1-Pro |
|---|---|---|---|
| MMAU (general audio) | 77.33 | lower | lower |
| MMAU-Pro | 64.92 | lower | lower |
| Average CER (ASR) | 11.30 | N/A | N/A |
| Timestamp ASR (LibriSpeech) | 131.61 AAS | 646.95 AAS | 871.19 AAS |
| Timestamp ASR (AISHELL-1) | 35.77 AAS | 833.66 AAS | 708.24 AAS |
AAS (Average Absolute Shift) measures how far predicted word timestamps deviate from ground truth in milliseconds. Lower is better. The 35.77 versus 833.66 gap on AISHELL-1 means MOSS-Audio places Chinese words roughly 24 times closer to their true timestamps than Qwen3-Omni-30B. On speech captioning, the 8B-Instruct variant leads in 11 of 13 fine-grained dimensions including accent, pitch, timbre, and fluency. A detailed breakdown is available in the benchmark analysis on Dev.to. The initial model release was covered by MarkTechPost when the weights first shipped in April 2026.
The step-up from 4B to 8B improves Thinking variant performance more than Instruct, suggesting the larger architecture especially benefits from chain-of-thought reasoning on complex audio tasks. The 4B-Instruct remains competitive for straightforward transcription and captioning at roughly half the VRAM requirement.
How to Run MOSS-Audio Locally

The model runs on any system with a CUDA-capable GPU and at least 12GB VRAM for the 4B variants or 24GB for the 8B. The Gradio demo app is included in the repository.
- Clone the repository:
git clone https://github.com/OpenMOSS/MOSS-Audio - Create a Python environment and install dependencies:
pip install -r requirements.txt - Download a model from HuggingFace. For the 4B Instruct variant:
huggingface-cli download OpenMOSS-Team/MOSS-Audio-4B-Instruct - Launch the Gradio demo:
python app.py --model-path OpenMOSS-Team/MOSS-Audio-4B-Instruct - Open the local URL (typically http://localhost:7860) and drag in any audio file.
For production use, the repo includes an SGLang serving configuration that supports batched inference and an OpenAI-compatible API endpoint. This allows MOSS-Audio to slot into existing audio processing pipelines without code changes if those pipelines already call a chat completions endpoint.
Creators working with local AI music tools can chain MOSS-Audio as an understanding layer. Generate audio with one model, pass it to MOSS-Audio to caption or time-stamp it, and use those structured outputs to drive editing decisions or metadata tagging automatically.
What to Do Next
- Download MOSS-Audio-4B-Instruct from HuggingFace for a quick local test on any audio file you have.
- If your workflow involves timestamped transcripts (podcast editing, captioning, subtitle sync), test the 8B-Thinking variant against your current tool on a representative sample.
- For music analysis tasks, try prompting with structured questions: "List all instruments audible between 0:30 and 1:00 and describe the dominant mood."
- Review the full technical report on arXiv (arxiv.org/abs/2606.01802) for a complete breakdown of training methodology and all benchmark suites.
Frequently Asked Questions
What is MOSS-Audio and who made it?
MOSS-Audio is an open-source audio-language model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It processes speech, environmental sounds, and music to produce text outputs including transcriptions, captions, timestamps, and reasoned answers to questions about audio content.
What is the difference between the Instruct and Thinking model variants?
Instruct variants generate direct answers to audio prompts without intermediate reasoning steps, making them faster for tasks like transcription and captioning. Thinking variants use chain-of-thought reasoning before answering, which improves accuracy on complex tasks like multi-hop audio question answering and nuanced music analysis, at the cost of higher latency and token usage.
How does MOSS-Audio compare to Whisper for speech transcription?
MOSS-Audio handles more diverse audio than Whisper, including dialect speech, code-switching, singing, and non-speech sounds, while also supporting natural language questions about audio content. For standard English transcription on clean audio, Whisper Large v3 remains highly optimized and faster. MOSS-Audio's main advantage over Whisper is timestamped ASR accuracy, music and environmental sound understanding, and the ability to answer open-ended questions about audio content.
What hardware do I need to run MOSS-Audio locally?
The 4B variants require approximately 12GB VRAM and run on GPUs like the RTX 3060 or RTX 4070. The 8B variants require approximately 24GB VRAM, suitable for an RTX 4090 or a workstation GPU. CPU inference is possible but significantly slower. The model is not quantized in the official release, though community quantizations may appear on HuggingFace.
Can MOSS-Audio generate audio, or only understand it?
MOSS-Audio is an understanding model only. It takes audio as input and produces text. For audio generation, the OpenMOSS team also maintains MOSS-TTS for speech synthesis. Combining MOSS-Audio for understanding with a generation model like Stable Audio or MOSS-TTS creates a complete audio analysis and production pipeline.
Is MOSS-Audio free for commercial use?
The model weights are available on HuggingFace under the terms of the OpenMOSS license. Check the repository's LICENSE file before commercial deployment, as open-source audio model licenses vary in their commercial use provisions.