Alibaba's Qwen team released Qwen3.5-Omni on March 30, a multimodal AI model that processes text, images, audio, and video while generating real-time speech output in 36 languages. The model ships in three variants (Plus, Flash, and Light) and supports a 256K-token context window, enough to hold over 10 hours of audio or roughly 400 seconds of 720p video.

What Happened

Qwen3.5-Omni is a significant upgrade from last year's Qwen3-Omni. Speech recognition now covers 113 languages and dialects, up from 19 in the predecessor. Speech generation expanded from 10 languages to 36, with voice cloning available via API. The model was trained on over 100 million hours of audio-visual data.

The Plus variant posted state-of-the-art results on 215 benchmarks spanning audio, audio-visual understanding, reasoning, and interaction.

Why It Matters

For creators working across media formats, a single model that understands and generates text, images, audio, and video removes the friction of chaining specialized tools. Voice cloning and semantic interruption (the ability to distinguish meaningful interruptions from background noise) make Qwen3.5-Omni particularly relevant for voice-first creative workflows and real-time interaction.

Qwen3.5-Omni outperformed Gemini 3.1 Flash Live on general audio understanding, reasoning, and translation, while matching it on audio-visual comprehension. It also beat ElevenLabs, GPT-Audio, and MiniMax on multilingual voice stability across 20 languages, positioning it as a serious contender in the speech AI space alongside Mistral's Voxtral TTS.

Key Details

  • Three variants: Plus (highest quality), Flash (balanced speed and quality), Light (lowest latency)
  • Context window: 256,000 tokens
  • Speech recognition: 113 languages and dialects
  • Speech output: 36 languages with voice cloning
  • New features: Semantic interruption, ARIA for accurate number and word pronunciation, native web search and function calling
  • Architecture: MoE-based Thinker-Talker design for low-latency streaming
  • Open-source base: Qwen3-Omni models remain available on GitHub under open licenses
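The native function calling listed above would, in most Qwen deployments, be exposed through an OpenAI-compatible chat endpoint using the standard `tools` schema. The sketch below shows what such a request body could look like; the model id (`qwen3.5-omni-flash`) and the example tool are illustrative assumptions, not confirmed identifiers.

```python
import json

# Hypothetical sketch: a function-calling request in the widely used
# OpenAI-compatible format. The model id and the example tool are
# assumptions for illustration only.

# A tool definition the model may choose to invoke during a conversation.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# The request body a client would POST to /chat/completions.
request_body = {
    "model": "qwen3.5-omni-flash",  # assumed id; check the provider's docs
    "messages": [
        {"role": "user", "content": "What's the weather in Hangzhou today?"}
    ],
    "tools": tools,
}

print(json.dumps(request_body, indent=2))
```

If the model decides a tool is needed, the response carries a `tool_calls` entry with the function name and JSON-encoded arguments, which your code executes before returning the result in a follow-up message.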

What to Do Next

Test Qwen3.5-Omni through Hugging Face demos or the Alibaba Cloud API. The Plus variant is best for production-quality voice and multimodal work, while Light suits real-time conversational applications. Creators building multilingual content pipelines, voice applications, or video analysis workflows should evaluate the model against their current toolchain, particularly if they are already working with the Qwen 3.5 Small models for edge deployment.
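As a starting point for that evaluation, here is a minimal sketch of a multimodal request through an OpenAI-compatible endpoint, the convention Alibaba Cloud's DashScope compatible mode follows. The base URL, model id, and image URL are assumptions; check the current Alibaba Cloud API reference before relying on them.

```python
import os

# Hedged sketch: calling an Omni-style model via an OpenAI-compatible
# endpoint. BASE_URL and MODEL are assumptions for illustration.
BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"  # assumed
MODEL = "qwen3.5-omni-flash"  # assumed model id

# A multimodal message mixing text and an image, using the common
# OpenAI-compatible content-parts format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder
            },
        ],
    }
]

api_key = os.environ.get("DASHSCOPE_API_KEY")
if api_key:
    # Only attempt the network call when credentials are configured.
    from openai import OpenAI  # requires the `openai` package

    client = OpenAI(api_key=api_key, base_url=BASE_URL)
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    print(reply.choices[0].message.content)
else:
    print("Set DASHSCOPE_API_KEY to send the request; payload built above.")
```

Swapping `MODEL` between the Plus, Flash, and Light variants is the quickest way to compare quality against latency for a given workload.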