Alibaba's Qwen team just released Qwen 3.5 Small, a series of four open-source models ranging from 0.8B to 9B parameters that run on phones, IoT devices, and browser tabs. The smallest model operates on a seven-year-old Samsung Galaxy S10E at 12 tokens per second. The largest outperforms OpenAI's gpt-oss-120B on benchmarks.
What Happened
On March 3, 2026, Alibaba launched the Qwen 3.5 Small model series with four size variants: 0.8B, 4B, 7B, and 9B parameters. All models are fully open-source and optimized specifically for edge deployment on resource-constrained hardware.
The lineup targets a gap in the AI market: high-capability models that actually run on consumer devices without cloud dependencies. Alibaba released the models simultaneously on Hugging Face, Ollama, and LM Studio, making them immediately accessible to developers and creators.
Why It Matters
For creators, local AI means three things: no API costs, no data leaving your device, and no internet requirement. A 4B model that matches the performance of models eight times its size on agent tasks opens real possibilities for building AI-powered creative workflows that run entirely on a laptop or phone. This mirrors the efficiency gains seen in IBM's Granite 4.0 Speech, which halved its parameter count while improving accuracy.
The 0.8B model runs directly in the browser via WebGPU using Transformers.js. That means web applications can embed genuine language-model capabilities without any backend infrastructure. Creators building interactive tools, writing assistants, or AI-enhanced websites can now ship models alongside their code.
Running AI on a Galaxy S10E from 2019 at usable speeds also signals where mobile creative tools are heading. Expect on-device AI features in photo editors, video tools, and music apps that work offline and process data locally.
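The 12 tokens-per-second figure translates directly into user-facing wait times, which is the number that matters when you design an offline feature around it. A quick back-of-the-envelope helper (the 12 tok/s rate is the only number taken from the article; everything else is arithmetic):

```typescript
// How long a generation of `tokens` tokens takes at a given rate.
function secondsToGenerate(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

// At the S10E's reported 12 tokens/second, a 600-token draft
// takes about 50 seconds -- fine for a background task, too slow
// for keystroke-level autocomplete.
const waitSeconds = secondsToGenerate(600, 12); // 50
```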
Key Details
- Four models: 0.8B, 4B, 7B, and 9B parameters, all open-source
- 4B model: Matches performance of models with 8x more parameters on agent benchmarks
- 9B model: Outperforms OpenAI's gpt-oss-120B with less than a tenth of the parameters
- 0.8B model: Runs in-browser via WebGPU (Transformers.js) and on a Samsung S10E at 12 tokens/second
- Context window: Up to 1 million tokens, enabling processing of entire books or large codebases
- Native multimodality: Built-in support for multiple input types beyond text
- Hybrid attention: Combines standard attention with Gated DeltaNet linear attention (roughly 75% of layers are linear in the larger variants) for efficient inference
- Availability: Hugging Face, Ollama, and LM Studio from day one
What to Do Next
If you build creative tools or workflows, test the 4B model first. It offers the best balance of capability and efficiency for most creator use cases. Download it through Ollama for the simplest local setup, or grab the weights from Hugging Face for custom integrations.
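Once the model is pulled through Ollama, you can script against its local REST API (`POST /api/generate` on port 11434, Ollama's standard endpoint). A minimal sketch, noting that the model tag below is an assumption; run `ollama list` to see the name the 4B model actually ships under:

```typescript
// Sketch: querying a locally running Ollama server.
const MODEL = "qwen3.5-small:4b"; // hypothetical tag -- check `ollama list`

// Build the JSON body for Ollama's /api/generate endpoint.
// stream: false returns the full completion in one response.
function buildRequest(model: string, prompt: string) {
  return { model, prompt, stream: false };
}

async function ask(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(MODEL, prompt)),
  });
  const data: any = await res.json();
  return data.response; // non-streaming replies arrive in this field
}
```

Because everything stays on localhost, this works offline and costs nothing per call.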
For browser-based projects, experiment with the 0.8B model via Transformers.js. The WebGPU support means you can prototype AI-powered web tools without spinning up any servers.
The 1M-token context window in these small models is particularly useful for creators working with long-form content, scripts, or documentation. Try feeding entire project files into the 7B or 9B model to see how it handles context-heavy creative tasks on your own hardware. For comparison with other open-source models from Chinese labs, see GLM-5's 744B-parameter model running on Huawei chips.
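Feeding whole projects into the context window still benefits from a budget check before you hit the limit. A rough sketch, assuming the common ~4-characters-per-token heuristic (an approximation; use the model's real tokenizer for anything precise):

```typescript
import { readFileSync } from "node:fs";

// Rough heuristic: ~4 characters per English-text token.
// This is an assumption, not a figure from the article.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Concatenate files until the estimated token count would exceed
// the budget, leaving the rest of the 1M window for the reply.
function packContext(paths: string[], budget = 1_000_000): string {
  let context = "";
  for (const path of paths) {
    const chunk = `\n--- ${path} ---\n` + readFileSync(path, "utf8");
    if (estimateTokens(context + chunk) > budget) break;
    context += chunk;
  }
  return context;
}
```

Dropping files that overflow the budget is the simplest policy; a smarter packer might summarize or chunk them instead.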