Alibaba's Qwen team just released Qwen 3.5 Small, a series of four open-source models ranging from 0.8B to 9B parameters that run on phones, IoT devices, and browser tabs. The smallest model operates on a seven-year-old Samsung Galaxy S10E at 12 tokens per second. The largest outperforms OpenAI's gpt-oss-120B on benchmarks.
What Happened
On March 3, 2026, Alibaba launched the Qwen 3.5 Small model series with four size variants: 0.8B, 4B, 7B, and 9B parameters. All models are fully open-source and optimized specifically for edge deployment on resource-constrained hardware.
The lineup targets a gap in the AI market: high-capability models that actually run on consumer devices without cloud dependencies. Alibaba released the models simultaneously on Hugging Face, Ollama, and LM Studio, making them immediately accessible to developers and creators.
Why It Matters
For creators, local AI means three things: no API costs, no data leaving your device, and no internet requirement. A 4B model that matches the performance of models eight times its size on agent tasks opens real possibilities for building AI-powered creative workflows that run entirely on a laptop or phone. This mirrors the efficiency gains seen in IBM's Granite 4.0 Speech, which halved its parameter count while improving accuracy.
The 0.8B model runs directly in the browser via WebGPU using Transformers.js. That means web applications can embed genuine language-model capabilities without any backend infrastructure. Creators building interactive tools, writing assistants, or AI-enhanced websites can now ship models alongside their code.
Running AI on a Galaxy S10E from 2019 at usable speeds also signals where mobile creative tools are heading. Expect on-device AI features in photo editors, video tools, and music apps that work offline and process data locally.
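The 12 tokens-per-second figure translates directly into user-facing wait times, which is the number that matters when you design an offline feature around it. A quick back-of-the-envelope helper (the 12 tok/s rate is the only number taken from the article; everything else is arithmetic):

```typescript
// How long a generation of `tokens` tokens takes at a given rate.
function secondsToGenerate(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

// At the S10E's reported 12 tokens/second, a 600-token draft
// takes about 50 seconds -- fine for a background task, too slow
// for keystroke-level autocomplete.
const waitSeconds = secondsToGenerate(600, 12); // 50
```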
Key Details
- Four models: 0.8B, 4B, 7B, and 9B parameters, all open-source
- 4B model: Matches performance of models with 8x more parameters on agent benchmarks
- 9B model: Outperforms OpenAI's gpt-oss-120B with less than a tenth of the parameters
- 0.8B model: Runs in-browser via WebGPU (Transformers.js) and on a Samsung S10E at 12 tokens/second
- Context window: Up to 1 million tokens, enabling processing of entire books or large codebases
- Native multimodality: Built-in support for multiple input types beyond text
- Hybrid attention: Combines standard attention with Gated DeltaNet linear attention (roughly 75% of layers are linear in the larger variants) for efficient inference
- Availability: Hugging Face, Ollama, and LM Studio from day one
What to Do Next
If you build creative tools or workflows, test the 4B model first. It offers the best balance of capability and efficiency for most creator use cases. Download it through Ollama for the simplest local setup, or grab the weights from Hugging Face for custom integrations.
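Once the model is pulled through Ollama, you can script against its local REST API (`POST /api/generate` on port 11434, Ollama's standard endpoint). A minimal sketch, noting that the model tag below is an assumption; run `ollama list` to see the name the 4B model actually ships under:

```typescript
// Sketch: querying a locally running Ollama server.
const MODEL = "qwen3.5-small:4b"; // hypothetical tag -- check `ollama list`

// Build the JSON body for Ollama's /api/generate endpoint.
// stream: false returns the full completion in one response.
function buildRequest(model: string, prompt: string) {
  return { model, prompt, stream: false };
}

async function ask(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(MODEL, prompt)),
  });
  const data: any = await res.json();
  return data.response; // non-streaming replies arrive in this field
}
```

Because everything stays on localhost, this works offline and costs nothing per call.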
For browser-based projects, experiment with the 0.8B model via Transformers.js. The WebGPU support means you can prototype AI-powered web tools without spinning up any servers.
The 1M-token context window in these small models is particularly useful for creators working with long-form content, scripts, or documentation. Try feeding entire project files into the 7B or 9B model to see how it handles context-heavy creative tasks on your own hardware. For comparison with other open-source models from Chinese labs, see GLM-5's 744B-parameter model running on Huawei chips.
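Feeding whole projects into the context window still benefits from a budget check before you hit the limit. A rough sketch, assuming the common ~4-characters-per-token heuristic (an approximation; use the model's real tokenizer for anything precise):

```typescript
import { readFileSync } from "node:fs";

// Rough heuristic: ~4 characters per English-text token.
// This is an assumption, not a figure from the article.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Concatenate files until the estimated token count would exceed
// the budget, leaving the rest of the 1M window for the reply.
function packContext(paths: string[], budget = 1_000_000): string {
  let context = "";
  for (const path of paths) {
    const chunk = `\n--- ${path} ---\n` + readFileSync(path, "utf8");
    if (estimateTokens(context + chunk) > budget) break;
    context += chunk;
  }
  return context;
}
```

Dropping files that overflow the budget is the simplest policy; a smarter packer might summarize or chunk them instead.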