OpenBMB on May 25 released MiniCPM5-1B, a 1.08B parameter dense transformer the team claims is the new state of the art for open-source models in its size class. The model runs on CPU, supports a 131K token context, ships under Apache 2.0, and beats the 2B-scale Qwen3.5-2B on small-model benchmarks.

Try It: One-Line Local Deploy

The Apache 2.0 license and the GGUF release make this an unusually easy add to an existing local stack. If you already run llama.cpp, Ollama, or LM Studio, pull the GGUF weights and route any task you currently send to a 2B or 3B local model through MiniCPM5-1B for an immediate footprint cut. For Apple Silicon, the MLX 4-bit variant runs on M-series laptops without a discrete GPU. The team also shipped MiniCPM-Desk-Pet, a local-first desktop companion that reacts to coding activity from Cursor, Claude Code, and Codex.

Why It Matters

The interesting claim is not raw capability, it is the size and the deployment story. A 1B model that scores above a 2B model on the same benchmark inverts the usual scaling assumption. On the Artificial Analysis index for small models, MiniCPM5-1B scores 17.9 against Qwen3.5-2B's 16.3. The architecture is a stock LlamaForCausalLM with grouped-query attention (16 Q heads, 2 KV heads) and no custom kernels, which means it runs anywhere PyTorch or llama.cpp runs. The win comes from training, specifically a three-stage pipeline that ends in reinforcement learning plus on-policy distillation, gaining 16 points on average across math, code, and instruction-following.

Key Details

MiniCPM5-1B ships in three checkpoint flavors. The default is the post-RL release, with a base model and an SFT-only variant available for researchers who want to fine-tune from earlier stages. Tool calling is built in with an XML-style format and a native SGLang parser. Hybrid reasoning is selectable through an enable_thinking flag on the same checkpoint, so one model serves both fast chat and chain-of-thought inference. The published MiniCPM4 paper documents the on-device efficiency techniques carried into the new generation, and the GitHub repository includes the training recipe and FlagOS multi-chip backends for Nvidia, Ascend, Kunlunxin, Hygon, and Metax accelerators.

What to Do Next

Run a side-by-side test against your current local default. If you currently rely on Gemma 4 2B for local tool calling, swap in MiniCPM5-1B and measure latency, RAM, and tool-call success on the same prompts. For agentic loops where small context dominates, the 131K window and the SGLang parser are the features to compare. The model is small enough to test in an afternoon and permissively licensed enough to ship in a product the same week.