MiniCPM-V 4.6: 1.3B Vision Model Runs on Phones

Chinese open-source lab OpenBMB shipped MiniCPM-V 4.6 on May 11: a 1.3B-parameter vision-language model, released under Apache 2.0, that runs natively on iPhone, Android, and HarmonyOS phones. The model accepts text, image, and video inputs and now leads the Artificial Analysis Intelligence Index among open-weight models under 2B parameters, scoring 13 to Qwen3.5-0.8B's 10 while using 19x fewer output tokens.

What this enables for creators

Small VLMs that run offline on a phone unlock workflows that previously required a cloud API call per asset. With MiniCPM-V 4.6 loaded via Ollama or llama.cpp on a recent iPhone or Android device, a creator can batch-caption a folder of screenshots, transcribe text from frames of a long video, generate alt-text for an entire portfolio without uploading anything, or build a private asset-tagging app that processes footage on-device. The model handles up to 128 video frames per inference, which covers most short-form clips for description and OCR passes.
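As a rough illustration of the batch-captioning workflow, here is a minimal Python sketch that sends each image in a folder to a locally running Ollama server through its /api/generate endpoint. The model tag, the prompt, and the helper names are assumptions for illustration; check which tag your Ollama install actually lists for MiniCPM-V 4.6 before running it.

```python
import base64
import json
import urllib.request
from pathlib import Path

# Ollama's default local endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Hypothetical model tag, not confirmed by the release notes.
MODEL_TAG = "minicpm-v4.6"


def build_caption_request(image_bytes: bytes,
                          prompt: str = "Describe this image in one sentence.") -> dict:
    """Build the JSON body for a single non-streaming caption request."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def caption_image(path: Path) -> str:
    """Send one image to the local server and return the caption text."""
    body = json.dumps(build_caption_request(path.read_bytes())).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()


def caption_folder(folder: str, exts=(".png", ".jpg", ".jpeg")) -> dict:
    """Caption every matching image in a folder, keyed by file name."""
    captions = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in exts:
            captions[path.name] = caption_image(path)
    return captions
```

Calling caption_folder("screenshots") would return a dictionary of file names to captions without any image leaving the device, provided the server is running and the model is already pulled.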

Why it matters

Sub-2B multimodal models that match 2B-class accuracy and run on phones change the economics of creator tooling. A creator who tags 10,000 thumbnails through a cloud VLM pays per image and ships the data offsite; the same job on MiniCPM-V 4.6 runs locally with no per-image fee and no upload. The result is a viable open alternative to closed mobile VLMs from Google and Apple for image-understanding work that does not need a frontier reasoning model.

Key details

MiniCPM-V 4.6 is built on the SigLIP2-400M vision encoder and the Qwen3.5-0.8B language backbone. The architecture introduces a mixed 4x/16x visual token compression scheme that cuts visual encoding FLOPs by more than 50 percent and improves token throughput by roughly 1.5x compared to Qwen3.5-0.8B. The model reaches Qwen3.5-2B level scores on OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench, and leads sub-2B models on MMMU-Pro at 38 percent. Deployment surfaces include vLLM, SGLang, llama.cpp (GGUF), Transformers, and native iOS, Android, and HarmonyOS apps from the OpenBMB apps repo. Fine-tuning is supported through LLaMA-Factory and ms-swift, and the full release sits on Artificial Analysis for verified benchmarking.
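To give a feel for what 4x and 16x compression mean for token counts, the sketch below works through the arithmetic in Python. The 14-pixel patch size and the 448x448 input are assumptions chosen for illustration, and how the model decides which ratio to apply is not described here, so the code shows each ratio on its own.

```python
def patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens a ViT-style encoder produces for an image.

    The 14-pixel patch size is an assumption, not a confirmed model setting.
    """
    return (width // patch) * (height // patch)


def compressed_tokens(n_tokens: int, ratio: int) -> int:
    """Tokens left after compressing by the given ratio (ceiling division)."""
    return -(-n_tokens // ratio)


raw = patch_tokens(448, 448)            # 32 * 32 patches
at_4x = compressed_tokens(raw, 4)       # lighter compression
at_16x = compressed_tokens(raw, 16)     # heavier compression
video_budget = 128 * at_16x             # 128 frames, all at 16x

print(raw, at_4x, at_16x, video_budget)  # prints: 1024 256 64 8192
```

Under these assumptions a single frame drops from 1,024 visual tokens to 256 or 64, which is why a 128-frame clip can fit in a context window that a small language backbone can handle.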

What to do next

If you build creator tools, pull the GGUF weights from Hugging Face and benchmark MiniCPM-V 4.6 against whatever cloud VLM you currently call for captioning, OCR, or moderation. If you publish on mobile, test the iOS or Android demo app on a recent device and decide whether your captioning or accessibility features can move on-device. Creators who already run Liquid AI LFM2.5 or other edge models on their phones can add MiniCPM-V 4.6 as the multimodal layer of the same local stack.
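A side-by-side benchmark can start as small as the Python sketch below, which runs the same images through several captioning backends and records mean latency plus the raw outputs for manual review. The backend callables are placeholders you would supply yourself (for example, one wrapping a local MiniCPM-V 4.6 server and one wrapping your current cloud API); nothing here is specific to either.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(backends: Dict[str, Callable[[bytes], str]],
              images: List[bytes]) -> Dict[str, dict]:
    """Run every backend over the same images and collect timing and outputs."""
    if not images:
        raise ValueError("need at least one image to benchmark")
    report = {}
    for name, caption_fn in backends.items():
        latencies, outputs = [], []
        for image in images:
            start = time.perf_counter()
            outputs.append(caption_fn(image))
            latencies.append(time.perf_counter() - start)
        report[name] = {
            "mean_latency_s": mean(latencies),
            "outputs": outputs,
        }
    return report
```

Latency alone does not settle the question, so keep the outputs and compare caption quality on a sample of your own assets before moving a feature on-device.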
