Chinese open-source lab OpenBMB shipped MiniCPM-V 4.6 on May 11, a 1.3B-parameter vision-language model that runs natively on iPhone, Android, and HarmonyOS phones under Apache 2.0. The model accepts text, image, and video inputs and now leads the Artificial Analysis Intelligence Index for any open-weight model under 2B parameters, scoring 13 against Qwen3.5-0.8B at 10 while using 19x fewer output tokens.
What this enables for creators
Small VLMs that run offline on a phone unlock workflows that previously required a cloud API call per asset. With MiniCPM-V 4.6 loaded via Ollama or llama.cpp on a recent iPhone or Android device, a creator can batch-caption a folder of screenshots, transcribe text from frames of a long video, generate alt-text for an entire portfolio without uploading anything, or build a private asset-tagging app that processes footage on-device. The model handles up to 128 video frames per inference, which covers most short-form clips for description and OCR passes.
Why it matters
Sub-2B multimodal models that match 2B-class accuracy and run on phones change the cost economics of creator tooling. A creator who tags 10,000 thumbnails through a cloud VLM pays per image and ships the data offsite; the same job on MiniCPM-V 4.6 runs locally for free with no upload. The result is a viable open alternative to closed mobile VLMs from Google and Apple for image-understanding work that does not need a frontier reasoning model.
Key details
MiniCPM-V 4.6 is built on the SigLIP2-400M vision encoder and the Qwen3.5-0.8B language backbone. The architecture introduces a mixed 4x/16x visual token compression scheme that cuts visual encoding FLOPs by more than 50 percent and improves token throughput by roughly 1.5x compared to Qwen3.5-0.8B. The model reaches Qwen3.5-2B level scores on OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench, and leads sub-2B models on MMMU-Pro at 38 percent. Deployment surfaces include vLLM, SGLang, llama.cpp (GGUF), Transformers, and native iOS, Android, and HarmonyOS apps from the OpenBMB apps repo. Fine-tuning is supported through LLaMA-Factory and ms-swift, and the full release sits on Artificial Analysis for verified benchmarking.
What to do next
If you build creator tools, pull the GGUF weights from Hugging Face and benchmark MiniCPM-V 4.6 against whatever cloud VLM you currently call for captioning, OCR, or moderation. If you publish on mobile, test the iOS or Android demo app on a recent device and decide whether your captioning or accessibility features can move on-device. Creators who already run Liquid AI LFM2.5 or other edge models on their phones can add MiniCPM-V 4.6 as the multimodal layer of the same local stack.