H Company released Holo 3.1 on June 1, an open-weights computer-use agent family that runs locally on consumer hardware and adds mobile control, function calling, and NVIDIA-optimized quantization. All four model sizes ship under Apache 2.0, and the 35B variant now hits 79.3% on AndroidWorld, up from 67% in Holo 3.
How to integrate this
Holo 3.1 is the first release in the family with quantized weights ready for local inference. The 35B-A3B model is available in Q4 GGUF for llama.cpp and Ollama, NVFP4 for NVIDIA RTX cards via Model Optimizer, and FP8 for vLLM. The 0.8B, 4B, and 9B variants run on Apple Silicon laptops and consumer GPUs without quantization. A managed Holo Models API is available for teams that prefer cloud inference, and the GGUF checkpoints work out of the box with any tool that already speaks the Qwen 3 architecture.
The agent ships with native function-calling support alongside the structured JSON outputs from Holo 3, so creators can plug it into existing agent stacks without writing custom output parsers. On a DGX Spark, NVFP4 delivers a 2x end-to-end speedup over BF16, dropping each agent step from 6.8 to 3.3 seconds.
Why It Matters
Computer-use agents have been gated behind cloud APIs for most of the last year, with open alternatives trailing Claude Computer Use and OpenAI Operator on real benchmarks. Holo 3.1 closes that gap on mobile in particular, where the AndroidWorld jump from 58% to 72% on the 4B model means a small local agent can now drive an Android emulator competently enough for app testing, data scraping, or accessibility automation. For creators building tools that need to click, scroll, and type across desktop, browser, and phone, Holo 3.1 is the first open option that ships quantized weights and ships them with a license that allows commercial deployment.
Key Details
The Holo 3.1 family includes four sizes: 0.8B, 4B, 9B, and 35B-A3B (a 35B mixture-of-experts with 3B active parameters). All are based on Qwen and inherit Qwen's image-text-to-text architecture. H Company is the same team behind Runner H, the autonomous browser agent that launched in late 2025. The 35B model in FP8 and NVFP4 lands within 2 points of the full-precision BF16 checkpoint on OSWorld, which is the closest open-weights computer-use model to lossless quantization yet shipped. Apache 2.0 licensing on all checkpoints, including the quantized ones, removes the commercial-use ambiguity that has dogged earlier Qwen-derived agents.
What to Do Next
Pull the Q4 GGUF checkpoint and run it through llama.cpp or Ollama on a 24GB+ GPU or an Apple Silicon laptop with 32GB+ unified memory. Test the agent against your own click-and-fill workflow before deciding whether the local agent can replace the cloud calls. If you have an RTX card, the NVFP4 build is the throughput leader, but the GGUF build is the most portable. H Company's main site has the technical report with full benchmark methodology if you need to defend the choice to a team.