French AI lab H Company released Holo 3.1 on June 1 as an Apache 2.0 vision-language family built specifically for computer-use agents. The 4B variant runs on a single 12GB consumer GPU, the 35B-A3B mixture-of-experts variant scores 79.3 percent on AndroidWorld and 74.2 percent on OS-World, and the function-calling protocol slots into existing agent stacks without a wrapper rewrite. This tutorial walks through deploying the 4B model locally with vLLM, wiring it into a simple desktop-agent loop, and benchmarking step latency against a hosted alternative. Expect roughly 30 minutes from clone to first agent step on a clean Ubuntu 22.04 or 24.04 box with an RTX 3060 12GB or better. The only cost is electricity for the GPU; there is no API spend.

What You Need

  • GPU: NVIDIA RTX 3060 12GB or better. H Company's writeup also covers the RTX 4090, A6000, and DGX Spark. 16GB is recommended if you also run a browser automation stack on the same box.
  • OS: Ubuntu 22.04 or 24.04 with CUDA 12.4+ drivers. Windows WSL2 works but adds 5 to 10 percent latency.
  • Python: 3.10 or 3.11. Avoid 3.12 for vLLM compatibility today.
  • Disk: ~10 GB free for the 4B FP8 weights, ~70 GB for the 35B-A3B GGUF if you scale up later.
  • HuggingFace account: free, with a read token for gated repo access.
  • Optional: Docker for the Docker Model Runner path, or LM Studio for a GUI-first workflow.
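
The Python and disk requirements above can be checked before downloading anything. The following is a minimal preflight sketch using only the standard library. The function names and the 10 GB default are illustrative and simply mirror the list above; this is not an official H Company or vLLM check.

```python
import shutil
import sys

SUPPORTED_PYTHON = {(3, 10), (3, 11)}  # versions recommended in the list above


def python_ok(version=None):
    """Return True if the interpreter is one of the recommended versions."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def disk_ok(path=".", needed_gb=10.0):
    """Return True if `path` has at least `needed_gb` gigabytes free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    print("python ok:", python_ok())
    print("disk ok:  ", disk_ok())
```

Pass needed_gb=70 if you plan to pull the 35B-A3B GGUF later.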

The Workflow

Step 1: Pull the 4B model from HuggingFace

Install the HuggingFace CLI and pull the FP8 weights for the 4B variant, the sweet spot for a single 12GB card. The 0.8B variant is the fallback for laptops and low-VRAM testing. If your GPU has the headroom, move up to the 9B variant or to the 35B-A3B GGUF, the production-grade option for a 24GB or 48GB card.

pip install -U huggingface_hub
huggingface-cli login   # paste your read token
huggingface-cli download Hcompany/Holo-3.1-4B \
    --local-dir ./holo-3.1-4b \
    --local-dir-use-symlinks False

Expected output: a populated ./holo-3.1-4b directory with the FP8 safetensors, the tokenizer, and the model config. Download is roughly 8 GB.
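
To confirm the download before moving on, a short script can list what landed in the directory. This is a sketch that assumes the usual Hugging Face layout (config.json, tokenizer files, .safetensors shards); the exact file names in the Holo repo are not confirmed here, and summarize_download is an illustrative helper name.

```python
from pathlib import Path


def summarize_download(model_dir):
    """Report whether the expected files are present and the total size in GB."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "has_config": (root / "config.json").is_file(),
        "has_tokenizer": any(p.name.startswith("tokenizer") for p in files),
        "safetensors": sorted(p.name for p in files if p.suffix == ".safetensors"),
        "total_gb": sum(p.stat().st_size for p in files) / 1e9,
    }


if __name__ == "__main__":
    print(summarize_download("./holo-3.1-4b"))
```

A total well below the expected download size usually means the pull was interrupted; rerunning the download command resumes it.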

Step 2: Serve the model with vLLM

Install vLLM and serve the model. vLLM exposes an OpenAI-compatible endpoint at http://localhost:8000/v1, which is what lets the function-calling protocol drop into existing agent stacks. The vLLM docs are the source of truth for flags; the launch command below works on a clean install.

pip install -U vllm
vllm serve ./holo-3.1-4b \
    --served-model-name holo-3.1-4b \
    --port 8000 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Expected output: a startup log ending with Uvicorn running on http://0.0.0.0:8000. VRAM usage should sit around 10 GB on a 12 GB card, since 0.85 × 12 GB ≈ 10.2 GB; raise --gpu-memory-utilization only if nothing else is running on the same GPU.
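
Before wiring up an agent, it helps to confirm that the endpoint answers and to see which model name it expects. The sketch below queries the OpenAI-style /v1/models route using only the standard library; model_ids and list_models are illustrative helper names, and the URL assumes the port used above.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Query the running server and return the model names it serves."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    try:
        print("served models:", list_models())
    except OSError as exc:  # connection refused, timeout, and similar
        print("server not reachable:", exc)
```

The model value sent in each chat request must match one of the names returned here. By default vLLM uses the path passed to vllm serve as the name, unless --served-model-name overrides it.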

Step 3: Wire it into a desktop-agent loop

The agent loop is three calls: capture the screen, send the screenshot plus a goal to Holo, and dispatch the returned action through an input library. The snippet below uses the OpenAI Python client pointed at the local vLLM endpoint, plus pyautogui for input and mss for screen capture (pip install openai mss pyautogui). Replace pyautogui with Playwright for browser-only workflows or Frida for mobile control.

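import base64
import json
import time

import mss
import pyautogui
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the key is a placeholder.
client = OpenAI(api_key="local", base_url="http://localhost:8000/v1")

# The only action exposed in this minimal loop: a click at pixel coordinates.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "click",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}]

def screenshot_b64():
    """Capture the first monitor and return it as a base64-encoded PNG."""
    with mss.mss() as sct:
        png = sct.shot(output="step.png")  # writes the file, returns its path
    with open(png, "rb") as f:
        return base64.b64encode(f.read()).decode()

def run(goal, max_steps=20):
    for step in range(max_steps):
        img = screenshot_b64()
        resp = client.chat.completions.create(
            model="holo-3.1-4b",  # must match a name listed at /v1/models
            messages=[
                {"role": "system", "content": "You are a desktop agent. Return one action per step."},
                {"role": "user", "content": [
                    {"type": "text", "text": goal},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
                ]},
            ],
            tools=TOOLS,
            tool_choice="auto",
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            # No tool call means the model considers the goal finished.
            print("Done:", msg.content)
            return
        for call in msg.tool_calls:
            # Arguments arrive as a JSON string; parse them, never eval() them.
            args = json.loads(call.function.arguments)
            print(f"step {step}: {call.function.name} {args}")
            if call.function.name == "click":
                pyautogui.click(args["x"], args["y"])
        time.sleep(0.5)  # let the UI settle before the next screenshot

run("Open the Files app and create a folder called holo-test")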

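Expected output: each step executes the click the model returns, and the script prints Done: with the model's final message once it stops calling tools. A simple goal typically takes five to fifteen steps. With only click defined the agent cannot type the folder name, so add type, scroll, and key tool definitions for the full action space.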
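
One way to extend the action space is sketched below. The tool names and argument fields (type, scroll, key) are a hypothetical schema chosen for illustration, not Holo's documented action format, so check the model card before relying on them. The dispatcher takes any object with pyautogui-style click, write, scroll, and press methods, and treats malformed arguments as a failed step rather than crashing.

```python
import json


def _tool(name, properties, required):
    """Build one OpenAI-style function tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical action schema: names and fields are illustrative only.
TOOLS = [
    _tool("click", {"x": {"type": "integer"}, "y": {"type": "integer"}}, ["x", "y"]),
    _tool("type", {"text": {"type": "string"}}, ["text"]),
    _tool("scroll", {"clicks": {"type": "integer"}}, ["clicks"]),
    _tool("key", {"key": {"type": "string"}}, ["key"]),
]


def dispatch(name, raw_arguments, backend):
    """Parse a tool call's JSON arguments and run it on a pyautogui-like backend.

    Returns True if the action ran, False if it was unknown or malformed.
    """
    try:
        args = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError):
        return False  # not valid JSON, or not a string at all
    try:
        if name == "click":
            backend.click(args["x"], args["y"])
        elif name == "type":
            backend.write(args["text"])
        elif name == "scroll":
            backend.scroll(args["clicks"])
        elif name == "key":
            backend.press(args["key"])
        else:
            return False  # action not in the schema above
    except (KeyError, TypeError):
        return False  # required field missing or arguments not an object
    return True
```

In the loop, pass this list as the tools argument and call dispatch(call.function.name, call.function.arguments, pyautogui) for each returned tool call.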

Step 4: Benchmark step latency and accuracy

Run the same goal on Holo and on whichever hosted agent your team currently pays for. Capture three numbers per run: average step time, step count to completion, and cost. For reference, H Company's technical writeup reports that average step time dropped from 6.8 seconds on Holo 3.0 to 3.3 seconds on Holo 3.1, and that NVFP4 quantization delivers 1.41 times the throughput of FP8 on DGX Spark hardware at the cost of only a two-point drop in OS-World accuracy.
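
A small harness keeps those three numbers comparable across backends. The sketch below assumes a hypothetical step_fn callable that performs one agent step (screenshot, model call, action) and returns True once the goal is reached; cost_per_step is whatever per-step figure applies to the backend being measured.

```python
import statistics
import time


def time_steps(step_fn, max_steps=20):
    """Call step_fn() until it returns True (goal reached) or max_steps is hit.

    Returns the per-step wall-clock durations in seconds.
    """
    durations = []
    for _ in range(max_steps):
        start = time.perf_counter()
        done = step_fn()
        durations.append(time.perf_counter() - start)
        if done:
            break
    return durations


def summarize(durations, cost_per_step=0.0):
    """Collapse one run into the three numbers worth comparing."""
    return {
        "steps": len(durations),
        "avg_step_s": statistics.mean(durations) if durations else 0.0,
        "total_cost": len(durations) * cost_per_step,
    }
```

Run it several times per backend; a single run on a desktop UI is noisy enough to mislead.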

On a single 12GB card the 4B variant typically lands in the 0.8 to 1.2 second per-step range for image-only inputs, which is competitive with hosted Claude Computer Use on latency. The cost line is where local wins: a Holo step costs the electricity for one second of GPU time, while hosted agent APIs price each step in tens of cents once vision tokens and tool calls are billed.
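
The electricity figure is easy to estimate: power draw times step time gives energy, which converts to kWh and then to money. The numbers in this sketch (170 W draw, 1.0 second per step, 0.15 per kWh) are assumptions for illustration, not measurements; substitute your own GPU's draw and your local tariff.

```python
def electricity_cost_per_step(gpu_watts, step_seconds, price_per_kwh):
    """Energy cost of one step: watts * seconds -> joules -> kWh -> currency."""
    joules = gpu_watts * step_seconds
    kwh = joules / 3_600_000  # 1 kWh = 3.6 million joules
    return kwh * price_per_kwh


# Assumed figures, not measurements: 170 W draw, 1.0 s per step, 0.15 per kWh.
cost = electricity_cost_per_step(gpu_watts=170, step_seconds=1.0, price_per_kwh=0.15)
print(f"{cost:.8f} per step")  # prints "0.00000708 per step"
```

Under these assumptions a step costs a few millionths of a currency unit, which is why the comparison with per-step API pricing is lopsided even before idle power and hardware cost are factored in.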

Troubleshooting

  • vLLM crashes with OOM on a 12GB card: lower --gpu-memory-utilization to 0.75 and --max-model-len to 4096. If the crash persists, drop to the 0.8B variant.
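  • Tool calls return malformed JSON or are rejected: confirm that both --enable-auto-tool-choice and --tool-call-parser hermes are set on the server. Without them vLLM does not turn the model's output into structured tool calls, so the OpenAI client has nothing to parse.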
  • Latency above 2 seconds per step on a 4090: confirm that vLLM is not running in eager mode (the --enforce-eager flag disables CUDA graph capture) and check the startup log, which reports the attention backend in use. Recent vLLM wheels bundle their own FlashAttention kernels, so a separate flash-attn install is usually not needed.
  • HuggingFace download stalls: switch to a community mirror by setting HF_ENDPOINT=https://hf-mirror.com, or use the Ollama path with a community GGUF quantization.
  • Action loop clicks the wrong coordinates: confirm screen DPI scaling is set to 100 percent and that mss captures the same resolution the model sees. Mismatched DPI is the most common silent failure on Windows WSL2 hosts.
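
When the screenshot and the input library disagree on resolution, the coordinates the model returns can be rescaled before clicking. This is a sketch of that mapping. It assumes the model answers in screenshot pixels and that the input library reports its own coordinate space, for example through pyautogui.size(); to_screen_coords is an illustrative helper name.

```python
def to_screen_coords(x, y, shot_size, screen_size):
    """Map a point from screenshot pixels to the input library's coordinates.

    shot_size and screen_size are (width, height) tuples.
    """
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / shot_w), round(y * screen_h / shot_h)


# Example: a 3840x2160 capture on a display the input library sees as 1920x1080.
print(to_screen_coords(1000, 500, (3840, 2160), (1920, 1080)))  # prints (500, 250)
```

If both sizes match, the function returns the point unchanged, so it is safe to leave in the loop permanently.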

What to Try Next

For browser-only automation, swap pyautogui for Playwright and feed Holo the rendered viewport rather than the full desktop. For Android control, point the loop at an emulator screenshot and use ADB instead of pyautogui; the 35B-A3B variant is the one that lands the headline 79.3 percent AndroidWorld score. For larger workflows, scale to the 9B variant on a 24GB card or run the 35B-A3B GGUF through llama.cpp on Apple Silicon. The Apache 2.0 license means commercial deployment is fine without a separate license conversation, which is a meaningful contrast against hosted alternatives that price each step.
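
For the Android path, the two pieces that replace mss and pyautogui can be sketched with standard adb commands. The helper names are illustrative, and the code assumes adb is on PATH with an emulator or device attached; it has not been validated against Holo's own Android tooling.

```python
import subprocess


def adb_command(*args, serial=None):
    """Build an adb argument list, optionally targeting one device by serial."""
    prefix = ["adb"] + (["-s", serial] if serial else [])
    return prefix + [str(a) for a in args]


def adb_screenshot_png(serial=None):
    """Return the current screen as PNG bytes via `adb exec-out screencap -p`."""
    cmd = adb_command("exec-out", "screencap", "-p", serial=serial)
    return subprocess.run(cmd, check=True, capture_output=True).stdout


def adb_tap(x, y, serial=None):
    """Tap the screen at pixel (x, y) via `adb shell input tap`."""
    cmd = adb_command("shell", "input", "tap", x, y, serial=serial)
    subprocess.run(cmd, check=True)
```

Base64-encode the bytes from adb_screenshot_png() for the image_url field, and call adb_tap() where the desktop loop calls pyautogui.click().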

Frequently Asked Questions

Can Holo 3.1 replace Claude Computer Use or OpenAI Operator entirely?

For workflows that fit inside the reported accuracy envelope, yes: 79.3 percent on AndroidWorld for mobile tasks and 74.2 percent on OS-World for desktop tasks, both measured on the 35B-A3B variant. The function-calling protocol slots into existing agent stacks without a wrapper rewrite, so the swap is a configuration change rather than a workflow rebuild. The same swap-the-endpoint pattern applies to teams running OpenAI Codex computer-use on Windows or any other OpenAI-compatible endpoint. For workflows that require the longest-context reasoning or peak frontier-model performance, hosted models still hold an edge.

What is the minimum GPU for Holo 3.1?

The 0.8B variant runs on 6 GB of VRAM and is the laptop-tier starting point. The 4B variant on a 12GB card is the recommended local default. The 9B variant needs 16 GB, and the 35B-A3B GGUF needs 24 GB or 48 GB depending on quantization.
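
Those thresholds can be turned into a quick lookup. The table in this sketch copies the figures from the answer above; the function name is illustrative, and the 35B-A3B entry uses the lower 24 GB bound, which may not hold for every quantization.

```python
# Minimum VRAM per variant, in GB, as listed in the answer above.
MIN_VRAM_GB = {"0.8B": 6, "4B": 12, "9B": 16, "35B-A3B": 24}


def largest_variant_for(vram_gb):
    """Return the largest variant whose minimum VRAM fits, or None if none do."""
    fitting = [name for name, need in MIN_VRAM_GB.items() if need <= vram_gb]
    return fitting[-1] if fitting else None


print(largest_variant_for(12))  # prints 4B
```

Leave headroom if the same GPU also drives a desktop session or a browser automation stack.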

Does the Apache 2.0 license cover commercial use?

Yes. Apache 2.0 permits commercial use, modification, and redistribution. The license is attached to all four model sizes and to the published weights.

How does Holo 3.1 compare to Qwen3.5 and Kimi-K2.5 on the same benches?

H Company's reported numbers put Holo 3.1 35B-A3B ahead of Qwen3.5 and Kimi-K2.5 on AndroidWorld and OS-World. Independent verification on the live benches is what matters for production deployment; rerun the eval suite on your own workflows before betting a critical pipeline.

Can I run Holo 3.1 through Ollama or LM Studio?

Yes for the GGUF variants. Community quantizations land on Ollama and LM Studio within days of major releases. The 35B-A3B GGUF is the variant that typically appears first because it benefits the most from CPU-plus-GPU split inference.