Liquid AI quietly dropped LFM2.5-8B-A1B on Hugging Face on May 28, the first reasoning-tuned mixture-of-experts model in the LFM2.5 family. It has 8.3B total parameters but only activates 1.5B per token, runs a 131,072-token context, and ships as native, GGUF, ONNX, and four MLX quantizations on day one.

No marketing blog post accompanied the release, which MarkTechPost spotted first. The model card on Hugging Face is the canonical source.

What this enables for creators

LFM2.5-8B-A1B is built for agentic workflows that run on the device, not in the cloud. The model card explicitly recommends it for tool use, structured outputs, multilingual assistants, and on-device personal-assistant applications.

Practical setup paths:

  • Mac: pull the MLX 4-bit build from the LiquidAI repo and follow the MLX deployment guide. The 4-bit variant fits in roughly 5 GB and runs fast on M-series chips.
  • Windows or Linux laptop: use the GGUF build inside LM Studio or llama.cpp. Tool calling works out of the box once you wire JSON function definitions into the system prompt.
  • Server: vLLM and SGLang give OpenAI-compatible endpoints if you want to power a multi-user creator app from a single H100.

Why It Matters

The headline number is the AA-Omniscience Non-Hallucination Rate, which jumps from 7.46 on the original LFM2-8B-A1B to 63.47 on this 2.5 release. IFEval rises from 79.44 to 91.84, and MATH500 climbs from 74.80 to 88.76. For creators building local agents to research, summarize source material, or call internal APIs, hallucination rates matter more than raw benchmark wins. A model that admits when it does not know is a model you can leave running unattended.

Key Details

The architecture stacks 24 layers: 18 double-gated short-convolution blocks and 6 grouped-query attention blocks. Pretraining scaled from 12T tokens on the previous generation to 38T tokens here. Tool calls use a Pythonic syntax wrapped in <|tool_call_start|> and <|tool_call_end|> markers, similar to recent open-weights tool-calling releases. Liquid AI lists English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish as supported languages. License is "lfm1.0" with terms posted on the model card. Peak throughput on a single H100 hits 18.5K output tokens per second at high concurrency, which puts it in serious production territory for self-hosters.

What to Do Next

If you already run local models, the GGUF build slots into your existing llama.cpp or Ollama setup without changes. If you have an Apple Silicon Mac, pull the MLX 4-bit variant for the lowest memory footprint, or the bf16 version for highest quality. Creators experimenting with on-device agent workflows should compare against existing options before deciding whether to swap. The tool-calling spec is the differentiator: test it against an MCP server or your custom function set first.