mistral.rs v0.8.2 landed on June 1, 2026 as the latest release of the open-source Rust inference engine. The update delivers substantial GPU performance improvements and a set of agentic tool-calling features that matter for creators running local language models in their workflows.

What Happened

The v0.8.2 release of mistral.rs introduces two categories of improvements: GPU kernel optimizations that accelerate inference for mixture-of-experts models, and agentic infrastructure that makes it easier to run tool-calling workflows locally.

On the performance side, Gemma 4 on quantized CUDA now sees 3.5 to 5.5 times faster prefill through optimized MoE kernels, with roughly 10% faster decode via fused kernel operations. Apple Silicon also receives Metal optimizations in this release.

Why It Matters

Mixture-of-experts models like Gemma 4 and Mistral MoE activate only a subset of parameters per token, making them efficient at inference time. But that efficiency only holds when the routing and computation layers are properly fused at the GPU level. The v0.8.2 kernel improvements close a meaningful gap between what these models can theoretically do and what they actually deliver on consumer hardware.
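To make "only a subset of parameters per token" concrete, here is a minimal sketch of top-k expert routing in plain Python. It is illustrative only: the toy experts, the `route_token` helper, and the choice of `top_k=2` are assumptions made for this example, not mistral.rs code or the actual Gemma 4 configuration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, experts, token, top_k=2):
    """Run only the top_k highest-scoring experts for one token.

    Returns the weighted output and the indices of the experts that ran.
    """
    probs = softmax(router_logits)
    # Pick the top_k experts by router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the selected weights so they sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = sum((probs[i] / norm) * experts[i](token) for i in chosen)
    return output, chosen

# Hypothetical toy experts: each one scales its input by a different factor.
experts = [lambda x, f=f: x * f for f in (1.0, 2.0, 3.0, 4.0)]
output, chosen = route_token([0.1, 2.0, 0.3, 1.5], experts, token=1.0)
print(chosen)  # indices of the two experts that ran
```

Only two of the four experts execute for this token. In a naive implementation the routing, the per-expert matrix multiplies, and the weighted sum would each be separate GPU launches, which is the overhead that fused kernels aim to remove.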

For creators building local AI pipelines, faster prefill means shorter wait times when processing long prompts or document batches. The tool-calling additions make mistral.rs a more complete backend for agentic workflows that chain model outputs into downstream actions.
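As a rough back-of-the-envelope check, the sketch below combines the reported prefill and decode gains into an end-to-end estimate. The baseline timings are invented for illustration, and the function assumes that "N times faster" means a phase takes 1/N of its previous time; real results depend on model, quantization, and hardware.

```python
def estimate_latency(prefill_s, decode_s, prefill_speedup, decode_speedup=1.10):
    """Estimate request latency after speeding up each phase.

    Assumes "N times faster" means the phase takes 1/N of its old time.
    """
    return prefill_s / prefill_speedup + decode_s / decode_speedup

# Hypothetical baseline: 20 s of prefill on a long prompt, 10 s of decode.
baseline = 20.0 + 10.0
low = estimate_latency(20.0, 10.0, prefill_speedup=3.5)
high = estimate_latency(20.0, 10.0, prefill_speedup=5.5)
print(f"baseline {baseline:.1f}s, after upgrade {high:.1f}s to {low:.1f}s")
```

With these made-up numbers the request drops from 30 seconds to roughly 13 to 15 seconds, and decode becomes the dominant cost. The more prompt-heavy the workload, the closer the overall gain gets to the headline prefill figure.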

Key Details

  • 3.5-5.5x faster MoE prefill on quantized CUDA (Gemma 4 and similar models)
  • Roughly 10% faster decode through fused MoE decode kernels
  • Tool call strict mode for more reliable agentic behavior
  • Mid-stream grammar enforcement for structured tool call outputs
  • Code execution sandboxing with file output via /v1/files endpoint
  • Agentic presets via CLI for common workflow configurations
  • HF_HUB_OFFLINE support for loading pre-downloaded models with no network access
  • MTP speculative decoding for Gemma 4

mistral.rs supports models from the Mistral family, Gemma, Llama, and others. Supported hardware backends include NVIDIA CUDA, Apple Metal, and CPU.

What to Do Next

The release and its notes are available on GitHub at github.com/EricLBuehler/mistral.rs. If you run Gemma 4 or any MoE model on CUDA, the prefill improvement alone is worth the upgrade. HF_HUB_OFFLINE support is useful for air-gapped or low-bandwidth setups where reaching Hugging Face on every run is not viable, provided the models have already been downloaded.
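For scripted launches, one way to turn on offline mode is to set the variable in the environment of the process that starts mistral.rs. The sketch below assumes the conventional Hugging Face value of "1"; the `offline_env` helper is hypothetical, and the launch command is deliberately left out because it depends on your install.

```python
import os

def offline_env(base=None):
    """Return a copy of the environment with HF_HUB_OFFLINE enabled."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"
    return env

env = offline_env()
# Pass `env` to subprocess.run(your_command, env=env), where `your_command`
# is whatever you already use to start mistral.rs (left out here on purpose).
print(env["HF_HUB_OFFLINE"])
```

Copying the environment rather than mutating `os.environ` keeps the setting scoped to the launched process, so other tools in the same script can still reach the network.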