MiMo V2.5 Pro UltraSpeed: 1T Model, 1000 Tokens/Sec

Xiaomi's MiMo team announced MiMo-V2.5-Pro-UltraSpeed on June 8, claiming the first 1-trillion-parameter Mixture-of-Experts model to decode at over 1,000 tokens per second on a single standard 8-GPU node. Peak throughput reaches around 1,200 tokens per second. The team is also releasing the underlying FP4 checkpoint as open weights on HuggingFace and opening a free API trial from June 9 through June 23.

Try it: spin up a 1000-tps assistant in your writing or code workflow

For creators building tools on top of an LLM, decode speed is the difference between a chatbot UX and a real-time draft surface. At 1,000+ tokens per second, full essays land in a few seconds instead of half a minute. Two concrete things to do this week:

Apply for the trial at platform.xiaomimimo.com/ultraspeed. The team says they will prioritize enterprises and professional developers, so include a business case in the application.
Pull the FP4 checkpoint from HuggingFace if you have local Hopper or Blackwell capacity and want to benchmark the open-weights variant on your own inference stack.
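For the benchmarking step, a minimal sketch of a throughput timer is below. It assumes only that your serving stack or API client can hand you tokens as a Python iterable; the names measure_decode_tps and fake_stream are hypothetical helpers for illustration and are not part of any Xiaomi or HuggingFace API. It separates time to first token from steady-state decode speed, since the headline figure is a decode number.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_decode_tps(token_stream: Iterable[str]) -> dict:
    """Consume a token stream and report time-to-first-token and decode tokens/sec.

    Decode throughput is measured from the first token to the last one,
    so prefill / queueing time does not inflate or deflate the number.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now
        last = now
        count += 1

    if count == 0:
        return {"tokens": 0, "ttft_s": None, "decode_tps": None}

    window = last - first
    # With a single token there is no decode interval to measure.
    decode_tps = (count - 1) / window if count > 1 and window > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tps": decode_tps}


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a real model stream (hypothetical): yields tokens with a fixed delay."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    # Replace fake_stream(...) with the token iterator from your own stack.
    print(measure_decode_tps(fake_stream(200, 0.001)))
```

Running the same prompt set through the trial API and through a local deployment of the FP4 checkpoint with a timer like this gives a like-for-like comparison of the two serving paths.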

Why it matters

1T-parameter open-weights models have historically been gated by serving cost rather than training cost. Decode throughput in the 30 to 100 tokens-per-second range on commodity nodes is where most production deployments live, and that ceiling is why agent loops feel sluggish and why long-document workflows route to streaming APIs. A 10x speedup at the same scale, on commodity hardware, would change what creators can ship as an interactive surface. The catch is that the UltraSpeed throughput depends on Xiaomi's proprietary TileRT inference system. The open-weights FP4 checkpoint is downloadable and runnable, but reproducing the headline speed outside Xiaomi's stack is not guaranteed.
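To make the gap concrete, here is a back-of-the-envelope sketch of decode time at the throughput figures mentioned above. The 3,000-token essay length is an assumption chosen for illustration, not a number from Xiaomi, and the calculation ignores prefill and network latency.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to decode num_tokens at a steady tokens_per_second rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


ESSAY_TOKENS = 3000  # assumed length of a long essay, for illustration only

if __name__ == "__main__":
    # 30 and 100 tps: the typical range cited above; 1000 and 1200: the UltraSpeed claims.
    for tps in (30, 100, 1000, 1200):
        print(f"{tps:>5} tok/s -> {decode_seconds(ESSAY_TOKENS, tps):6.1f} s")
```

Under that assumption the same essay takes 100 seconds at 30 tokens per second, 30 seconds at 100, and 3 seconds at 1,000, which is the difference between waiting on a response and watching a draft appear.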

Key details

The model is a 1-trillion-parameter MoE with sliding window attention. Inference uses FP4 quantization on the experts and FP8 elsewhere. The team reports that benchmark capability is "essentially on par with the original model" after quantization, though they did not publish direct comparison numbers against DeepSeek V4, GPT-5, or Claude Mythos.
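The mixed-precision split also determines how much memory the weights need. The sketch below estimates weight storage for FP4 experts (4 bits per parameter) and FP8 everywhere else (8 bits per parameter). The share of parameters that sits in the experts is not stated in the announcement, so the 95% used here is purely an assumption, and the estimate ignores quantization scale factors, KV cache, and activations.

```python
def weight_gigabytes(
    total_params: float,
    expert_fraction: float,
    expert_bits: int = 4,   # FP4 on the experts
    other_bits: int = 8,    # FP8 elsewhere
) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a mixed-precision MoE."""
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1.0 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e9


if __name__ == "__main__":
    TOTAL_PARAMS = 1e12      # 1 trillion parameters, per the announcement
    EXPERT_FRACTION = 0.95   # ASSUMPTION: not published by Xiaomi

    mixed = weight_gigabytes(TOTAL_PARAMS, EXPERT_FRACTION)
    all_fp8 = weight_gigabytes(TOTAL_PARAMS, EXPERT_FRACTION, expert_bits=8)
    print(f"FP4 experts + FP8 rest: ~{mixed:.0f} GB")
    print(f"FP8 everywhere:         ~{all_fp8:.0f} GB")
```

With these assumed inputs, quantizing the experts to FP4 roughly halves the weight footprint compared with FP8 throughout, which is part of why a model at this scale can be served from a single node at all.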

The API trial runs from June 9 to June 23, 2026 (Beijing Time). Access requires an approved application. The TileRT serving stack is not open-sourced. Xiaomi has historically released MiMo checkpoints under permissive licenses, but creators integrating the FP4 checkpoint into a product should confirm the license terms on the HuggingFace page before shipping.

What to do next

If you build agent loops, voice-mode chat, or any creator tool where streaming latency shapes the UX, file an UltraSpeed trial application this week, before the trial ends on June 23. If you maintain a local inference stack, download the FP4 checkpoint and benchmark it on your own hardware to separate the model's quality story from the proprietary serving story. If you publish a directory or comparison page, note that Xiaomi is claiming the first 1T-parameter open-weights MoE to decode a full essay in a few seconds, which raises the bar for what "real time" means in creator tooling.
