Xiaomi's MiMo hits 1,000 tokens per second — on regular cloud GPUs

By: Anton Kratiuk | 09.06.2026, 10:24

Xiaomi has pushed its MiMo-V2.5-Pro language model past 1,000 tokens per second with a new UltraSpeed mode — roughly 15 times faster than ChatGPT — and done it on standard, rentable cloud GPUs rather than custom silicon. The milestone matters because no one has hit that speed at the trillion-parameter scale before without dedicated hardware. For developers building real-time applications, that gap between "fast enough" and "instant" is the difference between a useful tool and a frustrating one.

The speed

The previous MiMo model, MiMo-V2-Flash, launched in December 2025 at around 150 tokens per second — already quicker than anyone reads. UltraSpeed pushes that to a sustained 1,000 tokens per second, peaking near 1,200, according to Xiaomi MiMo's official announcement. That is 6.7 times faster than its predecessor and well ahead of current public benchmarks: GPT-5.5 runs at roughly 68 tokens per second, Claude Opus at 71, and Gemini Flash at 192.
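The ratios quoted above follow directly from the reported throughput figures. Here is a minimal sketch of that arithmetic, using only the numbers cited in this article; the 2,000-token response length is an arbitrary illustration, not a figure from Xiaomi.

```python
# Throughput figures in tokens per second, as quoted in the article.
THROUGHPUT = {
    "MiMo-V2.5-Pro UltraSpeed": 1000,
    "MiMo-V2-Flash": 150,
    "GPT-5.5": 68,
    "Claude Opus": 71,
    "Gemini Flash": 192,
}


def speedup(fast: str, slow: str) -> float:
    """Return how many times faster `fast` generates tokens than `slow`."""
    return THROUGHPUT[fast] / THROUGHPUT[slow]


def seconds_to_generate(model: str, tokens: int) -> float:
    """Pure generation time; ignores network latency and time to first token."""
    return tokens / THROUGHPUT[model]


if __name__ == "__main__":
    RESPONSE_TOKENS = 2000  # hypothetical response length, for illustration only
    for name in THROUGHPUT:
        secs = seconds_to_generate(name, RESPONSE_TOKENS)
        ratio = speedup("MiMo-V2.5-Pro UltraSpeed", name)
        print(f"{name}: {secs:.1f} s for {RESPONSE_TOKENS} tokens "
              f"(UltraSpeed is {ratio:.1f}x as fast)")
```

At these rates, a response that takes about half a minute at 68 tokens per second arrives in two seconds at 1,000, which is where the "fast enough" versus "instant" distinction becomes noticeable.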

Three techniques work together to make this possible: FP4 expert quantization (storing the model's expert weights in a 4-bit floating-point format without gutting accuracy), DFlash speculative decoding (drafting several tokens ahead and verifying them in parallel), and TileRT runtime optimization, co-developed with inference startup TileRT. Both the FP4-DFlash checkpoint and the TileRT modules are open-sourced, on Hugging Face and GitHub respectively, so teams can self-host and test independently.
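To give a feel for why speculative decoding saves time, here is a toy sketch of the generic greedy variant of the technique. It is not DFlash, whose internals are not described here: the `target` and `draft` functions are hypothetical stand-ins for a large model and a small, cheap drafter, and the "tokens" are just digits.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round: draft k tokens, keep the prefix the target agrees with."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposal. A real system does these
    #    checks in one batched forward pass; this toy loop is sequential.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, then stop
            return accepted
        accepted.append(tok)

    # 3. Every draft token was accepted, so the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens; also return how many target rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds


# Hypothetical toy models: the target counts upward modulo 10, and the
# draft agrees with it except after the token 7.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [0], n_tokens=20)
    print(tokens[1:], "generated in", rounds, "target rounds")
```

The output is identical to what the target model would produce on its own; the gain is that each round of target verification can yield several tokens instead of one, as long as the drafter guesses well.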

The catch

Speed costs money. UltraSpeed is priced at 3× the standard MiMo-V2.5-Pro rate, putting it at roughly $1.29 per million input tokens and $2.61 per million output tokens. That is in line with Groq's price points, but without requiring proprietary chips, as PhoneWorld notes. The obvious use cases are fraud detection, algorithmic trading, and real-time translation: workflows where latency has a direct dollar cost.
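For budgeting purposes, the quoted prices translate into per-request costs as sketched below. The standard-mode rate is not quoted in this article; it is derived here by dividing the UltraSpeed price by the stated 3× multiplier, so treat it as an implied figure. The request sizes in the example are hypothetical.

```python
# UltraSpeed prices in US dollars per million tokens, as quoted in the article.
ULTRASPEED_INPUT_PER_M = 1.29
ULTRASPEED_OUTPUT_PER_M = 2.61
ULTRASPEED_MULTIPLIER = 3  # UltraSpeed is priced at 3x the standard rate

# Implied standard-mode prices (derived, not quoted by the article).
STANDARD_INPUT_PER_M = ULTRASPEED_INPUT_PER_M / ULTRASPEED_MULTIPLIER
STANDARD_OUTPUT_PER_M = ULTRASPEED_OUTPUT_PER_M / ULTRASPEED_MULTIPLIER


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Cost in dollars of one request at the given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 4,000 prompt tokens and 1,000 generated tokens.
    fast = request_cost(4000, 1000, ULTRASPEED_INPUT_PER_M, ULTRASPEED_OUTPUT_PER_M)
    slow = request_cost(4000, 1000, STANDARD_INPUT_PER_M, STANDARD_OUTPUT_PER_M)
    print(f"UltraSpeed: ${fast:.5f} per request")
    print(f"Standard (implied): ${slow:.5f} per request")
```

Because both input and output prices scale by the same factor, any workload costs three times as much in UltraSpeed mode regardless of its mix of prompt and generated tokens.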

Access during the trial period (June 9–23, 2026) is application-gated and prioritizes enterprise clients and professional developers with concrete use cases. Approved users get two weeks free, with daily limits: 10 queued requests per account, 30-minute session caps, and automatic disconnection after 5 minutes idle. The Token Plan is not supported in UltraSpeed mode. No US- or UK-specific API pricing or regional infrastructure has been announced beyond the trial.

What to watch

All speed figures come from Xiaomi's own benchmarks; no independent third-party verification has been published yet. The open-source checkpoint on Hugging Face should generate community testing quickly. Speculative-decoding acceptance rates (the share of drafted tokens the main model accepts) also dip in open-ended conversation compared to coding tasks, which lowers throughput, so real-world production performance across general applications remains to be seen.
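Why acceptance rates matter for speed can be seen from a standard back-of-the-envelope formula for speculative decoding. Under the simplifying assumption that each drafted token is accepted independently with probability alpha, a round that drafts k tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The sketch below evaluates that formula; the acceptance rates and draft length are hypothetical values, not measurements from Xiaomi.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Average tokens produced per verification round.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the main model always contributes one token
    (a correction on rejection, or a bonus token if all drafts pass).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # limit of the formula as alpha approaches 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)


if __name__ == "__main__":
    K = 4  # hypothetical draft length
    for alpha in (0.9, 0.75, 0.6):  # hypothetical acceptance rates
        tokens = expected_tokens_per_round(alpha, K)
        print(f"acceptance {alpha:.2f}: {tokens:.2f} tokens per round")
```

The relationship is steep at the high end: a modest drop in acceptance rate removes a large share of the tokens gained per round, which is why workload type can shift real-world throughput noticeably.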
