MiniMax Claims M3 Decodes 15.6x Faster at 1M Tokens
Key insights
- MiniMax's M3 targets 9.7x faster prefill and 15.6x faster decode than M2 specifically at 1-million-token context lengths.
- MiniMax skipped sparse attention for M2 due to production-readiness concerns, making M3 a deliberate architectural reversal.
- MiniMax has confirmed no release date for M3 beyond a general H2 2026 target and has disclosed no accuracy tradeoffs alongside the speedup figures.
Why this matters
A 15.6x decode speedup at 1M-token contexts, if it holds under production load, could make ultra-long-context inference economically viable for enterprise customers who currently avoid it because of latency and cost. MiniMax's public preview also suggests that the field is coming to treat sparse attention as production-ready in 2026, which may pressure competitors including Anthropic, Google DeepMind, and OpenAI to accelerate their own efficient-attention roadmaps. The company's candid acknowledgment that it deprioritized this approach for M2 lends some credibility to the claim that the architecture is now ready for production rather than a marketing-stage preview.
Summary
MiniMax has previewed the sparse attention architecture for M3, its upcoming model, claiming 9.7x faster prefilling and 15.6x faster decoding versus M2 at 1-million-token contexts.
Sparse attention reduces compute cost over long sequences by scoring only a selected subset of token pairs rather than every pair, which matters because the cost of full attention grows quadratically with sequence length. MiniMax deliberately bypassed this for M2 over production-readiness concerns; M3 is the course correction, with the company now treating long-context efficiency as a first-class requirement rather than a post-launch optimization.
Essentially: MiniMax is betting sparse attention is mature enough for production, a calculation its own M2 timeline shows it was not confident making during that model's development.
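To make the idea of skipping token pairs concrete, here is a minimal sketch in plain Python. It uses a causal sliding window as the sparse pattern purely for illustration; MiniMax has not said which sparse scheme M3 uses, and the function names and the `window` parameter are assumptions of this example, not anything from the announcement.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, window=None):
    """Causal single-head attention over lists of equal-length vectors.

    window=None scores every earlier token (full attention).
    window=w scores only the w most recent tokens, a sliding-window
    pattern chosen here only as an illustrative sparse scheme.
    Returns (outputs, pair_count), where pair_count is the number of
    query-key pairs that were actually scored.
    """
    dim = len(queries[0])
    outputs, pair_count = [], 0
    for i, q in enumerate(queries):
        start = 0 if window is None else max(0, i - window + 1)
        visible = range(start, i + 1)
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        pair_count += len(scores)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, visible))
             for d in range(dim)]
        )
    return outputs, pair_count


def count_scored_pairs(n, window=None):
    """Closed-form pair count for a causal sequence of n tokens."""
    if window is None or window >= n:
        return n * (n + 1) // 2
    # The first `window` positions see all earlier tokens; the rest see `window`.
    return window * (window + 1) // 2 + (n - window) * window


if __name__ == "__main__":
    n = 1_000_000
    hypothetical_window = 4096  # illustrative value, not an M3 parameter
    dense = count_scored_pairs(n)
    sparse = count_scored_pairs(n, hypothetical_window)
    print(f"full attention pairs:   {dense}")
    print(f"sliding-window pairs:   {sparse}")
    print(f"pair-count reduction:   {dense / sparse:.1f}x")
```

The pair-count ratio printed here is a property of this toy pattern only. It says nothing about M3's measured speedups, which also depend on memory bandwidth, kernel efficiency, and whatever selection mechanism the real architecture uses.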
- 9.7x prefill and 15.6x decode speedups apply specifically at 1M-token contexts, where full attention becomes computationally prohibitive.
- Decode speedup is especially notable because decoding, which generates one token at a time and rereads the cached context at every step, is typically the bottleneck in production inference, not prefill.
- No release date has been confirmed for M3 beyond a general H2 2026 target.
If those numbers survive production conditions, M3 changes the cost calculus for deploying very long-context models in real applications.
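As a rough way to reason about that cost calculus, the sketch below projects end-to-end request latency if both claimed speedups held exactly. MiniMax has published no absolute latencies, so every baseline figure passed in is a hypothetical placeholder, and the function name is an assumption of this example.

```python
PREFILL_SPEEDUP = 9.7   # claimed by MiniMax, M3 vs. M2 at 1M tokens
DECODE_SPEEDUP = 15.6   # claimed by MiniMax, M3 vs. M2 at 1M tokens


def projected_request(prefill_s, decode_s_per_token, output_tokens):
    """Project request latency if both claimed speedups held exactly.

    All three arguments are hypothetical baseline figures supplied by
    the caller. Returns (baseline_seconds, projected_seconds,
    overall_speedup).
    """
    baseline = prefill_s + decode_s_per_token * output_tokens
    projected = (
        prefill_s / PREFILL_SPEEDUP
        + decode_s_per_token * output_tokens / DECODE_SPEEDUP
    )
    return baseline, projected, baseline / projected


if __name__ == "__main__":
    # Placeholder baseline: 100 s prefill, 0.5 s per decoded token, 200 tokens out.
    baseline, projected, overall = projected_request(100.0, 0.5, 200)
    print(f"baseline:  {baseline:.1f} s")
    print(f"projected: {projected:.1f} s")
    print(f"overall:   {overall:.1f}x")
```

The overall speedup always falls between the two claimed figures and shifts toward 15.6x as the response gets longer, so the benefit a given deployment sees depends on its own mix of prompt length and output length.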
Potential risks and opportunities
Risks
- If M3's sparse attention degrades output quality on long-context tasks at production scale, MiniMax risks repeating the M2 pattern and deprioritizing the approach a second time.
- Competitors (Anthropic, Google DeepMind, Mistral) may ship long-context efficiency improvements before M3's H2 2026 window, eroding the announcement's market differentiation.
- Without published accuracy benchmarks alongside speedup claims, enterprise customers cannot make informed deployment decisions, which could slow M3 adoption even after release.
Opportunities
- Inference infrastructure providers (Groq, Cerebras, Together AI) can position optimized sparse-attention serving pipelines for M3 ahead of its H2 2026 launch.
- Enterprises in legal, finance, and healthcare with large document corpora now have a concrete target model to anchor long-context pipeline planning around for late 2026.
- Open-source model developers (the Qwen team, Mistral, the Falcon team) could use MiniMax's published speedup figures as a competitive target to accelerate their own sparse attention implementations before M3 ships.
What we don't know yet
- MiniMax has not disclosed whether the 9.7x prefill and 15.6x decode figures were measured on academic tasks or production-representative workloads.
- MiniMax has not specified whether M3's sparse attention is a custom design or based on an existing method such as sliding-window attention or a block-sparse FlashAttention variant.
- No accuracy or quality tradeoff data has been published alongside the speedup numbers, leaving the capability-efficiency balance undefined ahead of H2 2026.
Originally reported by x.com
Original headline: MiniMax Teases M3 Sparse Attention Architecture: 9.7x Prefill and 15.6x Decode Speedup at 1M Tokens vs. M2