Together AI Achieves 81-125% Throughput Gains on MiniMax M3
Key insights
- Together AI achieved 81-125% throughput gains on MiniMax M3 by redesigning sparse attention kernels to iterate over key-value groups in the outer loop.
- MiniMax Sparse Attention delivers more than 9x prefill speedup and more than 15x decode speedup compared to prior inference approaches.
- A Rust-based Serving Model Gateway offloads CPU-heavy video preprocessing before GPU workers are invoked, enabling efficient multimodal inference at scale.
Why this matters
MiniMax M3's block-sparse attention architecture demonstrates that frontier models are diverging from the dense-attention assumptions baked into standard serving frameworks, meaning inference teams can no longer assume off-the-shelf stacks will deliver competitive performance. The 81-125% throughput gains Together AI achieved translate directly to per-token cost reduction at scale, setting a concrete benchmark for what cloud inference providers must deliver when hosting models with non-standard architectures. This post signals that competitive advantage in model serving is shifting toward deep GPU kernel expertise, particularly for providers working with models that deviate from conventional attention patterns.
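As a rough worked example of the throughput-to-cost link above, the sketch below converts a throughput gain into a per-token cost reduction. It assumes the hourly cost of a GPU stays fixed and utilization is unchanged, which is an assumption of this example and not a figure from the post. The helper name `cost_reduction` is hypothetical.

```python
def cost_reduction(throughput_gain):
    """Fractional drop in cost per token when throughput rises by
    `throughput_gain` (0.81 means +81%), assuming a fixed GPU-hour cost."""
    # Cost per token is proportional to 1 / throughput, so the new cost
    # is 1 / (1 + gain) of the old cost.
    return 1.0 - 1.0 / (1.0 + throughput_gain)


# The two ends of the reported 81-125% range.
for gain in (0.81, 1.25):
    print(f"+{gain:.0%} throughput -> {cost_reduction(gain):.1%} lower cost per token")
```

Under that assumption, the reported range corresponds to roughly 45% to 56% lower cost per token. A 125% throughput gain therefore cuts cost by a little more than half; it does not remove it.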
Summary
Together AI on June 2 detailed how it rearchitected inference for MiniMax M3, whose 1-million-token context window and block-sparse attention mechanism exposed deep inefficiencies in conventional serving stacks.
The core fix was a KV-Block-Major kernel that iterates over key-value groups rather than queries, cutting redundant KV cache transfers at prefill. A stride-based paged attention integration added a separate 5% decode throughput gain.
In short, Together AI rebuilt the inference layer to make MiniMax M3 production-viable at scale.
- Both optimizations combined drove 81-125% throughput gains across concurrency levels.
- A Rust-based Serving Model Gateway handles video frame extraction, resizing, and normalization before requests reach GPU workers.
- MiniMax Sparse Attention delivers more than 9x prefill speedup and more than 15x decode speedup over prior approaches.
Custom kernel engineering is now the entry cost for deploying frontier models with non-standard attention patterns.
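The loop-order change described in this summary can be illustrated with a toy load-counting model. This is a sketch under stated assumptions, not Together AI's kernel. The sparsity pattern and both function names are hypothetical, and a real kernel would also need to merge partial softmax results per query and would see some reuse from on-chip caches.

```python
from collections import defaultdict


def query_major_loads(selected):
    """Outer loop over queries: every query fetches each KV block it selected.

    Returns (kv_block_fetches, query_block_updates).
    """
    fetches = updates = 0
    for blocks in selected:
        for _block in blocks:
            fetches += 1  # the block is fetched again for this query
            updates += 1  # attention work for this (query, block) pair
    return fetches, updates


def kv_block_major_loads(selected):
    """Outer loop over KV blocks: fetch each block once, reuse it for all
    queries that selected it.

    Returns (kv_block_fetches, query_block_updates).
    """
    # Invert the mapping: which queries need each KV block?
    queries_for_block = defaultdict(list)
    for q, blocks in enumerate(selected):
        for block in blocks:
            queries_for_block[block].append(q)

    fetches = updates = 0
    for _block, queries in queries_for_block.items():
        fetches += 1             # one fetch per KV block
        updates += len(queries)  # same attention work as before
    return fetches, updates


# Hypothetical block-sparse pattern: 4 queries, each attending to a few KV blocks.
selected = [{0, 1}, {0, 1, 2}, {1, 2}, {0, 2, 3}]
print(query_major_loads(selected))     # (10, 10)
print(kv_block_major_loads(selected))  # (4, 10)
```

Both orderings do the same ten query-block updates, but the KV-block-major ordering fetches each of the four blocks once, where the query-major ordering makes ten fetches. The gap widens as more queries share the same KV blocks, as they do during long-context prefill.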
Potential risks and opportunities
Risks
- Inference providers serving MiniMax M3 without comparable custom kernels face a structural 81-125% throughput disadvantage, directly undercutting their per-token pricing competitiveness
- Custom kernel work tied to M3's specific sparse attention layout may not transfer to future model versions, requiring repeated re-engineering cycles with each architecture change
- The Rust-based Serving Model Gateway (SMG) adds a preprocessing stage that, if under-scaled, could become a bottleneck and negate GPU throughput gains on high-volume multimodal workloads
Opportunities
- Inference providers that invest in sparse-attention kernel expertise can differentiate themselves as architecturally novel models like M3 become more common in production deployments
- Developers building long-context retrieval or video understanding applications gain access to production-grade 1M-token multimodal inference through Together AI's MiniMax M3 serving infrastructure
- Model providers with non-standard attention architectures can pursue deep inference co-engineering partnerships with cloud providers to unlock throughput gains comparable to the 81-125% demonstrated here
What we don't know yet
- Whether Together AI's KV-Block-Major kernel implementation will be open-sourced or remains proprietary to their serving infrastructure
- Exact latency figures for prefill time-to-first-token at 1M-token context lengths, which the post does not disclose
- How the Rust-based Serving Model Gateway scales under high-concurrency multimodal workloads with large video inputs; the post provides no specific throughput benchmarks
Originally reported by together.ai
Original headline: Together AI Achieves 81–125% Throughput Gains Serving MiniMax M3 via Novel Sparse Attention and Paged Attention Kernels