huggingface.co web signal

MiniMax open-sources M3, 428B multimodal MoE with 1M context

TL;DR

  • MiniMax has released M3 on Hugging Face, a roughly 428B-parameter Mixture-of-Experts model with about 23B parameters active per token.
  • MiniMax says its new sparse attention operator delivers a 9× prefill and 15× decode speedup over M2 at 1M-token context.
  • M3 is natively multimodal across text, image, and video, with reported scores of 80.5 on SWE-bench Verified and 85.4 on Video-MME v2.

A new open-weight release from MiniMax landed on Hugging Face this month, and the part worth paying attention to is not the parameter count but what the company says it did to attention itself. The MiniMax-M3 model page describes M3 as a roughly 428 billion parameter Mixture-of-Experts model with about 23 billion parameters active per token, natively multimodal across text, image, and video, with a 1 million token context window.
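To put those sizes in perspective, here is a rough back-of-envelope sketch. It uses only the approximate parameter counts quoted above. The bytes-per-parameter widths are illustrative assumptions, not formats the model page confirms, and the estimate covers weights only, leaving out KV cache, activations, and runtime overhead.

```python
# Back-of-envelope figures from the approximate sizes quoted on the model page.
TOTAL_PARAMS = 428e9    # "roughly 428B" total parameters
ACTIVE_PARAMS = 23e9    # "about 23B" active per token

# Assumed storage widths in bytes per parameter (illustrative, not from the model page).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in one token's forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    for name, width in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, width):,.0f} GB of weights")
```

Under these assumptions only about 5% of the weights are active for any one token, yet a typical MoE deployment still has to keep every expert resident in memory, which is why the total parameter count rather than the active count drives the hardware footprint.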

The headline architectural claim is something MiniMax calls MiniMax Sparse Attention, or MSA. Instead of running full attention across a million-token context, MSA uses a pre-filtering stage that identifies relevant context blocks and attends only to those. The reported numbers are striking: at 1M context, MiniMax says M3 delivers a 9× prefill speedup and a 15× decode speedup over its previous model, M2, and cuts per-token compute to roughly 1/20. If those numbers hold up when third parties serve the model, the cost picture for long-document and agentic workloads on open weights changes meaningfully.
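The general idea of pre-filtering blocks before attending can be illustrated with a toy sketch. This is not MiniMax's MSA operator, whose details are not described here. It is a generic block-sparse scheme for a single query, and the mean-key scoring rule, the function names, and the `block_size` and `top_k` parameters are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, top_k: int) -> List[int]:
    """Pre-filter: score each block by its mean key and keep the top_k blocks."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start // block_size))
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:top_k])


def block_sparse_attention(query: Vector, keys: List[Vector],
                           values: List[Vector],
                           block_size: int, top_k: int) -> Vector:
    """Softmax attention restricted to tokens inside the selected blocks."""
    kept = select_blocks(query, keys, block_size, top_k)
    positions = [p for b in kept
                 for p in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[p]) * scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][d] for w, p in zip(weights, positions))
            for d in range(dim)]
```

In this toy version the query attends to at most `top_k * block_size` tokens instead of the full sequence, plus one cheap score per block. Setting `top_k` to the total number of blocks recovers ordinary dense attention, and the risk the sketch makes visible is that a relevant token sitting in a block with a low summary score is never seen at all.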

On the model's own benchmark table, MiniMax reports 80.5 on SWE-bench Verified, 59 on SWE-bench Pro, 78.1 on MMMU Pro Standard, and 85.4 on Video-MME v2. The model exposes three reasoning modes through a thinking parameter (enabled, adaptive, disabled), so the same checkpoint can trade latency against reasoning depth without swapping weights. Support for SGLang, vLLM, Transformers, and quantized runtimes including llama.cpp and Ollama is listed at launch.

The honest caveat is that almost every specific number here comes from MiniMax itself. The 9× and 15× speedup figures, the 1/20 compute reduction, and the benchmark scores are author-reported, and sparse attention schemes have a history of looking great on synthetic long-context tests and then posting weaker results on adversarial retrieval. The page also says little about the training data behind the video modality, the hardware required to serve M3 at 1M context, or what the 'minimax-community' license permits commercially.

Looking ahead, the interesting question is who benefits if even half of the speedup claim survives reproduction. Inference vendors hosting open frontier weights pick up a fresh long-context option, agent framework builders get a model they can point at whole-codebase or whole-document workloads without paying closed-API per-token rates, and the research community gets an open reference for sparse attention at this scale to take apart.
