Reddit / r/LocalLLaMA via Reddit

Cohere Command A+ 218B runs on Apple Silicon via MLX

Key insights

  • Cohere Command A+ activates only 25B of 218B parameters per token, making local inference feasible on high-memory M-series Macs.
  • The Apache 2.0 license allows commercial deployment without Cohere licensing agreements, unlike many comparable-scale models.
  • A community-driven MLX pull request, not an official Cohere release, enabled Apple Silicon support for this 218B MoE model.

Why this matters

Running a 218B-parameter enterprise model locally on Apple Silicon removes the cloud API dependency for organizations with data-residency or latency constraints, which directly expands the viable deployment surface for Cohere's open-weight stack. The MLX community's ability to port models of this scale before the original vendor does signals that open-weight release under permissive licenses increasingly shifts the integration roadmap from the model provider to the ecosystem. For founders and infrastructure teams, this confirms that MoE sparsity at the 25B-active-parameter range is now within reach of prosumer hardware, which reshapes the cost baseline for on-device enterprise inference in 2026.

Summary

Cohere's Command A+, a 218-billion-parameter mixture-of-experts model released under Apache 2.0, can now run natively on Apple Silicon after a community developer built a cohere2_moe implementation for the MLX framework and opened a pull request in the mlx-lm repository. The architecture activates only 25 billion of its 218 billion parameters per token, routing each token to 8 of 128 experts plus a single shared expert. That sparsity is what makes local inference on M-series hardware plausible at all: the active compute footprint is a fraction of the nominal model size, though memory bandwidth requirements remain steep. In short, a large open-weight enterprise model from Cohere now runs on Apple's consumer-grade silicon without a GPU cluster in the middle.

  • 218B total parameters, 25B active per token, 128 experts with top-8 routing and one shared expert
  • Pull request now open in mlx-lm; commenters independently confirmed successful loading on M-series hardware
  • Apache 2.0 license means downstream commercial use requires no royalty or special agreement with Cohere

The successful port extends the frontier of what counts as "locally runnable" for enterprise-grade open-weight models, compressing a capability that previously required multi-GPU infrastructure into hardware that sits on a desk.
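
To make the routing numbers concrete, here is a minimal pure-Python sketch of top-8-of-128 routing with one always-on shared expert. It is a toy illustration of the general technique, not the cohere2_moe code from the pull request. The function names, the softmax over only the selected logits, and the scalar "experts" are all assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, per the thread
TOP_K = 8          # routed experts selected per token, per the thread


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only. That normalisation
    is an assumption for this sketch; the real model may differ.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(token, router_logits, routed_experts, shared_expert):
    """Add the shared expert's output to the weighted routed-expert outputs."""
    out = shared_expert(token)  # the shared expert runs for every token
    for idx, weight in route_token(router_logits):
        out += weight * routed_experts[idx](token)  # only 8 of 128 are called
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    # Toy scalar experts: each one just scales its input by a fixed factor.
    experts = [lambda x, s=i: x * (s + 1) / NUM_EXPERTS
               for i in range(NUM_EXPERTS)]
    chosen = route_token(logits)
    print(len(chosen), "of", NUM_EXPERTS, "routed experts used for this token")
    print("output:", moe_output(1.0, logits, experts, lambda x: x))
```

The point of the sketch is that 120 of the 128 routed experts are never evaluated for a given token, which is where the gap between 218B total and 25B active parameters comes from.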

Potential risks and opportunities

Risks

  • If the mlx-lm PR stalls without merge, downstream developers building on the cohere2_moe branch face an unmaintained fork as MLX evolves rapidly
  • Users loading the full 218B weights without confirming memory headroom risk system instability on 128GB unified-memory Macs, potentially discouraging adoption before official hardware guidance exists
  • Cohere's enterprise positioning could be complicated if the Apache 2.0 local path becomes the default for cost-sensitive buyers, reducing commercial API revenue leverage against OpenAI and Anthropic
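
The memory-headroom risk can be sanity-checked with simple arithmetic before downloading anything. The sketch below estimates the size of the weights alone at a given precision and compares it with a memory budget. The 25% reserve for the OS, KV cache, and activations is an illustrative guess, not a measured figure, and the helper names are hypothetical. Quantized formats also store per-group scale data, so effective bits per weight run somewhat above the nominal value.

```python
TOTAL_PARAMS = 218e9  # total parameter count, from the thread


def weight_footprint_gb(total_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def fits(total_params, bits_per_weight, unified_memory_gb,
         reserve_fraction=0.25):
    """True if the weights fit after reserving part of memory for everything else.

    reserve_fraction is an assumed placeholder for OS, KV cache, and
    activation overhead; adjust it to your own workload.
    """
    budget_gb = unified_memory_gb * (1.0 - reserve_fraction)
    return weight_footprint_gb(total_params, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_footprint_gb(TOTAL_PARAMS, bits)
        for mem_gb in (128, 192):  # the two configurations named in the thread
            verdict = "fits" if fits(TOTAL_PARAMS, bits, mem_gb) else "does not fit"
            print(f"{bits:>2}-bit weights ~{size:.0f} GB: {verdict} in {mem_gb} GB")
```

Under these assumptions, nominal 4-bit weights come to roughly 109 GB, which leaves little or no headroom on a 128GB machine. This is a back-of-the-envelope bound, not hardware guidance.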

Opportunities

  • Apple has a direct incentive to publicize Command A+ MLX performance on the M3 Ultra and any successor Ultra-class chip as a proof point for the enterprise AI positioning of its high-memory desktop Macs
  • MLX tooling vendors and fine-tuning platforms (Axolotl, Unsloth) could rapidly add Command A+ support to capture developer mindshare while the model is generating community momentum
  • Enterprises in regulated industries (legal, healthcare, finance) evaluating on-premises LLM deployments now have a credible 218B-parameter Apache 2.0 option to benchmark against hosted alternatives, strengthening the negotiating position of on-prem AI infrastructure vendors

What we don't know yet

  • Peak unified memory requirement for full-weight loading on M-series hardware: the thread does not confirm whether a 128GB MacBook Pro is enough or a 192GB Mac Studio is the minimum
  • Whether Cohere plans to officially support or co-maintain the MLX implementation, or whether mlx-lm will carry it as a community contribution only
  • Inference throughput benchmarks (tokens per second) on specific M-series chips are absent from the post and comments as of the thread date
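
Until real benchmarks appear, a rough ceiling on decode speed can be estimated from memory bandwidth. The estimate assumes that generating each token requires reading the active weights from memory once. The sketch below is a simplification under that assumption. It ignores KV-cache reads, compute time, and routing overhead, so measured throughput should land below it. The 800 GB/s bandwidth is a placeholder for illustration; substitute the published figure for your chip.

```python
ACTIVE_PARAMS = 25e9  # parameters active per token, from the thread


def decode_ceiling_tokens_per_s(active_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if decoding is limited only by weight reads.

    bandwidth_gb_s is memory bandwidth in GB/s (1 GB = 1e9 bytes).
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    assumed_bandwidth_gb_s = 800.0  # placeholder value, not a measured spec
    for bits in (16, 8, 4):
        ceiling = decode_ceiling_tokens_per_s(ACTIVE_PARAMS, bits,
                                              assumed_bandwidth_gb_s)
        print(f"{bits:>2}-bit weights: at most ~{ceiling:.0f} tokens/s")
```

The same formula applied to all 218B parameters gives a ceiling roughly nine times lower, which is why the 25B active figure matters more for decode speed than the total parameter count.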