reddit.com via Reddit May 18th 2026

oMLX Tops MLX Backends on Apple M5 Max for Qwen 35B

open source inference edge ai local-inference apple-silicon mlx

Key insights

oMLX posted the highest tokens-per-second throughput of the backends tested on an Apple M5 Max 64GB, beating mlx-lm on Qwen 3.6 35B-A3B-4bit.
MTPLX was benchmarked only on the 27B model, not the 35B-A3B used for oMLX and mlx-lm comparisons.
This is among the first published multi-backend MLX inference benchmarks specifically targeting M5 Max hardware.

Why this matters

Apple Silicon has matured into a viable local inference platform, and M5 Max-specific throughput data directly shapes hardware purchasing and stack decisions for developers running large models without cloud infrastructure. oMLX outperforming the default mlx-lm backend means developers relying on Apple's reference implementation may be leaving significant throughput on the table for MoE architectures like Qwen 3.6 35B-A3B. The near-total absence of published Apple Silicon benchmarks at this model scale has been a real friction point for teams evaluating whether M5 Max hardware justifies the cost over Nvidia alternatives for on-device inference.

Summary

A community developer published one of the first head-to-head comparisons of MLX inference backends on Apple M5 Max hardware. On a 64GB M5 Max running the 4-bit quantized Qwen 3.6 35B-A3B model, oMLX consistently outperformed the standard mlx-lm reference implementation on throughput. MTPLX, a speculative-decoding-based backend, was benchmarked only on the 27B model, so its performance on the 35B-A3B architecture remains unconfirmed. The full tokens-per-second numbers are on an external blog linked from the Reddit post, not in the post itself. In short, oMLX, mlx-lm, and MTPLX are competing inference stacks for local inference on Apple Silicon, and oMLX now holds the documented performance lead on the latest Mac hardware.

- oMLX delivered the highest throughput of the tested backends on M5 Max 64GB for Qwen 3.6 35B-A3B-4bit.
- MTPLX was tested only on the 27B model, so a like-for-like comparison against oMLX on the 35B-A3B architecture is still missing.
- Most published local-inference benchmarks target Nvidia GPUs, making M5 Max-specific data a practical gap this post begins to fill.

For Apple Silicon developers, backend selection directly affects which model sizes and throughput targets are achievable on local hardware without cloud fallback.

Potential risks and opportunities

Risks

Developers adopting oMLX based on a single community benchmark without methodology details could encounter undocumented instability or regression, since oMLX is not Apple's maintained reference implementation.
The incomplete MTPLX comparison (27B only) means any conclusion that speculative-decoding backends trail oMLX could be reversed once 35B-A3B data is published, making early stack commitments premature.
Hardware purchasing decisions for Apple Silicon versus Nvidia inference infrastructure made on the basis of this benchmark could be skewed if the token-per-second numbers don't hold under production prompt distributions or longer context windows.
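
One way to check whether a throughput figure holds across prompt lengths is to repeat the measurement over a sweep of prompt sizes. The sketch below is backend-agnostic and is not the original poster's harness: `measure_tps`, `sweep`, and `fake_backend` are hypothetical names, and `fake_backend` only sleeps to stand in for a real generation call. It times the whole call, so prefill and decode are lumped together.

```python
import time
from statistics import median
from typing import Callable, Dict, Iterable

# Hypothetical signature: generate(prompt_tokens, max_new_tokens) -> tokens produced.
GenerateFn = Callable[[int, int], int]


def measure_tps(generate: GenerateFn, prompt_tokens: int,
                max_new_tokens: int, repeats: int = 3) -> float:
    """Return the median end-to-end tokens per second over `repeats` runs."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(prompt_tokens, max_new_tokens)
        elapsed = time.perf_counter() - start
        if elapsed <= 0:
            raise ValueError("elapsed time must be positive")
        rates.append(produced / elapsed)
    return median(rates)


def sweep(generate: GenerateFn, prompt_lengths: Iterable[int],
          max_new_tokens: int = 128, repeats: int = 3) -> Dict[int, float]:
    """Measure throughput at each prompt length to expose context sensitivity."""
    return {n: measure_tps(generate, n, max_new_tokens, repeats)
            for n in prompt_lengths}


def fake_backend(prompt_tokens: int, max_new_tokens: int) -> int:
    """Stand-in for a real backend call; sleeps instead of running a model."""
    time.sleep(0.001 + prompt_tokens * 1e-6)
    return max_new_tokens


if __name__ == "__main__":
    for length, tps in sweep(fake_backend, [128, 2048, 8192]).items():
        print(f"prompt={length:>5} tokens  ->  {tps:,.0f} tok/s (synthetic)")
```

To use it against a real backend, replace `fake_backend` with a wrapper around that backend's own generation call. The printed numbers here are synthetic and say nothing about oMLX, mlx-lm, or MTPLX.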

Opportunities

oMLX maintainers can convert this benchmark visibility into accelerated adoption and contributor growth from the Apple Silicon inference developer community.
Local inference frontends like LM Studio and Ollama could surface oMLX as a selectable backend option, using the M5 Max data as justification for the integration investment.
Apple's developer relations team gains a concrete third-party performance narrative for M5 Max in large-model inference workloads, useful for enterprise sales targeting on-device AI deployment.

What we don't know yet

MTPLX performance on Qwen 3.6 35B-A3B is untested: the comparison covers only the 27B model, leaving speculative decoding's ceiling at this scale unknown.
Whether oMLX's throughput lead replicates on M5 Ultra or M4 Max configurations, which were not included in this benchmark run.
Benchmark methodology details (prompt length, context window size, batch configuration, and thermal state during runs) were not published inline, limiting reproducibility.
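
A reproducible benchmark report would record those methodology fields next to each throughput number. The following is a minimal sketch of such a record, assuming a simple per-run schema. The `BenchmarkRun` class, its field names, and the values in the usage example are illustrative placeholders, not data from the original post.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One benchmark run plus the methodology needed to reproduce it."""
    backend: str
    model: str
    hardware: str
    prompt_tokens: int
    generated_tokens: int
    context_window: int
    batch_size: int
    thermal_note: str      # free text, e.g. whether the machine was warm or idle
    decode_seconds: float

    @property
    def tokens_per_second(self) -> float:
        """Generated tokens divided by decode wall-clock time."""
        if self.decode_seconds <= 0:
            raise ValueError("decode_seconds must be positive")
        return self.generated_tokens / self.decode_seconds


def to_json(run: BenchmarkRun) -> str:
    """Serialize the run with its derived throughput for publication."""
    record = asdict(run)
    record["tokens_per_second"] = round(run.tokens_per_second, 2)
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only; not measured results.
    example = BenchmarkRun(
        backend="example-backend",
        model="example-model-4bit",
        hardware="example-machine",
        prompt_tokens=1024,
        generated_tokens=512,
        context_window=8192,
        batch_size=1,
        thermal_note="cold start",
        decode_seconds=8.0,
    )
    print(to_json(example))
```

Publishing one such record per run lets readers see whether two backends were compared under the same prompt length, context window, and batch settings.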

Originally reported by reddit.com


Original headline: r/LocalLLaMA: MLX Engine Comparison on M5 Max 64GB — oMLX Tops All Backends for Qwen 3.6 35B-A3B, MTPLX Tested on 27B