reddit.com via Reddit May 19th 2026

MTP Hurts Apple Silicon LLM Speed Up to 25%

inference open source local-ai

Key insights

Multi-Token Prediction (MTP) in llama.cpp reduces Apple Silicon inference speed by 20-25% in community benchmarks, with the pattern holding across M1, M2, and M3 chip generations.
Nvidia and AMD RDNA4 hardware sees 50-70%+ MTP speedups, making Apple Silicon a clear architectural outlier.
No draft token count (2, 3, or 4) recovers baseline throughput on Apple Silicon, ruling out simple parameter tuning.

Why this matters

Developers choosing Apple Silicon for local LLM inference specifically for its unified memory advantages now have evidence that a major llama.cpp optimization path is net-negative on their hardware, which changes deployment defaults and benchmark comparisons across the entire local-AI tooling ecosystem. The finding establishes a hardware-class split in MTP behavior that llama.cpp maintainers will need to address with architecture-specific dispatch or documentation, affecting how inference frameworks are packaged and configured for Mac-based users. For founders and teams evaluating on-device AI infrastructure, this is a concrete data point that hardware architecture determines which inference optimizations are applicable, and that cross-platform benchmark claims need hardware-class caveats to be actionable.

Summary

Multi-Token Prediction (MTP) in llama.cpp is actively harming throughput on Apple Silicon, with community benchmarks showing consistent degradation rather than the gains seen on competing hardware. An M2 Max with 96GB RAM running Qwen 3.6-27B drops from a ~12 t/s baseline to ~9-10 t/s across every MTP draft configuration tested: the 2-token, 3-token, and 4-token settings all perform worse than running without MTP. The pattern holds across M1, M2, and M3 generations, suggesting the cause is architectural rather than specific to a chip tier. On Nvidia and AMD RDNA4 hardware, the same MTP configurations produce 50-70%+ throughput gains.

In short, with llama.cpp on Apple Silicon, the memory bandwidth and unified memory architecture that make Apple chips attractive for local inference appear to be fundamentally incompatible with the speculative decoding assumptions MTP relies on.

- Throughput loss is 20-25% at the tested scale, not marginal
- No draft token count (2, 3, or 4) recovers the baseline, ruling out tuning as a fix
- Tool-heavy agentic pipelines were already flagged as a second class of workloads where MTP underperforms

For the growing base of developers running local models on Apple hardware, MTP should be treated as a disabled-by-default feature until llama.cpp produces an architecture-specific implementation.
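The percentage loss can be checked directly from the reported tokens-per-second figures. The sketch below is only arithmetic on the approximate numbers quoted in this summary (~12 t/s baseline, ~9-10 t/s with MTP); the function name `percent_change` is illustrative and not part of any benchmark tool.

```python
def percent_change(baseline_tps: float, measured_tps: float) -> float:
    """Relative throughput change versus baseline, in percent (negative = slower)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (measured_tps - baseline_tps) / baseline_tps * 100.0


baseline = 12.0  # approximate t/s without MTP, as quoted in the summary
for mtp_tps in (9.0, 10.0):  # approximate range quoted with MTP enabled
    delta = percent_change(baseline, mtp_tps)
    print(f"{mtp_tps:.0f} t/s vs {baseline:.0f} t/s baseline: {delta:+.1f}%")
```

On these rounded inputs the drop works out to 25% at 9 t/s and about 17% at 10 t/s, so the quoted 20-25% loss corresponds to the lower part of the ~9-10 t/s range.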

Potential risks and opportunities

Risks

llama.cpp-based application developers who enable MTP globally across hardware targets will ship Mac builds that are measurably slower than their advertised benchmarks, eroding user trust in local AI tools
Apple faces reputational risk in the local AI developer community if the M3 Ultra and M4 family launch with the same MTP regression, reinforcing a narrative that Apple Silicon optimizes for marketing benchmarks over practical inference workloads
Quantized model providers (Unsloth, froggeric) whose releases are benchmarked with MTP enabled on Nvidia may see misleading performance comparisons that disadvantage Apple Silicon users downloading their weights

Opportunities

Inference runtime vendors targeting Apple Silicon (MLX team at Apple, llama.cpp contributors) have a clear gap to fill by shipping a Metal-native speculative decoding implementation that accounts for unified memory bandwidth constraints
Benchmark transparency tooling providers could build hardware-class-aware inference dashboards that surface MTP on/off deltas per chip family, filling a gap that community Reddit threads currently cover informally
Apple Silicon cloud providers (RunPod Mac instances, MacStadium) can differentiate by publishing honest MTP-disabled throughput benchmarks, positioning themselves as reliable reference points for local-model developers choosing hardware

What we don't know yet

Whether Apple's Metal Performance Shaders or ANE could be used to implement a variant of speculative decoding that avoids the bottleneck causing the regression on unified memory architectures
Whether the throughput degradation scales with model size or memory pressure, or whether it is constant overhead regardless of quantization level or parameter count
Whether llama.cpp maintainers plan to add Apple Silicon detection to auto-disable MTP, or whether the fix requires a deeper architectural change to the draft-token evaluation path
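As a rough illustration of what application-level auto-disable could look like, the sketch below picks an MTP default from the host platform. It is a hypothetical policy written for this summary, not llama.cpp code: the function `mtp_default` is an assumption, and how an application would actually pass the resulting setting to its inference runtime is left open.

```python
import platform
from typing import Optional


def mtp_default(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """Hypothetical policy: MTP off by default on Apple Silicon, on elsewhere.

    Arguments default to the current host; pass explicit values to evaluate
    the policy for another target.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # Native arm64 macOS is treated as Apple Silicon. A process running under
    # x86_64 translation may report a different machine string, so a real
    # check would likely need something more robust.
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    return not on_apple_silicon


print("MTP enabled by default on this host:", mtp_default())
```

A default like this is only a stopgap: it encodes the benchmark result as a hardware-class rule rather than addressing whatever causes the regression in the draft-token evaluation path.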

Originally reported by reddit.com

Original headline: r/LocalLLaMA: MTP Consistently Slower on Apple Silicon — M2 Max 96GB Drops From 12 t/s to 9 t/s Across All Draft Configs on Qwen 3.6-27B