reddit.com via Reddit May 20th 2026

Qwen3.6 35B MoE Loses MTP Speedup at 128k Context

open source inference local-inference mtp moe-models

Key insights

MTP provides no measurable throughput gain on Qwen3.6 35B MoE at 128k context, unlike its roughly 2x boost on dense models.
Attention cost dominates inference at 128k, absorbing the entire speedup MTP's draft-verify step would otherwise deliver.
RTX 5080 16GB achieves 56 tok/s on Qwen3.6 35B MoE at 128k, establishing a concrete consumer-GPU production baseline.

Why this matters

Practitioners deploying MoE models for coding agents at 128k context cannot rely on MTP to improve throughput: attention cost is the primary bottleneck at that scale, and none of the configurations compared in the benchmark changed that. The llama.cpp MTP merge generated broad enthusiasm for local inference acceleration, but this benchmark isolates an architectural condition under which that gain collapses to zero. Consumer hardware buyers and AI product teams targeting MoE inference at long context should not expect MTP to raise their effective tokens-per-second ceiling when planning deployments.

Summary

MTP (Multi-Token Prediction) delivers no measurable throughput gain on Mixture-of-Experts models at 128k context, according to benchmarks run on an RTX 5080 16GB with Qwen3.6 35B MoE. A developer ran three configuration comparisons and found that attention cost alone consumes the entire draft-verify benefit that MTP provides on dense models. At production coding-agent context lengths, attention dominates the inference budget so completely that speculative drafting has nothing left to recover. In short, the llama.cpp MTP merge that generated community enthusiasm does not carry over to Qwen3.6 35B MoE at 128k context.

- RTX 5080 16GB achieves 56 tok/s on Qwen3.6 35B MoE at 128k, a practical consumer-GPU production baseline.
- MTP delivers roughly 2x gains on dense models but zero gain on MoE at the same context length.
- Attention cost, not draft-verify overhead, is the binding constraint at long context on MoE architectures.

This result draws a hard architectural line between dense-model MTP optimism and MoE reality at the context lengths real coding agents actually use.
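The claim that attention cost absorbs the MTP gain can be illustrated with a toy Amdahl's-law calculation. This is a sketch under stated assumptions, not the benchmark author's method: it assumes MTP speeds up only the non-attention share of per-token time by the roughly 2x reported for dense models and leaves attention time untouched. The function name and the attention shares in the loop are hypothetical values chosen for illustration, not measurements.

```python
def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work gets faster.

    accelerated_fraction: share of per-token time that the optimization speeds up.
    local_speedup: speedup factor applied to that share alone.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / local_speedup)


# Hypothetical attention shares of per-token time (illustrative, not measured).
for attention_share in (0.10, 0.50, 0.90, 0.99):
    # Assumption: MTP accelerates everything except attention by 2x.
    gain = overall_speedup(1.0 - attention_share, 2.0)
    print(f"attention share {attention_share:.0%}: overall speedup {gain:.2f}x")
```

Under these assumptions the end-to-end gain shrinks toward 1x as the attention share grows. The model never reaches exactly 1x on its own, so a measured gain of zero would suggest either a very high attention share or extra draft-verify overhead that this sketch does not capture.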

Potential risks and opportunities

Risks

Developers who purchased RTX 5080 hardware specifically for MTP-accelerated MoE inference at 128k context may not see the gains they expected, creating hardware ROI risk for early adopters who sized builds around MTP throughput projections.
llama.cpp maintainers face documentation credibility pressure if MTP guidance does not clearly segment dense-model versus MoE performance expectations, as community benchmarks now contradict the implied general applicability.
AI coding agent products built on MoE backends at 128k context face a latency ceiling that MTP cannot address, limiting throughput differentiation against dense-model competitors that do benefit from MTP.

Opportunities

Attention optimization library teams behind FlashInfer and vLLM have a clear opening to demonstrate MoE-specific attention kernels that improve throughput at 128k and capture mindshare among local inference developers.
Cloud inference providers such as Together AI and Fireworks AI running dense models with MTP can use this benchmark to sharpen their throughput story against local MoE deployments on consumer hardware.
Next-generation GPU vendors including AMD with MI350 and Nvidia with RTX 5090 successors can prioritize attention throughput at 128k as a differentiated benchmark for MoE workloads, where current hardware hits a hard ceiling.

What we don't know yet

Whether MTP speedup recovers at shorter context lengths such as 8k to 32k on the same Qwen3.6 MoE architecture and RTX 5080 hardware.
Whether the zero-gain finding holds across other MoE models including Mixtral and DeepSeek-V2, or is specific to Qwen3.6's expert routing behavior.
Whether any attention optimization technique such as FlashAttention variants or sparse attention could restore MTP effectiveness on MoE at 128k context.

Originally reported by reddit.com


Original headline: r/LocalLLaMA: RTX 5080 16GB Benchmarks Qwen3.6 35B MoE at 128k Context — MTP Delivers Zero Speedup on MoE Models, Attention Dominates