reddit.com via Reddit

Llama.cpp MTP boosts Qwen3 coding speed 71%

inference open source local-inference mtp-benchmarks nvidia-gpu

Key insights

  • MTP in llama.cpp delivers a measured 71% decode speedup on Qwen3.6-27B within a real coding session, not a synthetic benchmark.
  • The RTX 3090 with 24GB VRAM and Q4_K_M quantization is now a confirmed reference configuration for production MTP use.
  • Setting --draft-p-min 0.85 keeps prefill overhead negligible, resolving a key community concern about MTP quality tradeoffs.

Why this matters

For AI practitioners self-hosting large models, this is a validated configuration showing that speculative decoding via MTP can cross a threshold where latency improvements are felt interactively in coding workflows, not just measured in benchmarks. Founders building on local inference infrastructure now have a concrete data point for provisioning 27B-class models on single-GPU nodes without upgrading hardware. The broader implication is that the effective throughput ceiling on consumer GPUs is still moving upward through software optimization alone, which changes the calculus on when to invest in multi-GPU or cloud inference scaling.

Summary

A developer running Qwen3.6-27B on a headless RTX 3090 has documented a 71% decode throughput gain by enabling Multi-Token Prediction (MTP) in llama.cpp, pushing output from 38 tok/s to 65 tok/s in a live OpenCode coding session with 128K context active. The configuration used --spec-draft-n-max 3 and --draft-p-min 0.85, keeping the draft acceptance threshold high enough that prefill overhead stayed negligible. The post includes both prefill and decode metrics, which is the specific data the local LLM community had been waiting for before committing to MTP in production workflows. Essentially: (Meta's Llama.cpp project, Qwen team at Alibaba) now have a concrete 24GB VRAM reference point showing MTP is production-ready for coding tasks. - Decode rate: 38 tok/s baseline to 65 tok/s with MTP, measured in a real OpenCode session rather than a synthetic benchmark. - Draft parameters: --spec-draft-n-max 3 limits speculative tokens per step; --draft-p-min 0.85 rejects low-confidence drafts, keeping quality high. - Hardware ceiling: RTX 3090 with 24GB VRAM is the most common high-end consumer GPU, making this result directly replicable for a large segment of the local-model community. The result moves MTP from a promising feature to a validated throughput lever for anyone running 27B-class models on a single consumer GPU.

Potential risks and opportunities

Risks

  • Developers who tune --draft-p-min below 0.85 chasing higher throughput may silently degrade output quality in production coding agents without a clear failure signal.
  • The result is specific to Q4_K_M quantization on a 24GB card; users attempting MTP on 16GB cards (RTX 3080/4080) may hit VRAM ceilings that invalidate the configuration entirely.
  • If Qwen or the llama.cpp maintainers change MTP weight format or draft sampling logic in upcoming releases, production deployments built on this specific config could regress without warning.

Opportunities

  • Local inference UI projects (LM Studio, Open WebUI, Jan) can differentiate by surfacing MTP configuration presets tuned to this validated reference, lowering setup friction for non-technical users.
  • Cloud GPU rental platforms (RunPod, Vast.ai) offering RTX 3090 nodes can market single-GPU Qwen3 MTP configurations as a cost-competitive alternative to multi-GPU inference for coding workloads.
  • Quantization tooling teams (llama.cpp, ExLlamaV2, mlx-lm) have a concrete benchmark to target for competitive MTP support, with the 65 tok/s bar now publicly set on 24GB hardware.

What we don't know yet

  • Whether the 71% speedup holds at shorter context lengths (under 32K) or degrades as the draft model's context advantage shrinks.
  • No data on acceptance rate variance across different coding task types -- whether the 0.85 draft threshold stays optimal for non-Python or mixed-language sessions.
  • Whether Qwen3.6-27B-MTP weights are the only model family currently optimized for this llama.cpp MTP path, or if other 27B-class models can replicate the result.