reddit.com via Reddit

Gemma 4 MTP Benchmark Hits 3.34x Lossless Speedup

open source inference local-llm speculative-decoding benchmarks

Key insights

  • MTP on Gemma 4 31B and Qwen 3.6 27B achieved up to 3.34x lossless speedup with 70-85% draft acceptance rates on an RTX 6000 PRO.
  • Unlike traditional speculative decoding, MTP (multi-token prediction) requires no separate draft model: the extra prediction heads are trained into the model itself, as in DeepSeek V3-style training.
  • Full configuration steps for both vLLM FP8 and llama.cpp GGUF backends are now publicly documented by a community developer.

Why this matters

Because MTP is built in during training, model quality and inference speed are now co-optimized at training time, which changes how teams evaluate and deploy open-weight models. A reported 3.34x speedup on prosumer hardware with no quality loss directly lowers cost-per-token for local and on-premise deployments, and could make previously cost-prohibitive workloads viable on single-GPU setups. As Gemma 4 and Qwen 3 gain community-validated MTP configs, the window for inference optimization startups selling separate speed layers is narrowing.

Summary

MTP tests on Gemma 4 31B and Qwen 3.6 27B hit up to 3.34x lossless speedup on an RTX 6000 PRO, per one of LocalLLaMA's top technical threads this week. The gains come from 70-85% draft acceptance rates, versus the roughly 50% ceiling of traditional speculative decoding. MTP is native to DeepSeek V3-style training and needs no separate draft model. In short, major open-weight labs (Google DeepMind, Alibaba Qwen, DeepSeek) now ship models where inference speed is baked in at training time.

  • 3.34x peak, lossless, on RTX 6000 PRO with vLLM FP8 and llama.cpp GGUF
  • Draft acceptance of 70-85% vs roughly 50% for standard speculative decoding
  • Full configs published; community-reproducible on prosumer hardware

Local inference is splitting into teams running MTP and those still on unoptimized stacks.
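To see why the gap between roughly 50% and 70-85% acceptance matters, here is a rough sketch of the standard speculative-decoding estimate for tokens emitted per verification pass. It makes two assumptions the thread does not confirm: the acceptance rate is treated as an independent per-token probability, and the draft length (`DRAFT_LEN`) is a hypothetical value chosen only for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Each of the draft_len drafted tokens is assumed to be accepted
    independently with probability acceptance_rate; drafting stops at the
    first rejection, and the verifier always contributes one token.
    This is the geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Closed form below divides by zero here; every draft is accepted.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    DRAFT_LEN = 3  # hypothetical; the benchmark's draft length is not stated here
    for rate in (0.50, 0.70, 0.85):
        tokens = expected_tokens_per_step(rate, DRAFT_LEN)
        print(f"acceptance {rate:.0%}: ~{tokens:.2f} tokens per pass")
```

The result is an upper bound on speedup, not a prediction: it ignores the cost of producing the drafts, and measured gains also depend on hardware, batch size, and backend.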

Potential risks and opportunities

Risks

  • Practitioners deploying MTP based on this single community benchmark may encounter acceptance rate drops on domain-specific workloads, producing unexpected latency regressions in production pipelines
  • vLLM and llama.cpp MTP implementations are still evolving; a breaking config change in either backend could silently degrade performance for teams running unmonitored inference deployments
  • The 3.34x speedup is hardware-specific to the RTX 6000 PRO; cloud providers serving Gemma 4 and Qwen 3 on A100 or H100 clusters may see materially different gains, setting false expectations for teams moving from local to cloud inference

Opportunities

  • Nvidia can position the RTX 6000 PRO as the reference platform for MTP benchmarking, reinforcing prosumer GPU sales targeting local inference practitioners through Q3 2026
  • vLLM and llama.cpp maintainers can accelerate MTP documentation and tooling to serve the growing practitioner base standardizing on MTP as its default inference config
  • Inference API providers such as Fireworks AI, Together AI, and Anyscale can differentiate by offering MTP-enabled endpoints for Gemma 4 and Qwen 3 with verified acceptance rate disclosures as a marketing lever

What we don't know yet

  • Whether MTP acceptance rates hold at 70-85% across diverse prompt distributions such as code, long-form prose, and multi-turn chat, or are specific to the benchmark workloads tested
  • Whether the RTX 6000 PRO results translate to consumer cards like the RTX 4090 or 3090, given their different memory bandwidth and VRAM profiles
  • Whether Google DeepMind and the Qwen team plan to publish official MTP benchmarks and certified inference configurations for their own model releases
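
The first open question above can be checked locally by logging drafted and accepted token counts per request and grouping them by workload type. The sketch below assumes a hypothetical `StepRecord` log format; neither vLLM nor llama.cpp is claimed to emit records in this shape, and the sample numbers are made up purely to exercise the function.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class StepRecord(NamedTuple):
    """Hypothetical per-request log entry (not a real backend format)."""
    workload: str   # e.g. "code", "prose", "chat"
    drafted: int    # draft tokens proposed
    accepted: int   # draft tokens the verifier accepted


def acceptance_by_workload(records: Iterable[StepRecord]) -> Dict[str, float]:
    """Return accepted/drafted per workload, skipping workloads with no drafts."""
    totals = defaultdict(lambda: [0, 0])  # workload -> [drafted, accepted]
    for rec in records:
        totals[rec.workload][0] += rec.drafted
        totals[rec.workload][1] += rec.accepted
    return {
        workload: accepted / drafted
        for workload, (drafted, accepted) in totals.items()
        if drafted > 0
    }


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        StepRecord("code", 4, 3),
        StepRecord("code", 4, 4),
        StepRecord("chat", 4, 2),
    ]
    for workload, rate in sorted(acceptance_by_workload(sample).items()):
        print(f"{workload}: {rate:.1%} acceptance")
```

Aggregating token counts before dividing, rather than averaging per-request rates, keeps short requests from skewing the result.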