Gemma 4 MTP Support Proposed for llama.cpp
Key insights
- llama.cpp's MTP support, validated on Qwen3, is being extended to Google's Gemma 4 family via PR #23398.
- The Gemma 4 31B-A4B MoE variant has already been benchmarked locally on RTX 5060 Ti consumer hardware.
- MTP raises generation speed by predicting multiple tokens per forward pass, which reduces the total number of inference steps.
Why this matters
Multi-Token Prediction is one of the most practical near-term throughput levers for local inference, and its expansion to Gemma 4 means another capable open-weight model family becomes meaningfully faster on consumer and prosumer hardware without changing the model. For founders and teams building on llama.cpp-based stacks, Gemma 4 MoE becoming a first-class MTP target affects which models are viable for latency-sensitive local deployments. The pattern also signals that MTP is becoming a baseline expectation in llama.cpp rather than a model-specific feature, which could pressure model authors to publish MTP-compatible weights by default.
Summary
A work-in-progress pull request in ggml-org/llama.cpp aims to extend Multi-Token Prediction support to Google's Gemma 4 model family, including the 31B-A4B MoE variant. The PR (#23398) follows the recent mainline MTP merge that landed primarily for Qwen3 models, and would bring the same inference throughput gains to the Gemma 4 architecture.
MTP allows the model to predict multiple tokens per forward pass rather than one, reducing total forward passes and improving generation speed on compatible hardware. The gains already validated on Qwen3 and other dense models could carry over to Gemma 4 if the PR merges cleanly.
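The reduction in forward passes can be estimated with a back-of-the-envelope model. The sketch below is illustrative only and is not taken from the PR. It assumes a draft-and-verify scheme in which each drafted token is accepted independently with a fixed probability, and in which every pass yields at least one token. The function names, the acceptance probability, and the draft length are hypothetical, and the model ignores the extra cost of running the prediction heads.

```python
import math


def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per forward pass under a simplified model.

    Assumption (not from the PR): each drafted token is accepted
    independently with probability accept_prob, drafts are accepted in
    order until the first rejection, and the pass always contributes one
    token of its own. That gives sum_{i=0}^{k} p^i.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_prob ** i for i in range(draft_len + 1))


def forward_passes(n_tokens: int, accept_prob: float, draft_len: int) -> int:
    """Approximate number of forward passes needed to emit n_tokens."""
    per_pass = expected_tokens_per_pass(accept_prob, draft_len)
    return math.ceil(n_tokens / per_pass)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the trade-off.
    for p in (0.0, 0.5, 0.8, 1.0):
        print(p, forward_passes(1000, p, draft_len=3))
```

With an acceptance probability of zero the model falls back to one token per pass, and with perfect acceptance each pass emits the full draft plus one token. Real speedups depend on measured acceptance rates and per-pass overhead, neither of which has been reported for Gemma 4.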
The key players are Google's Gemma 4 and the llama.cpp community, with the open-source inference runtime catching up to the specifics of Gemma 4's architecture.
- The 31B-A4B MoE variant has already been benchmarked on RTX 5060 Ti hardware, making it a concrete local-inference target.
- MTP support is currently WIP and not yet in a nightly build, so community benchmarking of MTP on Gemma 4 has not started.
- The Qwen3 MTP merge set the structural precedent this PR is extending.
As MTP support widens across model families in llama.cpp, the throughput gap between local and cloud inference continues to narrow for capable open models.
Potential risks and opportunities
Risks
- If the PR introduces correctness regressions in Gemma 4 MoE token routing, early nightly adopters could see silent output quality degradation before the bug is caught.
- llama.cpp maintainers face review bandwidth pressure as MTP support requests multiply across model families simultaneously, potentially slowing merges for all pending PRs.
- Community benchmarking on RTX 5060 Ti hardware may not generalize to older VRAM-constrained cards, narrowing the practical audience for the speedup and creating fragmented performance expectations.
Opportunities
- Hardware vendors targeting local AI workloads (Nvidia, with RTX 5060 Ti already in the benchmark chain) gain a concrete marketing data point as Gemma 4 MoE becomes faster on consumer GPUs.
- Developers building Gemma 4-based local applications on llama.cpp runtimes (LM Studio, Ollama, Jan) can position MTP-enabled Gemma 4 as a competitive alternative to cloud API calls for throughput-sensitive tasks once the PR ships.
- Google DeepMind benefits from community-driven inference optimization at no cost, strengthening Gemma 4's position as a local-deployment-friendly open model family relative to competitors without active llama.cpp integration.
What we don't know yet
- Whether the PR correctly handles Gemma 4's MoE routing during multi-token prediction steps, which differs architecturally from dense Qwen3 models.
- When PR #23398 will land in a nightly llama.cpp build. There is no confirmed timeline, which leaves community benchmarking on hold.
- Whether MTP speedup ratios on Gemma 4 31B-A4B will match the gains seen on Qwen3 dense models, given the MoE sparsity difference.
Originally reported by github.com
Original headline: r/LocalLLaMA: WIP Gemma 4 MTP Support Proposed in llama.cpp PR #23398