llama.cpp Adds Gemma-4 E2B/E4B MTP Assistant Support
Key insights
- PR #24282, merged June 8, adds Gemma-4 E2B/E4B assistant model support for multi-token prediction in llama.cpp.
- On Galaxy S26+ with n-draft=3, the drafter achieved 47.89% acceptance and a gain of roughly 1-2 TPS over baseline.
- llama-speculative crashes loading these models at merge; the widely cited 40% speedup figure is absent from the PR.
Why this matters
llama.cpp's MTP infrastructure now covers Gemma-4's smallest assistant models, making speculative decoding available on mobile hardware like the Galaxy S26+. The merge also exposes a tooling gap: llama-speculative crashes on load with E2B/E4B models and llama-bench cannot load the drafter independently, so llama.cpp's non-server inference paths lag behind llama-server in MTP readiness. Performance claims circulating about this PR are significantly overstated: the PR's benchmarks show 47.89% acceptance and 14.04 tokens/sec generation on a Galaxy S26+, and the 40% speedup cited externally appears nowhere in it.
Summary
PR #24282, merged June 8 by max-krasnyansky, adds Gemma-4 E2B/E4B assistant model support for multi-token prediction in llama.cpp.
The core work centered on two new tensors: masked_embedding.centroids.weight and masked_embedding.token_ordering. Reviewer CISC pushed to filter them from conversion entirely rather than mark them optional, since metadata support for them was absent.
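To make the "filter rather than mark optional" idea concrete, here is a minimal sketch of dropping the two tensors before anything is written out. It is not the converter's actual code: the helper name `filter_tensors`, the dictionary layout, and the exact-name match are all assumptions made for illustration, and the other tensor names in the example are placeholders.

```python
# Tensor names as reported in the PR discussion; everything else here is hypothetical.
SKIPPED_TENSORS = {
    "masked_embedding.centroids.weight",
    "masked_embedding.token_ordering",
}


def filter_tensors(tensors: dict[str, object]) -> dict[str, object]:
    """Return a copy of the tensor map without the skipped tensors.

    Hypothetical helper: assumes tensors are keyed by their exact name.
    """
    return {name: data for name, data in tensors.items() if name not in SKIPPED_TENSORS}


# Placeholder checkpoint contents, for illustration only.
checkpoint = {
    "example.layer.weight": [0.1, 0.2],
    "masked_embedding.centroids.weight": [0.3],
    "masked_embedding.token_ordering": [2, 0, 1],
}

kept = filter_tensors(checkpoint)
print(sorted(kept))
```

Filtering at this stage means the output file never contains tensors the loader has no metadata for, whereas marking them optional would leave them in the file and push the decision to load time.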
In short, the PR (participants: max-krasnyansky, ggerganov, CISC) extends llama.cpp's Gemma-4 MTP path to cover the compact E-series assistant models.
- On Galaxy S26+, the drafter hit 47.89% acceptance (125 accepted of 261 generated) and delivered a gain of about 1-2 TPS with n-draft=3; prompt eval ran at 682.51 tokens/sec, generation at 14.04 tokens/sec.
- llama-speculative crashes on load with these models; llama-bench cannot load the drafter independently.
The 40% speedup figure in some coverage does not appear in this PR. Measured gains are real but modest, and the speculative toolchain gap remains open.
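The gap between the reported numbers and the 40% claim can be checked with a few lines of arithmetic. The sketch below assumes the 14.04 tokens/sec figure is the throughput with MTP enabled, which the summary does not state explicitly; the function names are illustrative only.

```python
def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of generated draft tokens that were accepted."""
    return accepted / generated


def relative_speedup(tps_with_mtp: float, gain_tps: float) -> float:
    """Relative gain over baseline, assuming baseline = tps_with_mtp - gain_tps."""
    baseline = tps_with_mtp - gain_tps
    return gain_tps / baseline


rate = acceptance_rate(125, 261)

# Assumption: 14.04 tokens/sec already includes the 1-2 TPS gain.
low = relative_speedup(14.04, 1.0)
high = relative_speedup(14.04, 2.0)

print(f"acceptance: {rate:.2%}")
print(f"implied speedup: {low:.1%} to {high:.1%}")
```

Under that assumption the implied speedup is roughly 8% to 17%. If 14.04 tokens/sec were instead the baseline, the range would be slightly lower, so either reading lands well short of 40%.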
Potential risks and opportunities
Risks
- Developers relying on llama-speculative for non-server speculative decoding will hit a hard crash loading Gemma-4 E2B/E4B until the 'Gemma4Assistant requires ctx_other to be set' error is resolved.
- The 47.89% acceptance rate and 1-2 TPS gain on Galaxy S26+ with n-draft=3 may not justify MTP overhead on lower-end hardware, leaving teams with neutral or negative throughput after integration.
- Inflated 40% speedup claims circulating externally could push developers to ship integrations against benchmarks that do not appear in this PR, producing user-facing performance gaps.
Opportunities
- llamafile and other llama.cpp downstream forks referenced in the PR can now expose Gemma-4 E2B/E4B MTP to end users with minimal additional porting work.
- The Galaxy S26+ Hexagon backend run provides the first real-world Gemma-4 MTP data point on mobile NPU hardware, giving teams optimizing for on-device inference a concrete calibration baseline.
- Contributors who fix the llama-speculative crash and llama-bench drafter loading gap will unblock Gemma-4 E2B/E4B speculative decoding across the full non-server llama.cpp ecosystem.
What we don't know yet
- Whether the llama-speculative crash ('Gemma4Assistant requires ctx_other to be set') will be fixed before Gemma-4 E2B/E4B see broad non-server deployment.
- This PR includes no benchmark data beyond Galaxy S26+; performance on desktop GPUs or other mobile SoCs is unknown.
- The masked embedding tensors are described as a draft-step speed win rather than an acceptance win, but no ablation data in the PR quantifies their individual contribution.
Originally reported by github.com
Original headline: llama.cpp Merges Gemma-4 E2B/E4B Multi-Token Prediction Support, Delivering 40% Local Inference Speedup