github.com via Reddit June 7th 2026

llama.cpp Adds Gemma-4 E2B/E4B MTP Assistant Support


Key insights

PR #24282, merged June 8, adds Gemma-4 E2B/E4B assistant model support for multi-token prediction in llama.cpp.
On a Galaxy S26+ with n-draft=3, the drafter achieved 47.89% acceptance and a gain of roughly 1-2 tokens/sec over baseline.
llama-speculative crashes when loading these models as of the merge; the widely cited 40% speedup figure is absent from the PR.

Why this matters

llama.cpp's MTP infrastructure now covers Gemma-4's smallest assistant models, making speculative decoding available on mobile hardware like the Galaxy S26+. The merge also exposes a tooling gap: llama-speculative and llama-bench both fail with E2B/E4B models, so llama.cpp's non-server inference paths lag behind llama-server in MTP readiness. Performance claims circulating about this PR are significantly overstated: the actual benchmarks show 47.89% acceptance and 14.04 tokens/sec generation on a Galaxy S26+, not the 40% speedup being cited externally.

Summary

PR #24282, merged June 8 by max-krasnyansky, adds Gemma-4 E2B/E4B assistant model support for multi-token prediction (MTP) in llama.cpp. The core work centered on two new tensors: masked_embedding.centroids.weight and masked_embedding.token_ordering. Reviewer CISC pushed to filter them out at conversion entirely rather than mark them optional, since metadata support for them was absent.

In short, the contributors (max-krasnyansky, ggerganov, CISC) extended llama.cpp's Gemma-4 MTP path to cover the compact E-series assistant models.

- On Galaxy S26+, the drafter hit 47.89% acceptance (125 of 261 drafted tokens) and delivered a gain of about 1-2 tokens/sec with n-draft=3; prompt eval ran at 682.51 tokens/sec, generation at 14.04 tokens/sec.
- llama-speculative crashes on load with these models; llama-bench cannot load the drafter independently.

The 40% speedup figure in some coverage does not appear in this PR. Measured gains are real but modest, and the speculative toolchain gap remains open.
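The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the acceptance rate from the 125/261 counts and estimates the relative speedup implied by a 1-2 tokens/sec gain. It assumes, which the summary does not state explicitly, that the 14.04 tokens/sec generation figure was measured with the drafter enabled; the variable and function names are illustrative only.

```python
# Counts and rates as quoted in the summary of PR #24282.
accepted_draft_tokens = 125
generated_draft_tokens = 261
generation_tps = 14.04  # ASSUMPTION: measured with the drafter enabled

# Acceptance rate = accepted draft tokens / drafted tokens.
acceptance_rate = accepted_draft_tokens / generated_draft_tokens


def implied_speedup(tps_with_drafter: float, gain_tps: float) -> float:
    """Relative speedup over a baseline that is `gain_tps` slower."""
    baseline_tps = tps_with_drafter - gain_tps
    return tps_with_drafter / baseline_tps - 1.0


# The summary reports a gain of "about 1-2" tokens/sec, so bracket both ends.
speedup_low = implied_speedup(generation_tps, 1.0)
speedup_high = implied_speedup(generation_tps, 2.0)

print(f"acceptance rate: {acceptance_rate:.2%}")
print(f"implied speedup: {speedup_low:.1%} to {speedup_high:.1%}")
```

Under that assumption the implied speedup lands well below 40%, which is consistent with the summary's point that the headline figure does not come from this PR.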

Potential risks and opportunities

Risks

Developers relying on llama-speculative for non-server speculative decoding will hit a hard crash loading Gemma-4 E2B/E4B until the 'Gemma4Assistant requires ctx_other to be set' error is resolved.
The 47.89% acceptance rate and 1-2 TPS gain on Galaxy S26+ with n-draft=3 may not justify MTP overhead on lower-end hardware, leaving teams with neutral or negative throughput after integration.
Inflated 40% speedup claims circulating externally could push developers to ship integrations against benchmarks that do not appear in this PR, producing user-facing performance gaps.
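To make the overhead risk concrete, here is a minimal sketch using the simplified independent-acceptance model common in the speculative decoding literature. It is not taken from the PR. It assumes each drafted token is accepted independently with probability alpha, that one verification pass costs the same as one ordinary target-model step, and that each draft step costs a fixed fraction of a target step. Treating the measured 47.89% as alpha is itself an approximation, and all function names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model pass.

    ASSUMPTION: each drafted token is accepted independently with
    probability `alpha`; the pass always yields at least one token.
    """
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, n_draft: int, draft_cost_ratio: float) -> float:
    """Throughput relative to plain decoding under the simplified model.

    `draft_cost_ratio` is the (hypothetical) cost of one draft step
    divided by the cost of one target-model step.
    """
    cost_per_step = 1.0 + n_draft * draft_cost_ratio
    return expected_tokens_per_step(alpha, n_draft) / cost_per_step


def break_even_cost_ratio(alpha: float, n_draft: int) -> float:
    """Draft cost ratio at which the modeled speedup is exactly 1.0."""
    return (expected_tokens_per_step(alpha, n_draft) - 1.0) / n_draft


ALPHA = 125 / 261  # acceptance rate reported for the Galaxy S26+ run
N_DRAFT = 3        # n-draft setting reported for that run

tokens_per_step = expected_tokens_per_step(ALPHA, N_DRAFT)
break_even = break_even_cost_ratio(ALPHA, N_DRAFT)

print(f"expected tokens per target pass: {tokens_per_step:.2f}")
print(f"break-even draft cost ratio:     {break_even:.2f}")
```

In this model, any device where a draft step costs more than the break-even fraction of a target step would see neutral or negative throughput. Real behavior depends on batching, memory bandwidth, and backend details that the model ignores.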

Opportunities

llamafile and other llama.cpp downstream forks referenced in the PR can now expose Gemma-4 E2B/E4B MTP to end users with minimal additional porting work.
The Galaxy S26+ Hexagon backend run provides the first real-world Gemma-4 MTP data point on mobile NPU hardware, giving teams optimizing for on-device inference a concrete calibration baseline.
Contributors who fix the llama-speculative crash and llama-bench drafter loading gap will unblock Gemma-4 E2B/E4B speculative decoding across the full non-server llama.cpp ecosystem.

What we don't know yet

Whether the llama-speculative crash ('Gemma4Assistant requires ctx_other to be set') will be fixed before Gemma-4 E2B/E4B see broad non-server deployment.
This PR includes no benchmark data beyond Galaxy S26+; performance on desktop GPUs or other mobile SoCs is unknown.
The masked embedding tensors are described as a draft-step speed win rather than an acceptance win, but no ablation data in the PR quantifies their individual contribution.

Originally reported by github.com


Original headline: llama.cpp Merges Gemma-4 E2B/E4B Multi-Token Prediction Support, Delivering 40% Local Inference Speedup