github.com via Reddit

llama.cpp Boosts WebGPU K-Quant Prefill up to 3.78x

open source inference local-inference webgpu llama-cpp quantization

Key insights

  • Extracting four quantization values per u32 element instead of one drives prefill gains of 1.33x to 3.78x across Q2_K through Q6_K on M2 Pro.
  • Q3_K sees the highest gain at 3.78x on Gemma4 E4B, jumping from 79.06 to 298.73 tokens per second at pp512.
  • The refactor ships in llama.cpp main with no user configuration needed, covering k-quants plus Q4, Q5, Q8, and MXFP4 types.
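
The four-values-per-u32 idea from the first insight can be sketched in a few lines. This is an illustration only, not the PR's WGSL kernel: the packing (one 4-bit value in the low nibble of each byte) and the function names pack_words, extract_one_per_load, and extract_four_per_load are assumptions made for the example, and real k-quant blocks also carry scales and a more involved layout.

```python
import struct


def pack_words(data):
    """View a little-endian byte buffer as a list of u32 words."""
    assert len(data) % 4 == 0
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


def extract_one_per_load(words, count):
    """Baseline: every extracted value re-reads the word that holds it."""
    values, loads = [], 0
    for i in range(count):
        word = words[i // 4]          # one word load per value
        loads += 1
        values.append((word >> (8 * (i % 4))) & 0x0F)
    return values, loads


def extract_four_per_load(words, count):
    """Sketch of the pattern: load a word once, pull out all four lanes."""
    assert count % 4 == 0
    values, loads = [], 0
    for w in range(count // 4):
        word = words[w]               # one word load per four values
        loads += 1
        for lane in range(4):
            # Hypothetical layout: low nibble of each byte is one value.
            values.append((word >> (8 * lane)) & 0x0F)
    return values, loads


if __name__ == "__main__":
    words = pack_words(bytes(range(16)))
    print(extract_one_per_load(words, 16))
    print(extract_four_per_load(words, 16))
```

On a 16-byte buffer both functions return the same values, but the second reads 4 words where the first reads 16, which is the kind of memory-read reduction the summary attributes to the rewrite.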

Why this matters

WebGPU is llama.cpp's GPU path in browsers and one option on systems without Metal or CUDA, so kernel throughput directly determines which quantized models are usable at interactive speeds. Gains of 3.27x to 3.78x on Q3_K across both Qwen3.5 4B and Gemma4 E4B on M2 Pro suggest the improvement carries across model architectures rather than reflecting a single test case. The four-values-per-u32 extraction pattern also gives a template for follow-on optimization of IQ-series and other quantization types still on the older single-value codepath.

Summary

A WebGPU kernel rewrite merged into llama.cpp on June 8 delivers 1.33x to 3.78x prefill speedups for k-quantized models, with no configuration changes required. Author yomaytk changed matrix multiplication in the WebGPU backend to extract four quantization values per u32 element instead of one, cutting memory reads. The patch also eliminates code duplication across Q4, Q5, Q8, and MXFP4 implementations. In short, yomaytk's kernel rewrite in ggml-org/llama.cpp takes effect immediately for all WebGPU inference users.

  • Q3_K on Gemma4 E4B peaks at 3.78x at pp512 on M2 Pro (79.06 to 298.73 t/s).
  • Q3_K on Qwen3.5 4B gains 3.27x (92.54 to 302.24 t/s) on the same hardware.
  • Q4_K through Q6_K improve 1.33x to 1.52x across Qwen3.5 and Gemma4 variants.
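
To make the Q3_K figures concrete, the quoted speedups can be recomputed from the before and after throughput, and converted into an approximate wall-clock prefill time for a 512-token prompt. This is a back-of-the-envelope sketch: the helper names are made up for the example, and it assumes throughput stays constant over the whole prompt, which the source does not state.

```python
PROMPT_TOKENS = 512  # pp512: prefill over a 512-token prompt

# (before, after) prefill throughput in tokens per second, as quoted in the summary
RESULTS = {
    "Q3_K / Gemma4 E4B": (79.06, 298.73),
    "Q3_K / Qwen3.5 4B": (92.54, 302.24),
}


def speedup(before_tps, after_tps):
    """Ratio of new to old throughput."""
    return after_tps / before_tps


def prefill_seconds(tps, tokens=PROMPT_TOKENS):
    """Approximate time to process the prompt, assuming constant throughput."""
    return tokens / tps


if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        print("%s: %.2fx, %.2f s -> %.2f s" % (
            name,
            speedup(before, after),
            prefill_seconds(before),
            prefill_seconds(after),
        ))
```

Under that assumption, the Gemma4 E4B case drops from roughly 6.5 s to about 1.7 s of prefill for a 512-token prompt.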

Potential risks and opportunities

Risks

  • Gains were validated only on M2 Pro via browser WebGPU; AMD and Nvidia WebGPU driver differences could surface precision or correctness regressions not caught before the merge.
  • The author disclosed using AI to confirm k-quant structures; if that verification missed an edge case, follow-on PRs building on this refactor could propagate a structural assumption error across more quantization types.
  • Q4_K through Q6_K gains of 1.33x to 1.52x are modest enough that some configurations on non-M2 hardware could show regressions when benchmarked, which may draw community pushback against the broader refactor.

Opportunities

  • Browser-based LLM inference frameworks can immediately update their llama.cpp dependency to capture Q3_K speedups of 3.27x to 3.78x, making larger quantized models viable for in-browser real-time prefill.
  • The four-values-per-u32 extraction pattern is directly portable to other ggml WebGPU kernels, giving the ggml-org team a proven template for follow-on PRs targeting IQ-series and other unoptimized quantization types.
  • Gemma4 and Qwen3.5 developers targeting WebGPU deployment now have concrete M2 Pro baselines (298-302 t/s for Q3_K at pp512) to guide quantization selection for real-time use cases.

What we don't know yet

  • Performance on non-Apple WebGPU hardware (Windows and Linux with AMD or Nvidia GPUs) was not benchmarked, leaving cross-platform gain estimates unverified.
  • Whether IQ-series quantization types such as IQ1_S and IQ2_XS will receive the same four-value extraction treatment is not addressed in this PR.
  • The contributor disclosed using AI to confirm k-quant data structures, but no secondary human review of those structural assumptions is documented.