reddit.com via Reddit

Gemma4 26B APEX Quant Hits 38 tok/s on RX 9060 XT


Key insights

  • mudler's APEX-I-Compact quant fits Gemma4 26B-A4B into 15GB VRAM, achieving 38 tok/s on AMD's RX 9060 XT 16GB.
  • A 90,000-token context window was reached with no reported quality loss or looping artifacts on RDNA4 consumer hardware.
  • llama.cpp's AMD backend runs Gemma4 MoE inference at this throughput without Nvidia CUDA, one of the first such results reported on RDNA4.

Why this matters

Consumer-grade AMD GPUs can now run Google's flagship open MoE model at competitive throughput, shifting the hardware calculus for on-premise AI deployment away from Nvidia-only solutions. If the result holds up, a 90K-token context at 38 tok/s on a sub-$400 GPU puts context lengths once associated with data center hardware within reach of individuals, small teams, and cost-constrained organizations running local inference. Quantization formats like APEX-I-Compact suggest that model compression can preserve functional quality while fitting aggressive VRAM budgets, the constraint that defines most edge and prosumer deployments today.

Summary

Google's Gemma4 26B, a mixture-of-experts model, is now running at 38 tokens per second with a 90,000-token context window on an AMD RX 9060 XT 16GB consumer GPU, according to a benchmark posted to r/LocalLLaMA. The result comes from mudler's APEX-I-Compact quantization, which packs the 26B-A4B model into roughly 15GB of VRAM. Inference runs through llama.cpp, and the community reporter notes no looping artifacts and no perceptible quality loss at the full 90,000-token context. In short, Google (Gemma4), AMD (RDNA4 / RX 9060 XT), and the open-source llama.cpp ecosystem are converging to make large MoE models practical on mid-range consumer hardware.

  • mudler's APEX-I-Compact quant fits within the 16GB VRAM ceiling while sustaining throughput that previously required dedicated inference hardware.
  • 90,000-token context at 38 tok/s is one of the first reported benchmarks for Gemma4 MoE on an RDNA4 GPU.
  • llama.cpp's AMD backend enables this without proprietary CUDA infrastructure.

Enterprise-grade context lengths are no longer gated by data center hardware, and this benchmark is the clearest evidence yet.
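
As a rough sanity check on the headline figures, the arithmetic behind "26B parameters in roughly 15GB" and "38 tok/s" can be sketched in a few lines of Python. This is a back-of-envelope estimate, not a description of how APEX-I-Compact actually allocates bits: it assumes decimal gigabytes (10^9 bytes) and treats the whole 15GB as weights, even though the KV cache and runtime buffers presumably share that budget, so the result is only an upper bound on the average bits per weight. The function names are hypothetical helpers introduced for illustration.

```python
def bits_per_weight(vram_gb: float, params_billion: float) -> float:
    """Upper bound on average bits per weight if all of the VRAM held weights.

    Assumption: 1 GB = 10^9 bytes; KV cache and buffers are ignored.
    """
    vram_bits = vram_gb * 1e9 * 8
    return vram_bits / (params_billion * 1e9)


def seconds_to_generate(n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to generate n_tokens at a constant decode rate."""
    return n_tokens / tok_per_s


# Figures taken from the post: ~15GB of VRAM, 26B total parameters, 38 tok/s.
print(f"{bits_per_weight(15, 26):.2f} bits/weight (upper bound)")  # 4.62
print(f"{seconds_to_generate(1000, 38):.1f} s per 1,000 tokens")   # 26.3
```

Under these assumptions the quant averages at most about 4.6 bits per weight, and a 1,000-token reply takes a little over 26 seconds at the reported rate.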

Potential risks and opportunities

Risks

  • Community throughput reports lacking reproducible benchmarks could mislead enterprise buyers into deploying underperforming quant configurations at scale before quality regressions surface
  • If AMD's ROCm driver stack for RDNA4 has stability gaps, organizations adopting the RX 9060 XT for inference workloads could face silent failures or crashes at high context loads in production
  • mudler's APEX quant series is an unofficial format with no guaranteed maintenance path; any upstream Gemma4 update from Google could break compatibility without a supported migration route

Opportunities

  • AMD gains a concrete, community-validated inference benchmark to market the RX 9060 XT against Nvidia's RTX 4060 Ti in the growing on-premise and prosumer AI deployment segment
  • llama.cpp contributors and AMD ROCm engineers have a high-visibility reference workload to optimize further, with community pressure accelerating the RDNA4 inference feedback loop
  • Quantization toolkit developers building on the APEX-I-Compact approach can use this as a validated template for compressing other large MoE models (Mixtral, DeepSeek) to sub-16GB consumer targets

What we don't know yet

  • No standardized quality evaluation (MMLU, MT-Bench, or perplexity scores) was published alongside the throughput claim, leaving degradation magnitude unquantified
  • Whether the 38 tok/s figure holds under sustained multi-turn workloads or degrades during extended 90K-context inference sessions is untested
  • Whether llama.cpp is stable on the RDNA4 ROCm and Vulkan driver stacks at production scale remains unverified beyond this single community report
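
One way to close the sustained-throughput gap listed above would be to record decode speed at several context depths and compare them. The Python sketch below shows only the bookkeeping. The `Sample` class and `relative_drop` function are hypothetical names, and the numbers in the usage example are made-up placeholders, not measurements from the post or from any real run.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    context_tokens: int    # tokens already in context when the sample was taken
    generated_tokens: int  # tokens decoded during the timed window
    elapsed_s: float       # wall-clock seconds for that window

    @property
    def tok_per_s(self) -> float:
        return self.generated_tokens / self.elapsed_s


def relative_drop(samples: list[Sample]) -> float:
    """Fractional throughput drop from the shallowest to the deepest context."""
    ordered = sorted(samples, key=lambda s: s.context_tokens)
    first, last = ordered[0].tok_per_s, ordered[-1].tok_per_s
    return (first - last) / first


# Placeholder values for illustration only; NOT real benchmark data.
samples = [
    Sample(context_tokens=1_000, generated_tokens=380, elapsed_s=10.0),
    Sample(context_tokens=90_000, generated_tokens=304, elapsed_s=10.0),
]
print(f"throughput drop: {relative_drop(samples):.0%}")
```

Real samples would have to come from timed runs on the actual hardware. A drop near zero across depths would support the claim that 38 tok/s holds at 90K context.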