reddit.com via Reddit May 23rd 2026

Gemma4 26B Apex Quant Hits 38 tok/s on RX 9060 XT

open source, inference, edge ai, local-llm, amd

Key insights

mudler's APEX-I-Compact quant fits Gemma4 26B-A4B into roughly 15GB of VRAM and runs at a reported 38 tok/s on AMD's RX 9060 XT 16GB.
A 90,000-token context window was reached with no reported quality loss or looping artifacts on RDNA4 consumer hardware.
llama.cpp's AMD backend delivers high-throughput Gemma4 MoE inference without Nvidia CUDA, in one of the first public reports at this scale.

Why this matters

Consumer-grade AMD GPUs can now run Google's flagship open MoE model at competitive throughput, shifting the hardware calculus for on-premise AI deployment away from Nvidia-only solutions. If the 90K-context result at 38 tok/s on a sub-$400 GPU holds up, enterprise-grade context lengths are within reach of individuals, small teams, and cost-constrained organizations running local inference. Quantization formats like APEX-I-Compact suggest that model compression can preserve functional quality while fitting aggressive VRAM budgets, the constraint that defines most edge and prosumer deployments today.

Summary

Google's Gemma4 26B, a mixture-of-experts model, is running at 38 tokens per second with a 90,000-token context window on an AMD RX 9060 XT 16GB consumer GPU, according to a benchmark posted to r/LocalLLaMA. The result comes from mudler's APEX-I-Compact quantization, which packs the 26B-A4B model into roughly 15GB of VRAM. Inference runs through llama.cpp, and the community reporter notes no looping artifacts and no perceptible quality loss at full context capacity.

In short, Google (Gemma4), AMD (RDNA4 / RX 9060 XT), and the open-source llama.cpp ecosystem are converging to make large MoE models practical on mid-range consumer hardware.

- mudler's APEX-I-Compact quant fits within the 16GB VRAM ceiling while sustaining throughput that previously required dedicated inference hardware
- 90,000-token context at 38 tok/s is one of the first publicly reported benchmarks for Gemma4 MoE on an RDNA4 GPU
- llama.cpp's AMD backend enables this without proprietary CUDA infrastructure

Enterprise-grade context lengths are no longer gated by data center hardware, and this benchmark is the clearest community evidence yet.
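The figures quoted above allow a rough back-of-the-envelope check. The sketch below is illustrative only and rests on assumptions that the post does not confirm: that "26B" means about 26 billion total parameters, that "15GB" is decimal gigabytes, and that the whole 15GB holds weights. In practice the KV cache for a 90K context shares that budget, so the real average bits per weight would be lower than this ceiling. The function names are hypothetical.

```python
def bits_per_weight_ceiling(vram_gb: float, total_params_billion: float) -> float:
    """Upper bound on average bits per weight, assuming ALL of the VRAM holds weights.

    Assumption: decimal gigabytes (1 GB = 1e9 bytes). KV cache and runtime
    buffers are ignored, so the true figure is lower.
    """
    vram_bits = vram_gb * 1e9 * 8
    return vram_bits / (total_params_billion * 1e9)


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to emit n_tokens at a constant generation rate."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    # 15 GB and 38 tok/s come from the report; 26 (billion) is read off the model name.
    print(f"{bits_per_weight_ceiling(15, 26):.2f} bits/weight at most")  # 4.62
    print(f"{seconds_to_generate(1000, 38):.1f} s for a 1,000-token reply")  # 26.3
```

Under these assumptions the quant averages under about 4.6 bits per weight, which is consistent with an aggressive low-bit format rather than 8-bit or 16-bit weights.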

Potential risks and opportunities

Risks

Community throughput reports lacking reproducible benchmarks could mislead enterprise buyers into deploying underperforming quant configurations at scale before quality regressions surface
If AMD's ROCm driver stack for RDNA4 has stability gaps, organizations adopting the RX 9060 XT for inference workloads could face silent failures or crashes at high context loads in production
mudler's APEX quant series is an unofficial format with no guaranteed maintenance path; any upstream Gemma4 update from Google could break compatibility without a supported migration route

Opportunities

AMD gains a concrete, community-reported inference benchmark to market the RX 9060 XT against Nvidia's RTX 4060 Ti in the growing on-premise and prosumer AI deployment segment
llama.cpp contributors and AMD ROCm engineers have a high-visibility reference workload to optimize further, with community pressure accelerating the RDNA4 inference feedback loop
Quantization toolkit developers building on the APEX-I-Compact approach can use this as a validated template for compressing other large MoE models (Mixtral, DeepSeek) to sub-16GB consumer targets

What we don't know yet

No standardized quality evaluation (MMLU, MT-Bench, or perplexity scores) was published alongside the throughput claim, leaving degradation magnitude unquantified
Whether the 38 tok/s figure holds under sustained multi-turn workloads or degrades during extended 90K-context inference sessions is untested
The stability of llama.cpp's ROCm and Vulkan backends on RDNA4 at production scale remains unverified beyond this single community report
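Two of the open questions above, perplexity and sustained throughput, are simple to compute once the raw numbers exist. The following is a minimal sketch, not the method used in the original post: it assumes you have already collected per-token natural-log probabilities and per-token completion timestamps from your own inference run, and the function names and input shapes are hypothetical.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def windowed_rates(timestamps: list[float], window: int) -> list[float]:
    """Tokens per second over consecutive, non-overlapping windows.

    timestamps[i] is the time in seconds at which token i finished.
    A falling sequence of rates would indicate throughput degrading
    as the context fills.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rates = []
    for start in range(0, len(timestamps) - window, window):
        elapsed = timestamps[start + window] - timestamps[start]
        rates.append(window / elapsed)
    return rates


if __name__ == "__main__":
    # Synthetic data: every token predicted with probability 0.5.
    print(perplexity([math.log(0.5)] * 4))
    # Synthetic data: a perfectly steady 38 tok/s stream of 100 tokens.
    print(windowed_rates([i / 38 for i in range(101)], window=50))
```

Comparing perplexity between the quantized and a higher-precision build on the same text, and comparing the first and last throughput windows of a long session, would address the first two gaps listed above.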

Originally reported by reddit.com

Original headline: r/LocalLLaMA: Gemma4 26B-A4B Apex Quant Hits 38 tok/s at 90K Context on RX 9060 XT 16GB — Zero Quality Degradation Reported