reddit.com via Reddit May 18th 2026

Luce DFlash Triples Prefill Speed on AMD RX 7900 XTX

amd inference open source local-llm amd-gpu benchmarks

Key insights

Luce's DFlash + PFlash PR achieves 2.24x decode and 3.05x prefill gains over stock llama.cpp HIP on a single AMD RX 7900 XTX.
This is the first community-reproduced benchmark of the optimization on Navi 31 silicon, lending independent credibility to the claimed speedups.
The optimization remains a pending PR against llama.cpp HIP and is not yet available in any official release.

Why this matters

AMD GPU owners running large local models on llama.cpp have historically accepted noticeably lower tokens per second than owners of comparable Nvidia hardware. A 2-3x speedup over the stock HIP backend narrows that gap and changes the cost-per-token calculus for prosumer builders choosing between platforms, although the benchmark compares against stock llama.cpp HIP rather than against an Nvidia card. For founders building inference infrastructure on commodity hardware, a validated AMD speedup of this size reduces lock-in to Nvidia and expands the pool of affordable compute. For the open-source inference ecosystem, a community-reproduced result this strong puts pressure on the llama.cpp maintainers to merge or formally evaluate the PR, which could quickly improve AMD's standing in local inference benchmarks.

Summary

AMD's prosumer GPU camp just got a meaningful performance boost for local LLM inference. A community benchmark on a single RX 7900 XTX shows Luce's DFlash + PFlash pull request (#119) delivering 2.24x faster decode and 3.05x faster prefill than baseline llama.cpp HIP when running Qwen3.6-27B. It is the first published user reproduction of the optimization on Navi 31 hardware.

The gap between AMD and Nvidia on llama.cpp has long been a friction point for users who chose AMD for cost or availability reasons. Luce's optimization targets the memory access patterns that typically bottleneck attention computation, and the numbers suggest it closes a substantial portion of that historical deficit without requiring new hardware.

In short, Luce's work narrows the local inference gap for AMD on prosumer hardware:

- 2.24x decode speedup and 3.05x prefill speedup measured on a single RX 7900 XTX running Qwen3.6-27B
- The result is a community reproduction, not a vendor benchmark, with full hardware specs and methodology posted
- The optimization ships as a PR (#119) against llama.cpp HIP, meaning it requires building from source or waiting for a merge

If the PR lands in mainline llama.cpp, a large installed base of AMD GPU owners running 20B-plus parameter models stands to benefit without any hardware change.
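Because a request spends time in both phases, the two headline figures combine into a single end-to-end speedup that depends on the mix of prompt and output tokens. The sketch below is a minimal illustration of that arithmetic and is not part of Luce's PR. The function name and the baseline throughput figures in the usage example are hypothetical placeholders; only the 3.05x and 2.24x gains come from the benchmark.

```python
def end_to_end_speedup(prompt_tokens, output_tokens,
                       base_prefill_tps, base_decode_tps,
                       prefill_gain=3.05, decode_gain=2.24):
    """Return (baseline_seconds, patched_seconds, overall_speedup).

    prefill_gain and decode_gain default to the speedups reported in the
    community benchmark. The baseline tokens-per-second values are inputs
    the caller must measure; none are given in the summary above.
    """
    if prompt_tokens + output_tokens <= 0:
        raise ValueError("request must contain at least one token")

    # Time spent per phase = tokens / throughput.
    base = prompt_tokens / base_prefill_tps + output_tokens / base_decode_tps
    fast = (prompt_tokens / (base_prefill_tps * prefill_gain)
            + output_tokens / (base_decode_tps * decode_gain))
    return base, fast, base / fast


if __name__ == "__main__":
    # Hypothetical baseline throughput (placeholders, NOT measured values).
    base, fast, gain = end_to_end_speedup(
        prompt_tokens=8000, output_tokens=500,
        base_prefill_tps=400.0, base_decode_tps=20.0,
    )
    print(f"baseline {base:.1f}s, patched {fast:.1f}s, overall {gain:.2f}x")
```

The overall figure is a time-weighted blend of the two gains, so it falls between 2.24x and 3.05x. Prompt-heavy workloads land nearer the prefill figure, and generation-heavy ones land nearer the decode figure.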

Potential risks and opportunities

Risks

If the PR is not merged into mainline llama.cpp, AMD users must indefinitely maintain a fork, creating fragmentation and a support burden that could erode community adoption
Benchmark reproducibility risk: a single community result on one unit of Navi 31 hardware may not generalize across driver versions or VRAM configurations, and overclaiming could backfire if follow-up tests show narrower gains
Nvidia could respond by accelerating CUDA-targeting optimizations in competing runtimes (e.g., vLLM, ExLlamaV2), widening the gap again before AMD's gains reach mainstream users
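One way to reduce the reproducibility risk listed above is to report a speedup from several repeated runs per build rather than a single number. The following is a minimal sketch of such a summary, assuming the caller has already collected tokens-per-second readings for the stock and patched builds. The function name and the sample readings are hypothetical and do not come from the Reddit post.

```python
from statistics import median


def speedup_summary(baseline_tps, patched_tps):
    """Summarize patched/baseline throughput ratios across repeated runs.

    Both arguments are lists of positive tokens-per-second readings, one
    per benchmark run. Returns the ratio of medians plus pessimistic and
    optimistic bounds, which show how much run-to-run noise could move the
    headline figure.
    """
    if not baseline_tps or not patched_tps:
        raise ValueError("need at least one run for each build")

    return {
        # Typical case: median patched run vs. median baseline run.
        "typical": median(patched_tps) / median(baseline_tps),
        # Pessimistic: slowest patched run vs. fastest baseline run.
        "worst_case": min(patched_tps) / max(baseline_tps),
        # Optimistic: fastest patched run vs. slowest baseline run.
        "best_case": max(patched_tps) / min(baseline_tps),
    }


if __name__ == "__main__":
    # Hypothetical readings for illustration only.
    stock = [9.0, 10.0, 11.0]
    patched = [20.0, 22.0, 24.0]
    for name, ratio in speedup_summary(stock, patched).items():
        print(f"{name}: {ratio:.2f}x")
```

A wide distance between the worst-case and best-case ratios would suggest the result is sensitive to driver version, thermals, or other run-to-run variation and deserves more runs before being quoted.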

Opportunities

AMD could cite this community result in developer relations outreach to accelerate enterprise and prosumer adoption of RX 7900-series cards for local inference workloads
Inference runtime projects targeting AMD (ExLlamaV2, MLC-LLM, Ollama) can port or adapt the DFlash/PFlash technique to capture the same gains and differentiate on AMD hardware support
Prosumer PC builders and small inference hosting operations can build from the PR branch today to test whether the reported 2-3x throughput improvement reproduces on their existing RX 7900 XTX hardware, with no additional hardware cost

What we don't know yet

Whether Luce's PR #119 has been reviewed or acknowledged by llama.cpp core maintainers, and what the merge timeline looks like as of May 2026
Whether the 2.24x and 3.05x gains hold across other large models (e.g., Llama 3 70B, Mistral 22B) or are specific to Qwen3.6-27B's architecture
Whether similar DFlash/PFlash optimizations are applicable to other RDNA 3 GPUs below the 7900 XTX, such as the RX 7800 XT (Navi 32) or RX 7600 (Navi 33)

Originally reported by reddit.com

Original headline: r/LocalLLaMA: Luce DFlash + PFlash on RX 7900 XTX Delivers 2.24× Decode and 3.05× Prefill Over llama.cpp HIP on Qwen3.6-27B