reddit.com via Reddit May 21st 2026

llama.cpp b9254 restores TG speed, adds Hopper GPU flag


Key insights

llama.cpp b9254 fixes a token-generation regression affecting both MTP and non-MTP models across multiple prior builds.
A new PDL cmake flag unlocks overlapping kernel scheduling on NVIDIA Hopper and Blackwell GPUs for additional latency gains.
One community benchmark recorded a 3% throughput uplift on dual RTX 5060 Ti 16GB cards in tensor-split mode after upgrading.
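To make a figure like the 3% uplift concrete, here is a minimal sketch of how such a percentage could be derived from repeated token-generation measurements taken before and after an upgrade. The function name and the sample values are hypothetical illustrations, not numbers from the original report.

```python
from statistics import mean


def percent_uplift(before_tps, after_tps):
    """Return the percentage change in mean tokens/s between two sets of runs.

    before_tps and after_tps are lists of tokens-per-second readings
    from repeated benchmark runs on the old and new build.
    """
    baseline = mean(before_tps)
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (mean(after_tps) - baseline) / baseline * 100.0


# Hypothetical readings, chosen only to illustrate the arithmetic.
old_build = [49.8, 50.0, 50.2]
new_build = [51.3, 51.5, 51.7]
print(f"uplift: {percent_uplift(old_build, new_build):.1f}%")
```

Averaging several runs per build matters here because a gain of a few percent can be comparable in size to run-to-run noise.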

Why this matters

Local inference practitioners running llama.cpp on multi-GPU setups have been silently losing throughput across recent builds without a clear regression signal, making b9254 a critical update for anyone benchmarking or deploying production-adjacent local models. The PDL flag introduction is strategically significant because it unlocks a GPU scheduling capability that NVIDIA has been promoting for AI workloads on Hopper and Blackwell, and llama.cpp is now one of the first open inference runtimes to expose it at the build level. For founders and technical leaders evaluating on-premise inference hardware, the combination of regression resolution and Blackwell-specific optimization in a single build accelerates the case for RTX 5000-series or H-series deployments over cloud API spend.

Summary

llama.cpp build b9254 resolves a token-generation (TG) regression that had been degrading inference throughput across multiple recent builds. At least one community user reported a 3% throughput gain on dual RTX 5060 Ti 16GB cards running in tensor-split mode. The fix covers both MTP (Multi-Token Prediction) and non-MTP model configurations, meaning the regression was broad enough to affect most local inference setups.

Beyond the regression fix, the build ships a new cmake flag enabling PDL (Programmatic Dependent Launch) for NVIDIA Hopper and Blackwell GPU architectures. PDL allows overlapping kernel scheduling, which reduces latency by letting the GPU begin a dependent kernel before the preceding kernel has fully completed.

The key actors are the ggerganov/llama.cpp community and NVIDIA Hopper and Blackwell GPU users.

Build b9254 restores token-generation rates lost over several prior builds, as reported by one community user on dual RTX 5060 Ti 16GB in tensor-split mode.
The new PDL cmake flag is opt-in and targets Hopper (H100/H200) and Blackwell (B100/B200/RTX 5000-series) hardware only.
PDL-enabled scheduling can deliver latency improvements beyond the regression fix, though gains are hardware-dependent and not yet systematically benchmarked.

For the local inference ecosystem, build-to-build regressions in llama.cpp are a recurring friction point that slows adoption of newer model features without any capability payoff.
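The overlap that PDL enables can be illustrated with a toy timing model. The sketch below is not llama.cpp or CUDA code. It assumes each kernel splits into a prologue that does not depend on the previous kernel's output and a body that does, and that the next kernel's prologue may begin as soon as the previous kernel's body starts. The function names, that split, and the timings are all hypothetical simplifications.

```python
def serial_latency(kernels):
    """Total time when every kernel waits for the previous one to finish.

    kernels is a list of (prologue, body) durations in arbitrary time units.
    """
    return sum(prologue + body for prologue, body in kernels)


def overlapped_latency(kernels):
    """Total time when a kernel's prologue may overlap the previous body.

    Simplifying assumption: the next prologue can start once the previous
    kernel's body has started, but the next body still waits for the
    previous body to finish, since it depends on that output.
    """
    body_start = 0.0
    body_done = 0.0
    for index, (prologue, body) in enumerate(kernels):
        prologue_start = 0.0 if index == 0 else body_start
        prologue_done = prologue_start + prologue
        # The dependent part waits for both its own setup and the prior result.
        body_start = max(prologue_done, body_done)
        body_done = body_start + body
    return body_done


# Three hypothetical kernels: 2 units of setup, 10 units of dependent work.
chain = [(2, 10), (2, 10), (2, 10)]
print(serial_latency(chain), overlapped_latency(chain))
```

In this model the saving equals the prologue time hidden behind earlier kernels, which is one reason real gains would depend on the hardware and on the kernel mix, and why they need to be measured.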

Potential risks and opportunities

Risks

Users who pinned llama.cpp to a pre-regression build may be unaware of cumulative fixes in b9254 and continue benchmarking against a degraded baseline, skewing hardware purchasing decisions.
The PDL flag, if enabled on non-Hopper/Blackwell hardware or misconfigured, could introduce new instability; no formal QA matrix has been published covering all supported GPU generations.
Build-to-build throughput volatility in llama.cpp could push latency-sensitive production users toward more stable but less capable inference runtimes, fragmenting the open-source ecosystem around competing forks.

Opportunities

NVIDIA can use PDL adoption in llama.cpp as a concrete benchmark reference for Hopper and Blackwell datacenter sales, particularly against AMD MI300X deployments where ROCm lacks an equivalent scheduling primitive.
Inference optimization vendors (Nomic, LM Studio, Jan.ai) that ship llama.cpp under the hood can differentiate by fast-tracking b9254 integration and publishing PDL benchmark comparisons before competitors.
Cloud GPU rental providers (RunPod, Vast.ai, Lambda Labs) offering Hopper instances gain a concrete upsell hook: PDL-enabled llama.cpp throughput gains justify premium H100/H200 tier pricing over consumer-grade alternatives.

What we don't know yet

Which specific builds introduced the TG regression and whether a root-cause postmortem has been documented in the llama.cpp issue tracker.
How large the PDL latency gains are across different Hopper and Blackwell SKUs (H100 SXM vs. PCIe, B200 vs. RTX 5090); no quantified results have been reported as of this build.
Whether the tensor-split regression fix also covers ROCm and Metal backends or is limited to CUDA paths.

Originally reported by reddit.com


Original headline: r/LocalLLaMA: llama.cpp Build b9254 Fixes TG Regression Across MTP and Non-MTP Models, Adds PDL Flag for NVIDIA Hopper and Blackwell