reddit.com via Reddit May 22nd 2026

llama.cpp b9274 fixes MTP VRAM leak crashing models

open source inference llama-cpp mtp vram local-llm bug-fix

Key insights

llama.cpp build b9274 fixes a VRAM leak in MTP's draft allocation path that caused silent model unloads on some GPUs.
The patch follows b9254's text-generation regression fix earlier the same week, indicating MTP stabilization is ongoing.
Users running Qwen 3.6 or other MTP-capable models on affected hardware should update immediately to prevent mid-session crashes.

Why this matters

Multi-Token Prediction is one of the most consequential throughput improvements for local inference, and silent model unloads with no error output create a failure mode that is nearly impossible for end users to diagnose without GPU monitoring tools. The back-to-back patch releases within a single week signal that MTP's integration surface across consumer GPU configurations is larger than initially scoped, which matters for anyone building production pipelines or evaluation harnesses on top of llama.cpp. Teams shipping local inference products on heterogeneous hardware should treat MTP as still in stabilization and pin build versions explicitly until the regression cadence slows.
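One way to make explicit build pinning enforceable is a small check that compares a pinned tag against the first build known to carry the fix. The sketch below assumes build tags follow the `b<number>` form used in this article (b9254, b9274) and that higher numbers are later builds. The helper names `parse_build_tag` and `is_patched` are hypothetical and not part of llama.cpp.

```python
import re

# Tags of the form "b<number>", as cited in this article (b9254, b9274).
_BUILD_TAG = re.compile(r"^b(\d+)$")

# Assumption: b9274 is the first build with the leak fix, per the article.
MIN_PATCHED_BUILD = "b9274"


def parse_build_tag(tag: str) -> int:
    """Return the numeric part of a tag such as 'b9274'."""
    match = _BUILD_TAG.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized build tag: {tag!r}")
    return int(match.group(1))


def is_patched(pinned_tag: str, minimum_tag: str = MIN_PATCHED_BUILD) -> bool:
    """True if the pinned build is at or after the minimum patched build."""
    return parse_build_tag(pinned_tag) >= parse_build_tag(minimum_tag)


if __name__ == "__main__":
    # Hypothetical pins, checked against the minimum patched build.
    for tag in ("b9254", "b9274"):
        status = "ok" if is_patched(tag) else "predates the leak fix"
        print(tag, status)
```

A check like this can run in CI or at service startup so that a stale pin fails loudly instead of surfacing later as a silent unload.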

Summary

llama.cpp build b9274 patches a VRAM leak in the Multi-Token Prediction (MTP) draft resource allocation path that was silently unloading models mid-session on affected hardware. The bug hit users running MTP-capable models such as Qwen 3.6: GPU memory usage would grow unchecked until the runtime forcibly ejected the model, often after just a few minutes of active use. The fix lands two days after build b9254 addressed a separate text-generation regression, marking a concentrated stabilization push as MTP support matures across diverse GPU configurations. In short, llama.cpp maintainers and Qwen 3.6 users are working through the integration debt that comes with shipping a complex inference feature across heterogeneous consumer and prosumer hardware.

- Build b9274 targets the draft resource allocation path specifically, where MTP pre-allocates VRAM for speculative tokens without releasing it correctly on some configurations.
- The back-to-back patch cadence (b9254, b9274) suggests MTP's stabilization is still active, not concluded.
- Users on affected setups got no error message, just a silent model unload, which made diagnosis difficult without monitoring VRAM directly.

For the local inference community, rapid patch cycles like this are the cost of early access to frontier inference techniques on hardware the developers cannot fully anticipate.

Potential risks and opportunities

Risks

Teams running automated evaluation pipelines or multi-user inference servers on pre-b9274 builds face silent throughput degradation as models unload and reload without triggering visible alerts.
Qwen 3.6 adopters who pinned an earlier build for stability could remain exposed for weeks if their update cadence follows manual release reviews rather than automated dependency tracking.
If additional MTP allocation bugs surface post-b9274, the rapid patch cadence could erode trust in MTP as a production-ready feature, slowing adoption precisely as hardware support broadens.

Opportunities

GPU monitoring and local inference observability tools (e.g., nvtop-based dashboards, custom VRAM alerting scripts) gain clear user demand as silent unloads expose the gap in local LLM runtime visibility.
Managed local inference platforms (Jan, LM Studio, Msty) can differentiate by shipping automated build-pinning and regression alerts, reducing the manual tracking burden this patch cycle exposes.
Model vendors targeting the local inference market, particularly those releasing MTP-capable weights, can build goodwill by publishing hardware compatibility matrices and minimum recommended llama.cpp build numbers alongside model releases.
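The custom VRAM alerting scripts mentioned under Opportunities could start from something like the following. It is a minimal sketch for NVIDIA GPUs only, since the affected vendors are not known: it polls `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` and flags a sampling window in which usage never drops and grows past a threshold. The function names, the 512 MiB threshold, and the window size are illustrative assumptions, not values taken from the llama.cpp fix.

```python
import subprocess
import time


def parse_used_mib(csv_output: str) -> list[int]:
    """Parse one 'memory.used' integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def read_used_mib() -> list[int]:
    """Query current VRAM usage for every visible NVIDIA GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_used_mib(result.stdout)


def looks_like_leak(samples: list[int], min_growth_mib: int = 512) -> bool:
    """Heuristic: usage never decreases across the window and grows by at
    least min_growth_mib in total. The threshold is an arbitrary assumption."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth_mib


def monitor(interval_s: float = 10.0, window: int = 30, gpu_index: int = 0) -> None:
    """Poll one GPU forever and print a warning when the heuristic fires."""
    history: list[int] = []
    while True:
        history.append(read_used_mib()[gpu_index])
        history = history[-window:]  # keep only the most recent samples
        if len(history) == window and looks_like_leak(history):
            print(f"warning: VRAM grew {history[-1] - history[0]} MiB "
                  f"over {window} samples without dropping")
        time.sleep(interval_s)
```

Calling `monitor()` starts the polling loop. The heuristic is deliberately crude: normal context growth during a long generation can also look monotonic, so the threshold and window would need tuning per workload before this is used for alerting.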

What we don't know yet

Which specific GPU vendors or driver versions trigger the leak, and whether the fix covers all affected configurations or only the reported subset.
Whether b9274 fully resolves MTP instability or if additional allocation path issues remain unpatched as of May 22, 2026.
Whether downstream projects and frontends (Ollama, LM Studio, llama-server wrappers) have already pulled b9274 or are still shipping the leaking build.

Originally reported by reddit.com

Original headline: r/LocalLLaMA: llama.cpp Build b9274 Patches Critical MTP VRAM Memory Leak Causing Spontaneous Model Unloads