reddit.com via Reddit May 18th 2026

llama.cpp b9200: Default MTP Flags Bottleneck Qwen 3.6 27B on RTX 3090

inference open source inference local-llm

Key insights

On llama.cpp b9200, unsloth's recommended MTP (multi-token prediction) flags for Qwen 3.6 27B actively reduce tokens-per-second throughput.
Replacing the unsloth defaults with the corrected flags from the thread restores throughput for the Hermes agent backend on b9200.
RTX 3090 users running Qwen 3.6 27B in coding-agent workflows are the confirmed affected population, with before/after benchmark data available.

Why this matters

The b9200 llama.cpp update was positioned as a performance gain, but the unsloth MTP flag mismatch means practitioners who followed standard guidance are running below baseline without knowing it. Coding-agent workflows dependent on sustained throughput are directly affected, making this a reliability issue for teams using Qwen 3.6 27B in production inference pipelines. The incident illustrates a compounding risk in the local inference stack: upstream updates, model-specific configurations, and hardware-layer tuning can interact in ways that standard documentation fails to cover.

Summary

The b9200 llama.cpp update has a configuration trap: the default MTP flags from unsloth bottleneck Qwen 3.6 27B on an RTX 3090, cutting into the gains the update advertised. An r/LocalLLaMA benchmark documents before/after tokens-per-second in a production coding-agent workflow on the Hermes backend, along with corrected flag settings that restore throughput. In short, three pieces interact badly: the upstream llama.cpp update, unsloth's model-specific MTP defaults, and the Qwen 3.6 27B model itself.

- Default unsloth MTP flags reduce Qwen 3.6 27B throughput after the b9200 upgrade
- Corrected Hermes backend flags published in the thread restore expected performance
- RTX 3090 is the confirmed affected hardware tier

Practitioners who upgraded and see MTP underperformance should audit flags before assuming hardware limits.
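One way to audit flags is to diff the launch command currently in use against a reference command, such as the one published in the thread. The sketch below is a minimal illustration, not the thread's method: every flag name and value in it is a made-up placeholder, since the actual unsloth defaults and corrected flags are not reproduced in this summary, and the helper names (`parse_flags`, `diff_flags`) are hypothetical.

```python
import shlex


def _is_flag(token):
    """Treat a token as a flag if it starts with '-' and is not a negative number."""
    if not token.startswith("-"):
        return False
    try:
        float(token)
        return False
    except ValueError:
        return True


def parse_flags(argv):
    """Map each flag to its value; bare switches map to True."""
    flags = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if _is_flag(token):
            has_value = i + 1 < len(argv) and not _is_flag(argv[i + 1])
            if has_value:
                flags[token] = argv[i + 1]
                i += 2
                continue
            flags[token] = True
        i += 1
    return flags


def diff_flags(current, reference):
    """Report flags that differ between two parsed flag dictionaries."""
    return {
        "only_current": {k: v for k, v in current.items() if k not in reference},
        "only_reference": {k: v for k, v in reference.items() if k not in current},
        "changed": {
            k: (current[k], reference[k])
            for k in current
            if k in reference and current[k] != reference[k]
        },
    }


if __name__ == "__main__":
    # Placeholder commands: these are NOT real llama.cpp or unsloth flags.
    current_cmd = "--placeholder-depth 4 --placeholder-switch --placeholder-ctx 8192"
    reference_cmd = "--placeholder-depth 2 --placeholder-ctx 8192 --placeholder-other -1"
    report = diff_flags(
        parse_flags(shlex.split(current_cmd)),
        parse_flags(shlex.split(reference_cmd)),
    )
    for section, entries in report.items():
        print(section, entries)
```

A diff like this only shows where two configurations disagree; whether a given difference explains a slowdown still has to be confirmed by benchmarking each configuration.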

Potential risks and opportunities

Risks

Practitioners relying on unsloth's default MTP flags after upgrading to b9200 may run Qwen 3.6 27B coding agents at degraded throughput for weeks if the community post goes unseen
Teams benchmarking b9200 against advertised gains and seeing underperformance may incorrectly attribute the issue to hardware or model quality, prompting unnecessary hardware upgrades or model switches
If the flag mismatch extends to other Qwen model sizes or MTP-capable models beyond Qwen 3.6 27B, the affected user base could be substantially larger than current reporting suggests

Opportunities

llama.cpp front-end tooling projects (LM Studio, Ollama, Jan) can differentiate by surfacing flag validation warnings when users upgrade to b9200 with Qwen MTP configurations
unsloth has an opening to publish a b9200-specific tuning guide for Qwen 3.6 27B that re-establishes their documentation as the authoritative source for MTP configuration on updated llama.cpp builds
Benchmark infrastructure providers and local inference hosting services can attract RTX 3090 users by offering pre-tuned b9200 configurations for Qwen 3.6 27B coding-agent workflows

What we don't know yet

Whether unsloth has acknowledged the flag mismatch and plans to update official Qwen 3.6 27B MTP recommendations specifically for the b9200 release
Whether the throughput regression affects GPU tiers beyond RTX 3090, such as RTX 4090 or A100-class hardware running the same MTP configuration
Exact magnitude of the throughput gap: before/after figures are referenced in the post but not surfaced in available public summaries
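Because the post's before/after figures are not surfaced here, readers who want to size the gap on their own hardware can compute it from their own runs. The following is a rough sketch under stated assumptions: each run is recorded as a (tokens generated, elapsed seconds) pair, the median is used to damp outliers, and all numbers in the example are invented placeholders rather than results from the thread. The function names are hypothetical.

```python
from statistics import median


def tokens_per_second(tokens_generated, elapsed_seconds):
    """Throughput of a single generation run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def median_throughput(runs):
    """Median tokens/sec over a list of (tokens, seconds) runs."""
    if not runs:
        raise ValueError("at least one run is required")
    return median(tokens_per_second(t, s) for t, s in runs)


def percent_change(baseline_tps, candidate_tps):
    """Relative change of candidate versus baseline, in percent (negative means slower)."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


if __name__ == "__main__":
    # Placeholder measurements, NOT figures from the Reddit post.
    baseline_runs = [(512, 16.0), (512, 16.0), (512, 12.8)]
    candidate_runs = [(500, 20.0), (500, 25.0), (500, 20.0)]

    base = median_throughput(baseline_runs)
    cand = median_throughput(candidate_runs)
    print(f"baseline:  {base:.2f} tok/s")
    print(f"candidate: {cand:.2f} tok/s")
    print(f"change:    {percent_change(base, cand):+.1f}%")
```

Running the same prompt set and generation length under each flag configuration keeps the comparison fair; mixing prompt lengths between configurations would blur the result.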

Originally reported by reddit.com

Original headline: r/LocalLLaMA: b9200 llama.cpp Update Benchmarked — Default MTP Flags Bottleneck Qwen 3.6 27B on RTX 3090, Tuned Settings Restore Throughput