reddit.com via Reddit

llama.cpp b9200 MTP Flags Bottleneck Qwen3 27B on 3090

inference open source inference local-llm

Key insights

  • llama.cpp b9200 exposed a configuration mismatch: unsloth's recommended MTP flags for Qwen 3.6 27B actively reduce tokens-per-second output on this build.
  • The Hermes agent backend sees throughput restored when specific corrected flags replace the unsloth defaults on b9200.
  • RTX 3090 users running Qwen 3.6 27B in coding-agent workflows are the confirmed affected population, with before/after benchmark data available.

Why this matters

The b9200 llama.cpp update was positioned as a performance gain, but the unsloth MTP flag mismatch means practitioners who followed standard guidance are running below baseline without knowing it. Coding-agent workflows dependent on sustained throughput are directly affected, making this a reliability issue for teams using Qwen 3.6 27B in production inference pipelines. The incident illustrates a compounding risk in the local inference stack: upstream updates, model-specific configurations, and hardware-layer tuning can interact in ways that standard documentation fails to cover.

Summary

The b9200 llama.cpp update has a configuration trap: unsloth's default MTP flags bottleneck Qwen 3.6 27B on an RTX 3090, cutting into the gains the update advertised. An r/LocalLLaMA benchmark documents before/after tokens-per-second in a production coding-agent workflow on the Hermes backend, with corrected flag settings that restore throughput. In short, three layers interacted badly: the upstream llama.cpp update, unsloth's model-specific MTP defaults, and the Qwen model itself.

  • Default unsloth MTP flags reduce Qwen 3.6 27B throughput after the b9200 upgrade
  • Corrected Hermes backend flags published in the thread restore expected performance
  • RTX 3090 is the confirmed affected hardware tier

Practitioners who upgraded and see MTP underperformance should audit flags before assuming hardware limits.
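One way to start such a flag audit is to diff the launch command used before the upgrade against the one in use now. The Python sketch below is a minimal illustration, not the thread's method: `--placeholder-a` and `--placeholder-b` are invented stand-ins rather than real llama.cpp options, and the actual corrected flags are only in the original post.

```python
import shlex

BARE = "(set)"       # marker for a switch given without a value
ABSENT = "(absent)"  # marker for a flag missing from one command


def _is_number(token: str) -> bool:
    """True for tokens like -1 or 0.5, so negative values are not read as flags."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_flags(command: str) -> dict[str, str]:
    """Map each -f / --flag token in a command line to its value."""
    tokens = shlex.split(command)
    flags: dict[str, str] = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-") and not _is_number(tok):
            if "=" in tok:  # --flag=value form
                name, value = tok.split("=", 1)
                flags[name] = value
            else:
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                if nxt is not None and (not nxt.startswith("-") or _is_number(nxt)):
                    flags[tok] = nxt
                    i += 1  # skip the value token as well
                else:
                    flags[tok] = BARE
        i += 1
    return flags


def diff_flags(old_command: str, new_command: str) -> dict[str, tuple[str, str]]:
    """Return {flag: (old_value, new_value)} for every flag that differs."""
    old, new = parse_flags(old_command), parse_flags(new_command)
    return {
        name: (old.get(name, ABSENT), new.get(name, ABSENT))
        for name in sorted(old.keys() | new.keys())
        if old.get(name, ABSENT) != new.get(name, ABSENT)
    }


if __name__ == "__main__":
    # Placeholder flags only: these are NOT real llama.cpp options.
    before = "llama-server -m model.gguf --placeholder-a 4 --placeholder-b"
    after = "llama-server -m model.gguf --placeholder-a 2"
    for flag, (old_value, new_value) in diff_flags(before, after).items():
        print(f"{flag}: {old_value} -> {new_value}")
```

Diffing the two commands narrows the audit to the flags that actually changed, which is where a throughput regression is most likely to originate.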

Potential risks and opportunities

Risks

  • Practitioners relying on unsloth's default MTP flags after upgrading to b9200 may run Qwen 3.6 27B coding agents at degraded throughput for weeks if the community post goes unseen
  • Teams benchmarking b9200 against advertised gains and seeing underperformance may incorrectly attribute the issue to hardware or model quality, prompting unnecessary hardware upgrades or model switches
  • If the flag mismatch extends to other Qwen model sizes or MTP-capable models beyond Qwen 3.6 27B, the affected user base could be substantially larger than current reporting suggests

Opportunities

  • llama.cpp front-end tooling projects (LM Studio, Ollama, Jan) can differentiate by surfacing flag validation warnings when users upgrade to b9200 with Qwen MTP configurations
  • unsloth has an opening to publish a b9200-specific tuning guide for Qwen 3.6 27B that re-establishes their documentation as the authoritative source for MTP configuration on updated llama.cpp builds
  • Benchmark infrastructure providers and local inference hosting services can attract RTX 3090 users by offering pre-tuned b9200 configurations for Qwen 3.6 27B coding-agent workflows

What we don't know yet

  • Whether unsloth has acknowledged the flag mismatch and plans to update official Qwen 3.6 27B MTP recommendations specifically for the b9200 release
  • Whether the throughput regression affects GPU tiers beyond RTX 3090, such as RTX 4090 or A100-class hardware running the same MTP configuration
  • Exact magnitude of the throughput gap: before/after figures are referenced in the post but not surfaced in available public summaries
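
Because the thread's figures are not available here, readers may prefer to measure the gap on their own hardware. The following Python sketch shows one plausible way to compare two flag configurations from recorded runs, assuming each run is logged as a (tokens generated, elapsed seconds) pair. The run data is made-up placeholder input, not numbers from the post, and the function names are illustrative.

```python
from statistics import median

Run = tuple[int, float]  # (tokens generated, elapsed seconds)


def tokens_per_second(tokens_generated: int, elapsed_seconds: float) -> float:
    """Throughput of a single generation run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def median_tps(runs: list[Run]) -> float:
    """Median throughput across runs, which damps one-off outliers."""
    if not runs:
        raise ValueError("at least one run is required")
    return median(tokens_per_second(t, s) for t, s in runs)


def percent_change(baseline_tps: float, candidate_tps: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


if __name__ == "__main__":
    # Placeholder measurements only; substitute your own logged runs.
    default_flags_runs = [(512, 32.0), (400, 20.0), (512, 16.0)]
    corrected_flags_runs = [(500, 20.0), (512, 16.0), (300, 20.0)]

    base = median_tps(default_flags_runs)
    cand = median_tps(corrected_flags_runs)
    print(f"default flags:   {base:.1f} tok/s")
    print(f"corrected flags: {cand:.1f} tok/s")
    print(f"change:          {percent_change(base, cand):+.1f}%")
```

Using the median over several runs of the same prompt and context length keeps a single slow or fast run from deciding the comparison.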