huggingface.co web signal

Qwopus-3.6-35B coder merge claims 62.4% on SWE-bench slice

TL;DR

  • The author reports 62.4% on a 300-case SWE-bench run with thinking disabled, using the Q5_K_M quantization.
  • GGUF quantizations run from Q2_K at 13.2 GB up to Q8_0 at 37.8 GB, targeting single-GPU local deployment.
  • The model is a Qwen3.6-35B-A3B merge with roughly 3B active parameters, tuned for Codex, OpenHands and Claude Code-style agent loops.

This week's entry from the community-merge scene is Qwopus-3.6-35B-A3B-Coder, a thinking-off coding model shipped as GGUF quants on Hugging Face. The pitch runs against the current fashion for longer visible reasoning: the author tuned a Qwen3.6-35B-A3B base to make faster tool-calling decisions and spend fewer tokens narrating internal monologue. The target is repeated agent loops that read files, pick tools, edit code, run tests and iterate.

The headline number is a claimed 62.4% on a 300-case SWE-bench run at the Q5_K_M quantization with thinking disabled. Take that as reported, not settled: it is a slice, it is self-run, and the model card does not describe how cases were selected or how patches were graded. The card also publishes a behavioural scorecard against a thinking-on model it calls Ornith-1.0. Qwopus wins on legit-request compliance (100 vs 70) and multi-turn orchestration (80 vs 70). Ornith holds a lead on engineering competence (94 vs 81) and context-poison resistance (85 vs 70). The reported averages are 82.1 vs 78.9, a gap narrow enough to matter only if the axes are the ones you care about.
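To put the "it is a slice" caveat in numbers, here is a minimal sketch of the sampling uncertainty on a 62.4% rate over 300 cases. It assumes the cases behave like independent pass/fail trials drawn at random, which the model card does not confirm, and it uses the standard Wilson score interval. The function name and constants are illustrative, not taken from the card.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half


# Figures as quoted from the model card. Note that 0.624 * 300 is not a whole
# number, so the rate is used as reported rather than a resolved-case count.
REPORTED_RATE = 0.624
CASES = 300

low, high = wilson_interval(REPORTED_RATE, CASES)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval spans roughly 57% to 68%, so a 300-case slice cannot by itself separate this model from one scoring a few points either side. It says nothing about selection bias in how the cases were chosen, which is a separate open question.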

Why this matters even if you never merge models yourself: the local-coding-agent stack has been waiting for a 35B-class hybrid sparse MoE with roughly 3B active parameters that behaves stably inside Codex, OpenHands and Claude Code-style loops without burning tokens on internal reasoning. Quantizations run from Q2_K at 13.2 GB up to Q8_0 at 37.8 GB, with Q5_K_M at 25.3 GB sitting in the range a single consumer GPU can host. The license is Apache 2.0.
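As a rough check on the single-GPU claim, the sketch below compares the three quant sizes quoted above against a few VRAM budgets. It assumes the GGUF file size approximates the memory needed for weights and that "GB" means the same unit on both sides. The `overhead_gb` allowance for KV cache and buffers is a hypothetical placeholder, not a measured figure.

```python
# Sizes in GB as quoted from the model card (only the three named in this article).
QUANT_SIZES_GB = {"Q2_K": 13.2, "Q5_K_M": 25.3, "Q8_0": 37.8}


def quants_that_fit(vram_gb: float, overhead_gb: float = 2.0) -> list[str]:
    """Return quant names whose size plus an assumed runtime overhead fits in vram_gb.

    overhead_gb is an illustrative guess for KV cache and scratch buffers;
    real usage depends on context length and the runtime.
    """
    budget = vram_gb - overhead_gb
    return [name for name, size in QUANT_SIZES_GB.items() if size <= budget]


for vram in (16, 24, 32, 48):
    print(f"{vram} GB card: {quants_that_fit(vram)}")
```

On these numbers Q5_K_M at 25.3 GB exceeds a 24 GB card even with zero overhead, so hosting the benchmarked quant on one GPU likely means a 32 GB card or offloading part of the model to system memory. The smaller quants fit more widely, but the 62.4% figure was reported only for Q5_K_M.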

The honest caveat is that the benchmark evidence is single-sourced: there are no latency numbers, no tokens-per-task figures, and no direct comparison to the paid frontier reasoning models that most teams weigh local options against. The model card also says too little about the SWE-bench harness setup for anyone to reproduce the score cleanly. If the result holds up under third-party runs, the beneficiaries are teams building local agent harnesses who value throughput and tool-call stability over benchmark-topping reasoning.