reddit.com via Reddit

Qwen3.6-35B-A3B Tops Gemini 2.5 Pro on Terminal-Bench 2.0

alibaba open source inference open-source-models coding-benchmarks inference

Key insights

  • Qwen3.6-35B-A3B scored 24.6% on Terminal-Bench 2.0, surpassing Gemini 2.5 Pro CLI at 19.6% and Qwen3-Coder-480B at 23.9%.
  • The model uses mixture-of-experts architecture activating only ~3B parameters per pass, enabling practical local inference on modest hardware.
  • Terminal-Bench 2.0 evaluates 3-hour autonomous terminal tasks on 32 CPUs and 48 GB RAM, not single-turn code generation.

Why this matters

For AI practitioners evaluating models for agentic coding pipelines, this result challenges the assumption that frontier API access is required to achieve competitive autonomous task performance, since a locally runnable MoE model now leads on a real-execution benchmark. Founders building agent infrastructure should reassess cost models: if a 3B-active-parameter model outperforms 480B dense alternatives on sustained agentic work, the economics of self-hosted agent deployments shift substantially. Technical leaders at companies standardizing on Gemini CLI for developer tooling now have a credible open-weight alternative with a published, third-party-verified benchmark score to evaluate against.

Summary

A 35-billion-parameter mixture-of-experts model from Alibaba's Qwen team just outscored both Google's Gemini 2.5 Pro and Qwen's own 480B-parameter coder model on Terminal-Bench 2.0, a benchmark designed to measure real agentic execution over multi-hour autonomous terminal sessions. Qwen3.6-35B-A3B posted 24.6% (±3.2) on the benchmark's 3-hour autonomous task setting, running against a harness that allocates 32 CPUs and 48 GB RAM per evaluation run. Gemini 2.5 Pro via Gemini CLI scored 19.6%, and Qwen3-Coder-480B on Terminus 2 scored 23.9% — both below the much smaller MoE model. Essentially: (Alibaba Qwen, Google DeepMind) are now competing on agentic execution benchmarks where active parameter count matters more than total model size. - Qwen3.6-35B-A3B activates roughly 3B parameters per forward pass, making local deployment feasible on consumer or prosumer hardware. - Terminal-Bench 2.0 measures sustained autonomous coding and shell execution, not single-turn generation, which is a closer proxy for real-world agent utility. - The result was confirmed by the benchmark co-author, adding credibility beyond self-reported claims. The broader implication is that MoE architecture efficiency is now directly translating to agentic benchmark leadership, compressing the gap between frontier API models and locally runnable open-weight alternatives.

Potential risks and opportunities

Risks

  • Google faces reputational pressure in the developer-tooling market if Gemini CLI continues scoring below open-weight local models on agentic benchmarks over the next 60-90 days without a response submission.
  • Qwen3-Coder-480B customers and API providers who positioned the model as a coding leader now have a directly contradicting public benchmark result from the same benchmark author.
  • Terminal-Bench 2.0 results could be gamed by scaffold tuning rather than model improvement, undermining leaderboard credibility if submissions without reproducible run configs proliferate.

Opportunities

  • Local inference hardware vendors (Lambda Labs, Vast.ai) and quantization tooling providers (llama.cpp, unsloth) can market directly to agent developers now that a top-leaderboard model fits consumer-grade GPU clusters.
  • Agent framework builders (LangChain, CrewAI, AutoGen) have a concrete benchmark to target for integration showcases, and Qwen3.6-35B-A3B becomes the default open-weight reference model for agentic evals.
  • Enterprise AI teams evaluating on-premise agent deployments gain a vendor-independent justification for self-hosted infrastructure over API-dependent Gemini CLI setups, strengthening negotiating leverage with Google on pricing.

What we don't know yet

  • Whether Qwen3.6-35B-A3B's 24.6% score holds across Terminal-Bench 2.0 task categories or is concentrated in specific shell/coding subtasks not yet broken out publicly.
  • Whether Google has submitted updated Gemini 2.5 Pro results under optimized agent scaffolding, since the 19.6% score may reflect Gemini CLI defaults rather than peak model capability.
  • What hardware and quantization configuration was used for the Qwen3.6-35B-A3B run, which would determine whether the local-inference efficiency claim translates to consumer GPU setups.