scmp.com web signal

Zhipu AI's GLM-5.2 Outscores GPT-5.5 on Coding Benchmarks


TL;DR

  • GLM-5.2 scored 62.1 on SWE-bench Pro, outperforming GPT-5.5's 58.6, and ranks as the top open-weights model on Artificial Analysis's Intelligence Index.
  • Zhipu's GLM Coding Plan subscription reportedly costs a tenth of what Anthropic charges for its Claude Code and Claude Max tiers.
  • Former Meta and Google DeepMind VP Matt Velloso called GLM-5.2 the first open model reliable enough to use as a daily coding driver.

Beijing-based Zhipu AI, known internationally as Z.ai, released GLM-5.2 on June 13, and the South China Morning Post reports that it has drawn comparisons to the original DeepSeek release of roughly 18 months earlier. The reaction from Silicon Valley practitioners has been unusually direct. Matt Velloso, a former vice-president at Meta Platforms and Google DeepMind, said he had been using GLM-5.2 "all day" and called it "the first open model that passes the bar as a daily driver," adding that it was "more to the point" than GPT-5.5 and "doesn't talk too much, doesn't go in circles trying to explain itself, just does the job."

The benchmark numbers give that endorsement some grounding. GLM-5.2 scored 62.1 on SWE-bench Pro against GPT-5.5's 58.6, and landed within one percentage point of Claude Opus 4.8 on FrontierSWE, a benchmark measuring long-horizon task completion. The model's context window expanded from 200,000 to one million tokens, with Z.ai specifically targeting long-context training. Independent benchmarker Artificial Analysis ranks it the top open-weights model globally and fourth overall on its Intelligence Index, behind Claude Fable 5, Opus 4.8, and GPT-5.5.
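To put the SWE-bench Pro margin in concrete terms, the short sketch below works out the gap between the two reported scores, both in absolute points and relative to GPT-5.5's score. The function name `score_gap` is purely illustrative, and the only inputs are the two figures quoted above; nothing here is independently verified.

```python
# SWE-bench Pro scores as quoted in the article (not independently verified).
GLM_5_2_SCORE = 62.1
GPT_5_5_SCORE = 58.6


def score_gap(a: float, b: float) -> tuple[float, float]:
    """Return (absolute gap in points, gap as a percentage of b), rounded to 1 decimal."""
    absolute = a - b
    relative = absolute / b * 100
    return round(absolute, 1), round(relative, 1)


abs_gap, rel_gap = score_gap(GLM_5_2_SCORE, GPT_5_5_SCORE)
print(f"Absolute gap: {abs_gap} points; relative gap: {rel_gap}%")
```

On the reported figures this is a lead of 3.5 points, or roughly 6 percent relative to GPT-5.5's score: a real but modest margin, which is part of why independent reproduction matters.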

The cost dimension is what turns a benchmark win into a structural story. Zhipu's GLM Coding Plan subscription reportedly costs a tenth of what Anthropic charges for its Claude Code and Claude Max tiers, and the model is released under an MIT license, meaning teams can self-host without licensing fees. That pairing echoes what made the original DeepSeek release disruptive: competitive performance at a fraction of the cost.

The honest caveat is that a practitioner endorsement and a benchmark score are not the same as verified enterprise reliability. What the reporting does not settle is how much hardware self-hosting actually requires, whether the SWE-bench Pro score has been independently reproduced, or whether teams in regulated industries can use the Z.ai API without data routing concerns.

Small development teams priced out of the major frontier subscriptions are the clearest near-term beneficiaries if the performance holds up in production.