artificialanalysis.ai web signal

Zhipu AI's GLM-5.2 Tops Open-Weights Intelligence Index With Score of 51

TL;DR

  • GLM-5.2 scores 51 on the Artificial Analysis Intelligence Index v4.1, beating MiniMax-M3 and DeepSeek V4 Pro max, both at 44.
  • The 744B-parameter model raised its Humanity's Last Exam score by 12 points to 40% and its SciCode score by 7 points to 50%.
  • GLM-5.2 trades efficiency for capability: it uses 43k output tokens per benchmark task and is priced at $4.4 per million output tokens.

Zhipu AI's GLM-5.2 has taken the top position among open-weights models on the Artificial Analysis Intelligence Index v4.1, scoring 51 against MiniMax-M3 and DeepSeek V4 Pro max, both at 44. That seven-point margin on a benchmark where leading proprietary systems still hold the overall top ranks makes it a meaningful result for a model with freely usable weights.

The model behind the score has 744B total parameters, of which 40B are active per token, the same footprint as its predecessor GLM-5.1, but the context window has expanded from 200K to 1M tokens. The capability gains are concentrated in scientific reasoning: CritPt improved 16 points to 21%, Humanity's Last Exam climbed 12 points to 40%, and SciCode rose 7 points to 50%. On GDPval-AA v2, GLM-5.2 reached 1524, comparable to the proprietary GPT-5.5 at 1514.
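
As a rough check on the figures above, the sketch below derives the share of parameters that are active, the context-window growth factor, and the scores implied before each gain. It assumes "points" means percentage points and that the gains are measured against the previous version, which the article implies but does not state outright. The function and constant names are illustrative only.

```python
# Figures quoted in the article.
TOTAL_PARAMS_B = 744        # total parameters, in billions
ACTIVE_PARAMS_B = 40        # active parameters, in billions
CONTEXT_OLD = 200_000       # GLM-5.1 context window, tokens
CONTEXT_NEW = 1_000_000     # GLM-5.2 context window, tokens

# benchmark -> (GLM-5.2 score in %, reported gain in points)
REPORTED = {
    "CritPt": (21, 16),
    "Humanity's Last Exam": (40, 12),
    "SciCode": (50, 7),
}


def active_fraction(active=ACTIVE_PARAMS_B, total=TOTAL_PARAMS_B):
    """Share of the total parameters that are active."""
    return active / total


def implied_baselines(reported=REPORTED):
    """Score before the gain, ASSUMING 'points' means percentage points."""
    return {name: score - gain for name, (score, gain) in reported.items()}


if __name__ == "__main__":
    print(f"active share: {active_fraction():.1%}")
    print(f"context growth: {CONTEXT_NEW // CONTEXT_OLD}x")
    for name, baseline in implied_baselines().items():
        print(f"{name}: {baseline}% -> {REPORTED[name][0]}%")
```

Under those assumptions, roughly 5.4% of the parameters are active, the context window is five times larger, and the implied earlier scores are 5% on CritPt, 28% on Humanity's Last Exam, and 43% on SciCode.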

The honest caveat is token efficiency. According to Artificial Analysis, GLM-5.2 uses 43k output tokens per Intelligence Index task, of which 37k are reasoning tokens, which places it outside the most attractive quadrant of the efficiency-versus-capability trade-off. At $1.4 per million input tokens and $4.4 per million output tokens, that level of token usage means inference costs climb quickly for any team running the model at volume.
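
To make the cost implication concrete, here is a minimal sketch that turns the quoted prices and token counts into a per-task estimate. The article does not give input tokens per task, so `input_tokens` is a caller-supplied parameter, and the sketch assumes reasoning tokens are billed at the output rate, which is common but not confirmed here.

```python
# Figures quoted in the article.
OUTPUT_TOKENS_PER_TASK = 43_000     # total output tokens per index task
REASONING_TOKENS_PER_TASK = 37_000  # portion of the output that is reasoning
INPUT_PRICE_PER_M = 1.4             # USD per million input tokens
OUTPUT_PRICE_PER_M = 4.4            # USD per million output tokens


def task_cost(input_tokens, output_tokens=OUTPUT_TOKENS_PER_TASK):
    """Estimated USD cost of one task.

    ASSUMPTION: reasoning tokens are billed as ordinary output tokens.
    """
    input_cost = input_tokens * INPUT_PRICE_PER_M
    output_cost = output_tokens * OUTPUT_PRICE_PER_M
    return (input_cost + output_cost) / 1_000_000


def reasoning_share():
    """Fraction of the output tokens spent on reasoning."""
    return REASONING_TOKENS_PER_TASK / OUTPUT_TOKENS_PER_TASK


if __name__ == "__main__":
    # Output-only cost, ignoring input tokens entirely.
    print(f"output cost per task: ${task_cost(0):.4f}")
    print(f"reasoning share of output: {reasoning_share():.0%}")
    # Hypothetical workload size, for illustration only.
    tasks = 1_000_000
    print(f"output cost for {tasks:,} tasks: ${task_cost(0) * tasks:,.0f}")
```

On these numbers, output tokens alone cost about $0.19 per task, and roughly 86% of that output is reasoning. A production workload with shorter answers would cost less per call, so the benchmark figure is an illustration of the trade-off and not a forecast.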

For developers who need frontier-class scientific reasoning and can absorb that cost, the MIT license and availability on providers like DeepInfra and Fireworks remove the proprietary access barrier. The AA-Omniscience index also improved, with accuracy rising to 25.1% and the hallucination rate falling to 28.1%. What the benchmark data does not show is how the model performs on real-world task mixes outside the evaluation suite, or whether the efficiency trade-off narrows or widens as production workloads diverge from benchmark conditions. The teams for whom this result matters most are probably those already running scientific or long-context workloads, where capability at any cost beats a cheaper but weaker alternative.