huggingface.co web signal

Z.ai open-sources GLM-5.2, a 753B model with 1M context

TL;DR

  • Z.ai released GLM-5.2 on Hugging Face: a 753B-parameter open-weights model under an MIT license with a 1M-token context window.
  • An IndexShare design reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at 1M context length.
  • Self-reported scores include 62.1 on SWE-bench Pro, 82.7 on Terminal Bench 2.1, 99.2 on AIME 2026, and 91.2 on GPQA-Diamond.

A 753-billion-parameter open-weights model landing on Hugging Face would be notable on its own. What makes Z.ai's GLM-5.2 collection more interesting is the combination it ships with: a 1M-token context window, an MIT license with no regional restrictions, and architecture work aimed squarely at the cost of running long contexts.

The headline architectural move is something the model card calls IndexShare, which "reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length." For anyone planning to serve million-token prompts, that kind of efficiency claim matters more than the parameter count. Z.ai is also shipping an FP8-quantized variant alongside the base weights, and on the collection page the FP8 build is already pulling roughly 844k downloads against 133k for the base model, a sign that the deployable build is what most people came for.
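
To make the IndexShare idea concrete, here is a minimal toy sketch of what "one indexer result per group of four layers" could look like. It assumes the indexer is a top-k selector over key positions, which the source does not confirm, and the names `topk_indices` and `run_layers` are hypothetical. It illustrates only the reuse pattern and does not attempt to reproduce the 2.9× FLOPs figure.

```python
def topk_indices(scores, k):
    """Toy 'indexer': return the positions of the k highest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(scores_per_layer, k, share_group=4):
    """Select key positions for each layer, running the indexer once per group.

    Assumption: layers inside a group reuse the selection computed at the
    group's first layer. This mirrors the model card's wording, not its code.
    """
    selections = []
    indexer_calls = 0
    cached = None
    for layer, scores in enumerate(scores_per_layer):
        if layer % share_group == 0:
            # First layer of a group: pay for the indexer pass.
            cached = topk_indices(scores, k)
            indexer_calls += 1
        # Remaining layers of the group: reuse the cached selection.
        selections.append(cached)
    return selections, indexer_calls


if __name__ == "__main__":
    # Eight toy layers over four key positions (made-up scores).
    layers = [[0.1, 0.9, 0.5, 0.3]] + [[0.9, 0.1, 0.2, 0.8]] * 7
    selections, calls = run_layers(layers, k=2)
    print("indexer passes:", calls)
    print("selection per layer:", selections)
```

In this toy, eight layers need two indexer passes instead of eight. The trade-off is that layers two through four of each group attend to positions chosen from the first layer's scores rather than their own.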

The benchmark sheet leans into coding and agentic work: 62.1 on SWE-bench Pro, 82.7 on Terminal Bench 2.1, 76.8 on MCP-Atlas, plus the eye-catching 99.2 on AIME 2026 and 91.2 on GPQA-Diamond. The model card describes "stronger coding capabilities with multiple thinking effort levels to balance performance and latency," and pairs that with an improved MTP layer for speculative decoding that the authors say lifts acceptance length by up to 20%. Deployment support covers Transformers, vLLM, SGLang, KTransformers, Unsloth, and Ascend NPU.

The honest caveat is that every benchmark number here is self-reported by Z.ai on its own model card, and the practical bar for a 753B model is whether your inference stack can serve it at all. The reporting does not detail the training data, the active-parameter count of the MoE routing, or the cost profile of running the 1M context window in practice. For teams building agentic coding tools or long-document workflows, though, the combination of an MIT license, a same-day FP8 release, and an architecture designed to make long context cheaper is the part of this announcement worth working through this week.
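
As a rough feel for that practical bar, the back-of-the-envelope sketch below estimates weight storage alone from the 753B parameter count. It assumes one byte per parameter for FP8 and two bytes for a 16-bit base checkpoint (the source does not state the base precision), and the 80 GB device size is a hypothetical example. It ignores KV cache, activations, and quantization scale tensors, so real requirements will be higher.

```python
import math

PARAMS = 753e9  # total parameter count from the announcement

# Assumed storage cost per parameter; the base-weight precision is not
# stated in the source, so "bf16" here is an illustrative assumption.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_gb(params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def min_devices(params, dtype, device_gb):
    """Lower bound on devices needed just to hold the weights."""
    return math.ceil(weight_gb(params, dtype) / device_gb)


if __name__ == "__main__":
    for dtype in ("fp8", "bf16"):
        gb = weight_gb(PARAMS, dtype)
        # 80 GB per device is a hypothetical figure for illustration.
        print(f"{dtype}: ~{gb:.0f} GB of weights, "
              f"at least {min_devices(PARAMS, dtype, 80)} x 80 GB devices")
```

Under these assumptions the FP8 build needs about half the weight memory of a 16-bit checkpoint, which is consistent with the FP8 variant drawing most of the downloads.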
