huggingface.co via Reddit

Unsloth Quantizes GLM-5.2's 1.51TB to 217GB for Local Inference

open source inference

TL;DR

  • Unsloth AI released GGUF quantizations of GLM-5.2 (754B params), compressing the 1.51 TB BF16 model to 217 GB at 1-bit.
  • GLM-5.2 scores 99.2 on AIME 2026 and 91.2 on GPQA-Diamond per the Hugging Face model card.
  • The quantizations support llama.cpp, Ollama, LM Studio, and vLLM under the base model's MIT license.

Running a 754-billion-parameter model on local hardware has been, for most practitioners, a hypothetical. The full-precision BF16 version of GLM-5.2 weighs in at 1.51 TB, which puts it firmly in the datacenter tier. Unsloth AI's GGUF release changes the arithmetic: a 2-bit quantized variant comes in at 238 GB, and the 1-bit version reaches 217 GB, making the model at least theoretically runnable on a multi-GPU setup without a cloud contract.
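The size figures can be sanity-checked with a few lines of arithmetic. The sketch below converts each file size quoted in this article into an effective bits-per-weight figure and a compression ratio against BF16. It assumes decimal units (1 GB = 10^9 bytes) and treats the whole file as weights, ignoring metadata; the function and variable names are illustrative.

```python
# Effective bits per weight for the sizes quoted in the article.
# Assumptions: decimal units (1 GB = 1e9 bytes), file size ~= weight storage.

PARAMS = 754e9  # parameter count reported for GLM-5.2


def bits_per_weight(size_bytes: float, params: float = PARAMS) -> float:
    """Average number of bits stored per parameter."""
    return size_bytes * 8 / params


# Sizes in GB as quoted in the article (BF16 is 1.51 TB).
SIZES_GB = {"BF16": 1510, "Q8_0": 801, "2-bit": 238, "1-bit": 217}

for name, gb in SIZES_GB.items():
    bpw = bits_per_weight(gb * 1e9)
    ratio = SIZES_GB["BF16"] / gb
    print(f"{name:>6}: {gb:>5} GB  {bpw:5.2f} bits/weight  {ratio:4.1f}x vs BF16")
```

Under these assumptions the "1-bit" file averages roughly 2.3 bits per weight and the 2-bit file roughly 2.5, so the labels are nominal rather than literal. A plausible explanation is that some tensors are kept at higher precision, though the article does not say so.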

The model being compressed performs strongly on the benchmarks its model card reports: 99.2 on AIME 2026, 91.2 on GPQA-Diamond, and 62.1 on SWE-bench Pro. The architecture also includes IndexShare, which the model card says reduces per-token FLOPs by 2.9x at a 1-million-token context length, and an MTP layer that reportedly improves speculative decoding acceptance length by up to 20%.

What makes the release practically useful is breadth of tooling. The quantizations drop into llama.cpp, Ollama, LM Studio, and vLLM immediately, and the MIT license on the base zai-org/GLM-5.2 model means there are no commercial use restrictions layered on top.

The honest caveat is that 217 GB still demands serious hardware, and the model card provides neither minimum hardware specifications nor per-quantization accuracy degradation figures. Compressing a 754B model from BF16 to a nominal 1-bit format is likely to cost some accuracy, and how much is not documented. Teams evaluating this for production use will need to benchmark the quantized variants against their own tasks rather than assume parity with the full-precision scores.
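One way to structure that comparison is to run the same task set through a reference and a quantized variant, then look at per-item regressions as well as the headline accuracy. The sketch below is a minimal illustration, not a benchmark harness: `compare_runs` and `Comparison` are hypothetical names, and the sample outcomes are placeholders, not measurements of GLM-5.2.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Comparison:
    reference_accuracy: float
    quantized_accuracy: float
    regressions: int   # reference correct, quantized wrong
    improvements: int  # quantized correct, reference wrong


def compare_runs(reference_correct: Sequence[bool],
                 quantized_correct: Sequence[bool]) -> Comparison:
    """Compare per-item pass/fail results from two runs over the same items."""
    if len(reference_correct) != len(quantized_correct):
        raise ValueError("both runs must cover the same items in the same order")
    if not reference_correct:
        raise ValueError("need at least one item")
    n = len(reference_correct)
    pairs = list(zip(reference_correct, quantized_correct))
    return Comparison(
        reference_accuracy=sum(reference_correct) / n,
        quantized_accuracy=sum(quantized_correct) / n,
        regressions=sum(1 for r, q in pairs if r and not q),
        improvements=sum(1 for r, q in pairs if q and not r),
    )


# Placeholder outcomes for four tasks; substitute results from your own evaluation.
ref = [True, True, False, True]
quant = [True, False, False, True]
result = compare_runs(ref, quant)
print(result)
```

Counting regressions separately matters because two variants can post similar accuracy while failing on different items, which is easy to miss when only the aggregate score is compared.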

For organizations with the hardware, this opens a real path to self-hosting. Research labs, universities with GPU clusters, and enterprises with data-residency requirements now have a frontier-tier reasoning model available locally under an MIT license. The quantization options range from 217 GB at 1-bit up to 801 GB for Q8_0, which lets teams trade capability against cost in steps.
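Since the model card gives no minimum hardware figures, a rough lower bound on accelerator count can be estimated from file size alone. The sketch below assumes 80 GB of memory per GPU and a 10% allowance for KV cache and runtime buffers; both numbers are illustrative assumptions, not published requirements, and real needs grow with context length and batch size.

```python
import math


def min_gpus(model_gb: float,
             vram_gb_per_gpu: float = 80.0,     # assumed per-GPU memory
             overhead_fraction: float = 0.10    # assumed KV cache / buffer allowance
             ) -> int:
    """Lower bound on GPUs needed to hold the weights plus a fixed overhead."""
    if vram_gb_per_gpu <= 0:
        raise ValueError("vram_gb_per_gpu must be positive")
    needed_gb = model_gb * (1.0 + overhead_fraction)
    return math.ceil(needed_gb / vram_gb_per_gpu)


# File sizes in GB as quoted in the article.
for name, gb in {"1-bit": 217, "2-bit": 238, "Q8_0": 801}.items():
    print(f"{name:>5}: {gb} GB -> at least {min_gpus(gb)} GPUs at 80 GB each")
```

This only checks that the weights fit in memory. It says nothing about throughput, and setups that offload part of the model to system RAM change the calculation entirely.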

Shared on Bluesky by 1 AI expert

  • Ted Underwood @tedunderwood.com amplified

    @danielesalatti.com

    @unsloth.ai published quantized versions of GLM 5.2 but they haven’t announced it here yet huggingface.co/unsloth/GLM-...
