huggingface.co web signal

NVIDIA posts NVFP4 GLM-5.2, a 753B MoE built for Blackwell

TL;DR

  • NVIDIA published an NVFP4 4-bit quantization of ZAI's GLM-5.2, a Mixture-of-Experts model with 753B parameters in total and 40B activated.
  • Across GPQA Diamond, SciCode, IFBench, AA-LCR and τ²-Bench Telecom, the NVFP4 build stays within a point of the FP8 baseline, by NVIDIA's own numbers.
  • The model is tested on NVIDIA B200 and B300, supports a 1M-token context, and inherits the MIT license from the base model.

NVIDIA quietly pushed a 4-bit version of ZAI's GLM-5.2 to Hugging Face this week, and the interesting bit is not that the file is smaller but that the accuracy barely moved. On the model card, the NVFP4 build lands within a fraction of a point of the FP8 baseline across five evaluations, and on three of them it actually edges ahead: IFBench at 75.81 versus 74.95, AA-LCR at 70.13 versus 69.38, and τ²-Bench Telecom at 98.25 versus 97.9. GPQA Diamond and SciCode each drop very slightly. Treat those deltas as reported by NVIDIA, not as independently replicated.
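To put the "fraction of a point" claim in concrete terms, here is a minimal sketch that computes the NVFP4-minus-FP8 gap for the three benchmarks whose scores are quoted above. GPQA Diamond and SciCode are left out because no figures for them appear here, and the names `reported` and `deltas` are just illustrative.

```python
# Scores as quoted above, in (NVFP4, FP8) order. These are NVIDIA-reported
# numbers, not independently replicated results.
reported = {
    "IFBench": (75.81, 74.95),
    "AA-LCR": (70.13, 69.38),
    "tau2-Bench Telecom": (98.25, 97.9),
}


def deltas(scores):
    """Return NVFP4 minus FP8 for each benchmark, rounded to two decimals."""
    return {name: round(nvfp4 - fp8, 2) for name, (nvfp4, fp8) in scores.items()}


result = deltas(reported)
for name, delta in result.items():
    print(f"{name}: {delta:+.2f} points")
```

All three gaps come out positive and under one point, which matches the summary in the TL;DR.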

GLM-5.2 itself is a Mixture-of-Experts model with, in NVIDIA's own wording, 753B parameters in total and 40B activated, a context length up to 1M, and a sparse attention setup the card calls an IndexShare indexer. The contribution here is the post-training quantization, done with NVIDIA's Model Optimizer, which compresses the weights and activations of the linear operators inside the MoE expert blocks down to the NVFP4 data type. The shared expert is left unquantized. The card lists NVIDIA B200 and B300 as the test hardware and points users at SGLang and vLLM as the supported runtimes.
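For readers who have not met block-scaled 4-bit formats, the idea behind that compression can be sketched in a few lines of plain Python. This is a toy illustration, not Model Optimizer's actual algorithm: it assumes the commonly described NVFP4 layout of E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) sharing one scale per block of 16 elements, and it keeps that scale as an ordinary Python float rather than the FP8 scale a real implementation would store. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size: elements that share one scale


def quantize_block(block):
    """Map one block to (scale, codes), where each code is a signed grid value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # largest magnitude maps onto 6.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable grid value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def fake_quantize(values):
    """Quantize then dequantize, so the rounding error can be inspected."""
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[start:start + BLOCK])
        out.extend(code * scale for code in codes)
    return out


weights = [i / 10 for i in range(-8, 8)]  # one made-up block of 16 values
restored = fake_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst-case rounding error in this block: {worst:.4f}")
```

The point of the per-block scale is that a single outlier only coarsens the 16 values around it rather than the whole tensor, which is one plausible reason a 4-bit build can stay close to an 8-bit baseline.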

For people who actually run these things in production, the practical reading is that a frontier-scale open-weight reasoning model which had been living in FP8 territory now has a vendor-blessed FP4 path on current NVIDIA silicon, with the MIT license carried over from the base model. That widens the set of teams who can credibly stand up a 753B MoE without renting half a rack.

The honest caveat is that the published numbers are NVIDIA's own, evaluated at temperature 1.0 and top-p 0.95, and the card does not give throughput, memory footprint, or minimum-GPU-count comparisons against the FP8 build. It also does not explain why the shared expert is held back from quantization. Those are the questions to push on before trusting the recipe on your workload.
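Since the card gives no memory figures, the best available substitute is back-of-envelope arithmetic on the weights alone. The sketch below assumes idealized storage of exactly 4 or 8 bits per parameter and ignores scale factors, KV cache, and activations. The `quantized_fraction` parameter is hypothetical: the card does not say what share of the 753B parameters actually ends up in NVFP4, and the 0.9 used below is a made-up value for illustration only.

```python
TOTAL_PARAMS = 753e9  # total parameter count quoted from the model card


def weight_gb(total_params, quantized_fraction, low_bits=4.0, high_bits=8.0):
    """Approximate weight storage in GB (1e9 bytes).

    quantized_fraction is the share of parameters stored at low_bits;
    the remainder is assumed to stay at high_bits.
    """
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be between 0 and 1")
    low = total_params * quantized_fraction * low_bits
    high = total_params * (1.0 - quantized_fraction) * high_bits
    return (low + high) / 8 / 1e9


fp8_all = weight_gb(TOTAL_PARAMS, 0.0)      # every weight at 8 bits
fp4_all = weight_gb(TOTAL_PARAMS, 1.0)      # idealized floor: every weight at 4 bits
fp4_partial = weight_gb(TOTAL_PARAMS, 0.9)  # hypothetical 90% quantized

print(f"all 8-bit:  {fp8_all:.1f} GB")
print(f"all 4-bit:  {fp4_all:.1f} GB")
print(f"90% 4-bit:  {fp4_partial:.1f} GB (hypothetical split)")
```

The two bounds are 753.0 GB and 376.5 GB. The real NVFP4 footprint will sit somewhere above the lower figure, because the shared expert and any other unquantized layers stay at higher precision and the block scales take extra space, so this is a floor on the savings question rather than an answer to it.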

If FP4 inference holds up beyond NVIDIA's benchmark suite, the winners are the teams that want to serve large agentic MoEs on fewer Blackwell GPUs than the FP8 build would need. That covers both hyperscalers buying B200s by the rack and smaller shops renting them by the hour.
