Deep Learning: a 744B open model is now runner-up to the closed frontier

GLM-5.2 took the open-weights crown under MIT. A subquadratic-attention model claimed a 12M-token context. A 3B reasoner matched flagship models on AIME.


The architecture story this week is concentration breaking up. A 744B-parameter MIT-licensed model from Beijing is now the strongest open-weights system on the Artificial Analysis Intelligence Index, and it sits second on a live coding leaderboard behind only the export-controlled Claude Fable 5. Underneath it, two efficiency bets matured: a subquadratic-attention model claiming a 12M-token context, and a Rust GPU-kernel stack that is competitive with vLLM at batch-1 decode while ruling out data races at compile time. The frontier is still proprietary, but the gap between it and what a practitioner can self-host has not been this thin.


Key Takeaways

The open-weights frontier is one model behind the closed one. GLM-5.2 (744B total / 40B active MoE) scores 51 on the Artificial Analysis Intelligence Index and leads all open-weights models — ahead of MiniMax-M3 and DeepSeek V4 Pro (both 44) — while ranking 2nd on Code Arena WebDev behind only Claude Fable 5, the model Washington just export-controlled. The thing you can download is now the runner-up.

Subquadratic attention shipped a benchmark, not a slide. SubQ 1.1 Small claims linear-scaling attention via Subquadratic Sparse Attention, 98% needle-in-a-haystack at 12M tokens, and 56× over FlashAttention-2 on a single attention layer — pending independent replication.

Memory safety reached the GPU kernel with almost no throughput tax. NVIDIA's cuTile Rust reaches 96% of cuBLAS GEMM throughput on a B200 while extending Rust's ownership model to tile kernels, making them data-race-free by construction.

Parameter count is decoupling from reasoning on verifiable tasks. VibeThinker-3B hits 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6, which the authors report as parity with flagship models many times its size, a lift they attribute to post-training, not scale.

On-device inference is now a first-class Apple framework, not a hack. Core AI replaces Core ML for transformers, spanning 3B–70B models with AOT compilation and a zero-copy Swift API across CPU/GPU/Neural Engine.

The Big Story

GLM-5.2 becomes the leading open-weights model on the Artificial Analysis Intelligence Index · Artificial Analysis · 2026-06-17
Z.ai's GLM-5.2, a 744B-total, 40B-active sparse MoE with a 1M-token context, scores 51 on the Intelligence Index and leads every open-weights system, ahead of MiniMax-M3 and DeepSeek V4 Pro (both 44) and Kimi K2.6 (43). The architectural lesson for anyone training MoE models: at the same active-parameter inference cost, the capability headroom this cycle came from the post-training mix, not a bigger parameter budget. The catch is what you do with it. The MIT license gives you the weights, and hosted access runs roughly $1.40/$4.40 per million input/output tokens (per Simon Willison), but a 753B-parameter, 1.51TB file is a heavy thing to stand up and serve yourself.
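
For a rough sense of why the file is heavy, the arithmetic can be sketched as below. It assumes the 1.51TB figure corresponds to about 2 bytes per parameter (16-bit weights) and that quantized variants shrink linearly with bit width; neither is stated in the source, and the function name is illustrative. It counts weights only and ignores KV cache, activations, and quantization metadata.

```python
def weight_footprint_tb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e12


PARAMS_B = 753  # parameter count quoted for the downloadable file

# Assumed bit widths; only the 16-bit case is checked against the quoted size.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_footprint_tb(PARAMS_B, bits):.2f} TB")

# Share of the 744B total parameters that is active per token (40B).
active_fraction = 40 / 744
print(f"active per token: ~{active_fraction:.1%}")
```

The 16-bit case lands at about 1.51 TB, consistent with the quoted file size. The active share (a little over 5%) governs per-token compute, but every expert still has to sit in memory somewhere, which is why serving cost tracks the total rather than the active count.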


Also This Week

Subquadratic ships SubQ 1.1 Small with linear-scaling attention and a 12M-token context · Subquadratic · 2026-06-16
Subquadratic Sparse Attention compresses attention to just 0.13% of token relationships yet holds 98% needle-in-a-haystack retrieval at 12M tokens, with a claimed 56× speedup over FlashAttention-2 and 64.5× less compute than dense attention at 1M tokens — if it survives independent replication, it reframes long context as a sparsity-routing problem rather than a hardware-bandwidth one.
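
To make the scaling claim concrete, here is a back-of-envelope count of query-key pairs. It assumes the 0.13% figure applies uniformly across queries at the 12M-token context, which the announcement as summarized here does not specify, and it counts scored pairs only, not the cost of choosing which pairs to keep. It is not Subquadratic's algorithm, and the function names are illustrative.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full (non-causal) dense attention: n^2."""
    return n_tokens * n_tokens


def kept_pairs(n_tokens: int, kept_fraction: float) -> float:
    """Pairs scored if only `kept_fraction` of all relationships survive."""
    return dense_pairs(n_tokens) * kept_fraction


N = 12_000_000   # 12M-token context from the announcement
KEPT = 0.0013    # 0.13% of token relationships (assumed uniform per query)

keys_per_query = kept_pairs(N, KEPT) / N
print(f"dense pairs:    {dense_pairs(N):.3e}")
print(f"kept pairs:     {kept_pairs(N, KEPT):.3e}")
print(f"keys per query: {keys_per_query:,.0f}")
```

Under those assumptions each query attends to roughly 15,600 keys rather than 12 million. If that per-query budget stays roughly fixed as the context grows, the pair count grows linearly in the number of tokens. The claimed 64.5× compute reduction is quoted at 1M tokens and presumably includes selection overhead, so it is not directly comparable to this count.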

Apple's Core AI replaces Core ML as the on-device framework for transformers · InfoQ · 2026-06-20
Core AI runs 3B–70B reasoning models entirely on Apple Silicon with ahead-of-time compilation and a zero-copy, memory-safe Swift API unifying CPU, GPU, and Neural Engine — so on-device quantized inference stops being a llama.cpp side-project and becomes the sanctioned deployment path with no per-token cloud cost.

Z.ai's open weights now trail only Claude Fable 5 on a live coding leaderboard · Simon Willison · 2026-06-17
Willison's hands-on read puts GLM-5.2 2nd on Code Arena WebDev behind only Claude Fable 5 — meaning the single best frontier coding model on that board is the one Washington just restricted, and the practical alternative is a 753B file you can pull from Hugging Face today.


From the Lab

Fearless Concurrency on the GPU (cuTile Rust) · arXiv · 2026-06-14
NVIDIA extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into provably disjoint tiles, so data races become a compile error, not a runtime heisenbug. On a B200 it hits 7 TB/s elementwise and 2 PFlop/s GEMM (96% of cuBLAS), with a paired engine reaching 171 tok/s (Qwen3-4B, RTX 5090) and 82 tok/s (Qwen3-32B, B200) at batch-1 decode — competitive with vLLM and SGLang. For anyone writing custom attention or quantization kernels, it is the first credible argument that memory safety and roofline-bound throughput are not mutually exclusive.
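
Batch-1 decode is usually limited by how fast weights stream from memory, so the tokens-per-second figures can be turned into an implied bandwidth. The sketch below assumes 16-bit weights, nominal parameter counts of 4B and 32B, and one full pass over the weights per generated token; it ignores KV-cache traffic. None of these assumptions come from the paper, and the function name is illustrative.

```python
def implied_bandwidth_tbs(params_billions: float, tokens_per_s: float,
                          bytes_per_param: float = 2.0) -> float:
    """Weight bytes read per second (TB/s) if every token reads all weights once."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return bytes_per_token * tokens_per_s / 1e12


# (nominal parameters in billions, reported tok/s at batch-1 decode)
runs = {
    "Qwen3-4B on RTX 5090": (4, 171),
    "Qwen3-32B on B200": (32, 82),
}
for name, (params_b, tok_s) in runs.items():
    print(f"{name}: ~{implied_bandwidth_tbs(params_b, tok_s):.2f} TB/s")

# Compare the B200 run with the 7 TB/s elementwise figure reported on the same GPU.
b200_share = implied_bandwidth_tbs(32, 82) / 7.0
print(f"B200 decode vs. elementwise figure: ~{b200_share:.0%}")
```

Under these assumptions the B200 decode run streams weights at roughly three quarters of the bandwidth the elementwise kernel reaches, which is the kind of roofline comparison the throughput claim invites.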

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models · arXiv · 2026-06-15
A "Spectrum-to-Signal" post-training paradigm pushes a 3B model to 94.3 on AIME 2026 (97.1 with claim-level test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6, which the authors report as matching or exceeding flagship models such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. The contribution that matters for training practitioners is that the lift comes from post-training, not scale: the recipe is the result. The open question the abstract itself invites is how much of the math score is genuine generalization versus competition-set overlap — a caveat to hold before citing it as settled "frontier parity."


The frontier is still closed and now export-controlled — but the second-best model this week is one you can quantize, shard, and run yourself.

— Alexis