Machine Learning: a 744B model ran across six states on raw WAN

June 21st 2026 · By Alexis Dufresne

The week ML stopped needing a datacenter. Shard served GLM-5.2 (744B) at ~30 tok/s over the open internet across RTX PRO 6000 GPUs in six states. A 3B model from Weibo posted 94.3 on AIME 2026. MLPerf Training 6.0 finally benchmarked MoE pre-training at scale.

Watch & Listen First

Why AI Agents Break the GenAI Security Model — TWIML #770 with Devvret Rishi · TWIML AI Podcast · 2026-06-16
→ Rubrik's GM of AI on why tool-calling agents that write code and update systems at machine speed break governance models built on pre-written rules and one-at-a-time approval prompts.

Key Takeaways

Inference is going WAN-distributed. Pipeline-parallel layer-splitting means no single node holds the whole model — frontier-scale serving without co-location is now a working demo, not a thesis.
Small reasoning models are the real efficiency story. VibeThinker-3B's 94.3 AIME26 from a 3B dense base argues verifiable reasoning compresses into a tiny "reasoning core" — but read the training recipe, not just the leaderboard.
MLPerf Training 6.0 made MoE pre-training a first-class benchmark. DeepSeek-V3 671B and GPT-OSS-20B joined the suite; the benchmark caught up to where production training actually is.
The open-weights frontier is Chinese and cheap. GLM-5.2 is the new top open model at roughly $1.40 per million input tokens, and it's the model the distributed-inference crowd is racing to self-host.

The Big Story

A 744B model served at ~30 tok/s across RTX PRO 6000 GPUs in six US states, over the open internet · GitHub · 2026-06-18
→ Shard splits GLM-5.2 (744B, NVFP4, 78 layers) into contiguous 13-layer blocks, one block per GPU (78 / 13 = 6 blocks). Activations stream between GPUs in Nevada, Texas, Minnesota, Missouri, and Utah, with a coordinator in Washington, over WAN links with 22–75 ms round-trips. The ~30 tok/s comes from pipelined speculative decoding: a CUDA-graphed GLM-4-9B draft proposes tokens, the distributed 744B verifies them, and async pipelining keeps multiple verify chunks in flight so the loop runs at the pipeline's throughput, not its latency. The takeaway for practitioners is architectural: if you can tolerate pipeline-parallel layer-streaming and lean on a small draft model, frontier-scale inference no longer requires a co-located cluster or a hyperscaler.
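For readers who want the mechanics, here is a minimal Python sketch of the two ideas in that write-up: contiguous layer partitioning and a greedy draft-then-verify loop. It is not Shard's code. The function names (`partition_layers`, `speculative_round`, `generate`) are hypothetical, the "models" are plain callables that return the next token, and the sketch leaves out the async pipelining, batching, and sampling a real system would need.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical stand-in for a model: maps a token context to the next token.
NextToken = Callable[[Sequence[Token]], Token]


def partition_layers(num_layers: int, block_size: int) -> List[range]:
    """Split layer indices into contiguous blocks (the last may be shorter)."""
    return [
        range(start, min(start + block_size, num_layers))
        for start in range(0, num_layers, block_size)
    ]


def speculative_round(
    prefix: Sequence[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[Token]:
    """One greedy draft-then-verify round. Returns the tokens to append."""
    # 1. The small draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed: List[Token] = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The large model checks each proposal. A real system would score all
    #    k positions in one pass through the layer pipeline; this toy version
    #    calls the target once per position for clarity.
    ctx = list(prefix)
    out: List[Token] = []
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            out.append(expected)  # first mismatch: keep the target's token
            return out
        out.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target's next token is a bonus.
    out.append(target_next(ctx))
    return out


def generate(
    prefix: Sequence[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
    max_new: int,
) -> List[Token]:
    """Run speculative rounds until max_new tokens have been produced."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        tokens.extend(speculative_round(tokens, draft_next, target_next, k))
    return tokens[: len(prefix) + max_new]


# Layer ranges for a 78-layer model in 13-layer blocks.
shards = partition_layers(78, 13)
print([(s.start, s.stop - 1) for s in shards])
```

With greedy decoding the output is identical to what the large model would produce alone; the draft only changes how many tokens each verification round yields. That is why the WAN round-trip is paid once per accepted chunk rather than once per token.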

Also This Week

NVIDIA Blackwell swept all seven MLPerf Training 6.0 benchmarks; CoreWeave trained DeepSeek-V3 671B in 2.02 minutes at 8,192 GPUs · NVIDIA · 2026-06-16
→ The round added DeepSeek-V3 671B and GPT-OSS-20B as new mixture-of-experts pre-training workloads, and CoreWeave's record run reached the quality target in 2.02 minutes at 8,192-GPU scale — so MoE training-at-scale is now a measured, reproducible benchmark instead of a vendor claim.

Apple's Core AI succeeds Core ML as an on-device LLM framework supporting 3B-to-70B reasoning models across iPhone, iPad, Mac, and Vision Pro · InfoQ · 2026-06-20
→ A memory-safe Swift API with zero-copy data paths runs workloads across CPU, GPU, and Neural Engine, and a PyTorch converter exports straight to the format — so the on-device deployment target for quantized open models just got a first-party toolchain with no per-token cloud cost.

From the Lab

VibeThinker-3B: a 3B dense model scoring 94.3 on AIME26 · arXiv · 2026-06-15
→ Weibo's team post-trained a 3B dense model that scores 94.3 on AIME26 (97.1 with claim-level test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6, with a 96.1% acceptance rate on unseen LeetCode contests — matching or exceeding DeepSeek V3.2, GLM-5, and Gemini 3 Pro at orders of magnitude fewer parameters. The method is the contribution: a "Spectrum-to-Signal" pipeline of curriculum SFT, multi-domain RL, and offline self-distillation. The paper's framing — its "Parametric Compression-Coverage Hypothesis," that verifiable reasoning compresses into a compact "reasoning core" while broad knowledge needs parameter coverage — is the part worth arguing with, since AIME contamination and test-time-scaling caveats apply. Read it if you do RL post-training or care about where the small-model floor actually sits.
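A note on reading Pass@1 figures like the one above. Code benchmarks commonly report the unbiased pass@k estimator popularized by the HumanEval evaluation: sample n completions per problem, count the c that pass, and average the estimate over problems. Whether the VibeThinker paper computes its LiveCodeBench number exactly this way is an assumption here, and the sample counts below are made up for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n = completions sampled, c = completions that passed, k = attempt budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k completions failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Hypothetical per-problem results: (samples drawn, samples that passed).
example = [(10, 10), (10, 3), (10, 0)]
print(f"pass@1 = {benchmark_pass_at_k(example, 1):.3f}")
print(f"pass@5 = {benchmark_pass_at_k(example, 5):.3f}")
```

For k = 1 the estimator reduces to the plain fraction c / n, so a Pass@1 score is the average per-problem success rate of a single sample. It says nothing about what the model reaches with retries or test-time scaling, which is worth keeping in mind when comparing it to the 97.1 AIME26 figure.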

Worth Reading

GLM-5.2 is probably the most powerful text-only open weights LLM · Simon Willison · 2026-06-17
→ Hands-on: 753B total / 40B active MoE, a 1M-token context window (up from 200K), MIT-licensed on June 16, ranking 2nd on Code Arena WebDev behind only Claude Fable 5. This is the model behind this week's distributed-inference race, with the pricing and provider detail you need before self-hosting.

NVIDIA Blackwell Tops MLPerf Training 6.0 · NVIDIA · 2026-06-16
→ The engineering write-up behind the headline: per-accelerator normalization, Spectrum-X Ethernet scaling, and how the GB300 NVL72 hit 1.6× over GB200 — useful if you size training clusters.

The frontier is still expensive to build — but this week proved it's getting cheap to borrow, one layer-shard and one 3B model at a time.

— Alexis
