Deep Learning: the 4-bit training wall just cracked · May 21, 2026

Curated by Alexis

Frontier labs went quiet on parameter count this week and loud on numerics. Microscaling 4-bit formats crossed from an inference trick into a validated pretraining recipe, diffusion language models learned to walk back their own decoding mistakes mid-generation, and a mean-field theory paper handed transformer training its first proper convergence proof. If you train models, the precision floor under your optimizer just dropped a level.

Watch & Listen First

NVIDIA AI Podcast — "Snap's Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark" · May 13 · Listen
→ The unglamorous half of deep learning: Snap's GPU-accelerated Spark pipeline cut data-prep cost 76% and eliminated 120 TB of disk spill — a reminder that training throughput dies in the dataloader before it dies in the kernel.

Latent Space — Ambient Clinical Intelligence with Abridge · May 14 · Listen
→ A working deployment of always-on transformer inference under hard latency budgets — worth it for how production constraints reshape model size and decoding strategy.

Key Takeaways

4-bit pretraining is real now. NVIDIA's 12B Mamba-Transformer, pretrained on 10 trillion tokens in NVFP4, matched its FP8 baseline on MMLU-Pro. It is the longest documented sub-8-bit run.
Weight gradients are the FP4 fault line. A separate MXFP4 study isolates Wgrad quantization as the primary driver of convergence degradation; deterministic (not random) Hadamard rotations fix it.
Diffusion LLMs can self-correct. Making parallel decoding revokable cuts decoding steps by more than 6× without the usual quality collapse.
Transformer training has a convergence proof. In the mean-field depth-and-width limit, training is a neural PDE — global minima are reachable under an NTK-injectivity condition.
MoE routing still quietly breaks. Deep-layer routing collapse appears on under-represented data; continual pretraining, not new auxiliary losses, repairs it.

The Big Story

NVIDIA validates 4-bit pretraining: a 12B Mamba-Transformer matches FP8 across 10 trillion tokens · May 18, 2026 · MarkTechPost
→ NVFP4 stores values as E2M1 in 16-element micro-blocks, each carrying an E4M3 block scale, plus one FP32 per-tensor scale. The run held validation loss within ~1% of an FP8 baseline and scored 62.58% against the baseline's 62.62% on MMLU-Pro 5-shot. The recipe is four stabilizers working together: ~16% of linear layers kept in BF16 (the first two and final eight of 62 blocks), Random Hadamard transforms on weight-gradient inputs, 2D block scaling so forward and backward passes quantize consistently, and stochastic rounding restricted to gradients. For practitioners the math is blunt: FP4 GEMMs run 2–3× faster than FP8 on GB200/GB300 and roughly halve operand memory, and NVFP4 reached target loss with 36% fewer tokens than MXFP4. It was not the only FP4 paper of the week.
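
To make the two-level scaling concrete, here is a minimal NumPy simulation of an NVFP4-style round trip. It is a sketch under stated assumptions, not NVIDIA's implementation or the Transformer Engine API: the function names are made up for illustration, the per-tensor scale is chosen so the largest block scale lands on the E4M3 maximum (one plausible choice), blocks are 1-D, and everything runs in float64 rather than packed 4-bit storage.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
E2M1_MAX, E4M3_MAX, BLOCK = 6.0, 448.0, 16


def snap_to_e2m1(x):
    """Round each value to the nearest E2M1 magnitude, saturating at 6."""
    mag = np.minimum(np.abs(x), E2M1_MAX)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def round_to_e4m3(x):
    """Round non-negative values to the nearest E4M3 value (3 mantissa bits)."""
    x = np.clip(x, 0.0, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(x, 2.0 ** -9)))
    exp = np.maximum(exp, -6.0)       # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)           # 8 steps per power of two
    return np.round(x / step) * step


def fake_quant_nvfp4(t):
    """Quantize then dequantize a 1-D array whose length is a multiple of 16."""
    blocks = np.asarray(t, dtype=np.float64).reshape(-1, BLOCK)
    tensor_amax = np.abs(blocks).max()
    if tensor_amax == 0.0:
        return np.zeros(blocks.size)
    # Assumed scale choice: the largest block scale lands on the E4M3 maximum.
    tensor_scale = tensor_amax / (E2M1_MAX * E4M3_MAX)
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = round_to_e4m3(block_amax / E2M1_MAX / tensor_scale)
    block_scale = np.where(block_scale == 0.0, 1.0, block_scale)  # avoid 0/0
    scale = block_scale * tensor_scale
    return (snap_to_e2m1(blocks / scale) * scale).reshape(-1)


rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
x_hat = fake_quant_nvfp4(x)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative round-trip error: {rel_err:.3f}")
```

The sketch leaves out the stabilizers listed above: it uses round-to-nearest rather than stochastic rounding, and 1-D rather than 2D block scaling.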

Also This Week

Zyphra converts an autoregressive LLM into an MoE diffusion model with a 7.7× decode speedup · May 15 · MarkTechPost
→ ZAYA1-8B-Diffusion-Preview is the first AR-to-diffusion conversion done at MoE scale, meaning teams can retrofit parallel diffusion sampling onto already-trained autoregressive weights instead of pretraining a diffusion model from scratch.

Mixture-of-experts routers quietly collapse in deep layers on under-represented data · ~May 17 · arXiv
→ A routing study finds pretrained MoE models stop discriminating between experts in their deepest layers on low-resource inputs, and continual pretraining on balanced data — not new auxiliary losses — is what restores specialization.

From the Lab

Pretraining Large Language Models with MXFP4 on Native FP4 Hardware · arXiv
→ The diagnostic counterpart to NVIDIA's recipe: in full MXFP4 pretraining of Llama 3.1-8B on C4, quantizing weight gradients (Wgrad) is isolated as the primary driver of convergence degradation, while FP4 in the forward pass and activation gradients adds only modest token overhead. Stochastic rounding and randomized Hadamard rotations fail to stabilize a quantized Wgrad — only deterministic Hadamard rotations consistently restore stable optimization.
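
Why a Hadamard rotation is a candidate fix at all: it is orthonormal, so applying it to both Wgrad operands along the contraction (token) axis leaves the exact product unchanged while spreading outliers across a block before quantization. The sketch below checks both properties in NumPy. The block size of 16, the helper names, and the random-sign construction of the randomized variant are assumptions for illustration, not the paper's code, and the sketch does not reproduce the paper's finding about which variant trains stably.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def rotate_blocks(x, h):
    """Apply h to consecutive row blocks of x (rows are tokens, the Wgrad contraction axis)."""
    n = h.shape[0]
    tokens, feat = x.shape
    return (h @ x.reshape(tokens // n, n, feat)).reshape(tokens, feat)


rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 8))    # layer inputs, shape (tokens, d_in)
grads = rng.standard_normal((64, 4))   # output gradients, shape (tokens, d_out)

h_det = hadamard(16)                                # deterministic rotation
h_rand = h_det * rng.choice([-1.0, 1.0], size=16)   # random-sign variant

# In exact arithmetic the rotation cancels inside the weight-gradient GEMM.
wgrad = acts.T @ grads
wgrad_det = rotate_blocks(acts, h_det).T @ rotate_blocks(grads, h_det)
wgrad_rand = rotate_blocks(acts, h_rand).T @ rotate_blocks(grads, h_rand)
print(np.allclose(wgrad, wgrad_det), np.allclose(wgrad, wgrad_rand))

# A lone outlier in a 16-row block is spread evenly across the block.
spike = np.zeros((16, 1))
spike[3, 0] = 100.0
spread = rotate_blocks(spike, h_det)
print(np.abs(spike).max(), np.abs(spread).max())
```

In a real FP4 pipeline the rotated operands would be quantized before the GEMM, so the cancellation is only approximate; the smaller dynamic range per block is what reduces the quantization error.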

Roll Out and Roll Back: Diffusion LLMs Are Their Own Efficiency Teachers · arXiv
→ Diffusion LLMs lose quality when you reveal multiple tokens per step because irreversible decoding amplifies a train–inference mismatch. WINO makes parallel decoding revokable — roll tokens out, roll the unreliable ones back — and WINO+ distills that verified denoising order into the weights, lifting GSM8K from 73.24% to 76.58% with a 6.83× step reduction and reaching 16.22× on Flickr30K.
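
The roll-out / roll-back loop can be sketched with a toy stand-in for the model. Everything here is hypothetical and exists only to show the control flow: `toy_denoiser` fakes a diffusion LM that guesses wrongly on every third position unless both neighbours are revealed, and the thresholds `tau_draft` and `tau_verify` are illustrative values, not WINO's settings.

```python
MASK = -1


def toy_denoiser(seq, target):
    """Hypothetical model stand-in: returns (predictions, confidences) per position.

    Assumption for illustration: with both neighbours revealed a position is
    predicted correctly at high confidence; otherwise every third position
    gets a plausible but wrong token at moderate confidence.
    """
    n = len(seq)
    preds, confs = [], []
    for i in range(n):
        left = i == 0 or seq[i - 1] != MASK
        right = i == n - 1 or seq[i + 1] != MASK
        if left and right:
            preds.append(target[i])
            confs.append(0.95)
        elif i % 3 == 0:
            preds.append(target[i] + 1)
            confs.append(0.6)
        else:
            preds.append(target[i])
            confs.append(0.7)
    return preds, confs


def revocable_decode(target, tau_draft=0.5, tau_verify=0.9, revoke=True, max_steps=100):
    """Parallel decoding with optional roll-back; returns (sequence, steps)."""
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq and steps < max_steps:
        steps += 1
        # Roll out: reveal every masked position that clears the loose threshold.
        preds, confs = toy_denoiser(seq, target)
        seq = [preds[i] if tok == MASK and confs[i] >= tau_draft else tok
               for i, tok in enumerate(seq)]
        if not revoke:
            continue
        # Roll back: re-score revealed tokens with the richer context and
        # re-mask any token the model now confidently disagrees with.
        preds, confs = toy_denoiser(seq, target)
        seq = [MASK if tok != MASK and tok != preds[i] and confs[i] >= tau_verify else tok
               for i, tok in enumerate(seq)]
    return seq, steps


target = list(range(10, 22))
seq, steps = revocable_decode(target)
seq_irrev, steps_irrev = revocable_decode(target, revoke=False)
print(steps, seq == target)
print(steps_irrev, seq_irrev == target)
```

In this toy, irreversible parallel decoding finishes in one step but keeps its early mistakes, while the revocable loop spends one extra step to re-mask and repair them. The numbers say nothing about real models; only the loop structure carries over.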

Training Infinitely Deep and Wide Transformers · arXiv
→ Take depth and width to infinity together and transformer training stops looking like a ResNet's neural ODE and becomes control of a neural PDE, because attention couples the token distributions. The authors derive an explicit Wasserstein gradient and prove gradient flow reaches global minima under an NTK-injectivity condition — shown equivalent to linear independence of log-sum-exp functions modulo affine terms.

Worth Reading

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy — NVIDIA's own engineering write-up: the deep version of the Big Story, with the Transformer Engine recipe spelled out.
The Newest Google and Nvidia Chips Speed AI Training — IEEE Spectrum on the MLPerf training picture that 4-bit numerics are now actively reshaping.

The compute curve is bending through the mantissa now, not the parameter count — for once, watch the bits, not the billions.
