NVFP4 makes the leap from theory to 10T tokens, and Mamba-3 keeps eating into the attention budget.
The week's center of gravity sits in numerical precision: NVIDIA published a 4-bit pretraining recipe that holds at the multi-trillion-token horizon, which is the regime where every prior low-precision claim has cracked. Around it, the inference-first wave from Mamba-3 keeps spreading — and a fresh streaming-inference paper attacks prefill latency from another angle. Meanwhile the ViT community quietly admitted that "registers" are not enough.
Watch & Listen First
- Chip Design from the Bottom Up — Reiner Pope (Dwarkesh, ~May 22) — a blackboard walkthrough from logic gates up to why GPUs, TPUs, FPGAs and brains look the way they do, by MatX's CEO and ex-Google TPU compiler lead.
- Latent Space — Jake Cooper on agent-native bare-metal infra (May 20) — Railway's CEO on stateful sandboxes, scheduler-aware KV serving, and why hyperscalers keep losing the inference-per-watt fight.
- Latent Space — Ivan Burazin on Daytona AI sandboxes (May 21) — composable computers for agents and the trade-offs of running stateful inference on bare metal.
Key Takeaways
- NVFP4 closed the FP8 gap at 10T tokens. Stochastic rounding on gradients, 2D 16×16 weight scaling, Hadamard rotations on Wgrad, and BF16 retention on ~16% of linears were all required.
- Mamba-3 is an inference-first design. It matches Mamba-2 perplexity with half the state size; the short causal conv is gone, replaced by biases on B/C plus a new discretization-based recurrence.
- Streaming inference is the new prefill story. Stateful KV-cache advancement makes query latency O(|q|), independent of accumulated context.
- ViT feature maps still show attention-sink artifacts. Registers reduced them but did not eliminate them.
- The optimizer race isn't over. Muon variants (AdaMuon, NAMO) keep posting wall-clock wins over AdamW at the 100M–1B scale where most fine-tuning lives.
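The Hadamard rotations in the NVFP4 takeaway can be illustrated with a small NumPy sketch. It assumes a 16×16 orthonormal Hadamard matrix with random sign flips applied to both GEMM operands, so the product is mathematically unchanged while a single outlier is spread across the block. The names `hadamard` and `random_hadamard` are hypothetical, and this is an illustration of the idea, not NVIDIA's kernel.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix, scaled to be orthonormal."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def random_hadamard(n: int, rng: np.random.Generator) -> np.ndarray:
    """Hadamard matrix with random row sign flips; still orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs[:, None]

rng = np.random.default_rng(0)
r = random_hadamard(16, rng)

a = rng.normal(size=(4, 16))
a[0, 3] = 100.0                 # one large outlier in the block
b = rng.normal(size=(16, 8))

a_rot = a @ r                   # rotate the left operand
b_rot = r.T @ b                 # counter-rotate the right operand

# The rotation cancels in the product, but the outlier is now spread out,
# so a shared block scale wastes less of the 4-bit range.
print(np.abs(a).max(), np.abs(a_rot).max())
print(np.allclose(a_rot @ b_rot, a @ b))
```

In a real low-precision pipeline the rotated operands would be quantized before the GEMM; here they stay in float64 so the cancellation is exact up to rounding.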
The Big Story
NVIDIA validates NVFP4 pretraining on a 12B Mamba-Transformer at 10T tokens · May 18 · MarkTechPost · NVIDIA blog
→ The headline number is MMLU-Pro 62.58% in NVFP4 vs 62.62% in FP8, a gap within statistical noise, on a Nemotron-Nano-12B-v2-Base architecture (6 self-attention / 28 FFN / 28 Mamba-2 blocks). What matters isn't the score but the recipe: NVFP4 GEMMs hit 2× FP8 throughput on GB200 and 3× on GB300, and the convergence cocktail (16×16 random Hadamard transforms, or RHTs, on Wgrad inputs; 2D weight scaling; stochastic gradient rounding; BF16 islands) finally makes a "no-asterisk" 4-bit pretraining run reproducible. If you've been pricing your next training cluster against FP8 throughput, that math just changed.
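Two of the recipe's ingredients, block scaling and stochastic rounding, can be sketched as a "fake quantization" in NumPy. This is a minimal sketch under stated assumptions: values are snapped to the FP4 E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6), block scales are kept in full floating point rather than the FP8 scales real NVFP4 stores, and ties in nearest rounding go up rather than to even. The helper names `round_to_grid` and `fake_quant_fp4` are hypothetical.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(mag, rng=None):
    """Round magnitudes in [0, 6] onto FP4_GRID.

    Nearest rounding if rng is None, otherwise stochastic rounding:
    round up with probability equal to the fractional position between
    the two neighbouring grid points, which is unbiased in expectation.
    """
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="left"), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    frac = (mag - lo) / (hi - lo)
    up = frac >= 0.5 if rng is None else rng.random(mag.shape) < frac
    return np.where(up, hi, lo)

def fake_quant_fp4(x, block=(1, 16), rng=None):
    """Quantize and dequantize x with one shared scale per block.

    block=(1, 16) gives 1D blocks along rows; block=(16, 16) gives the
    2D tiles described for weights.
    """
    rows, cols = x.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    tiles = x.reshape(rows // br, br, cols // bc, bc)
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)  # map block max to 6
    q = np.sign(tiles) * round_to_grid(np.abs(tiles) / scale, rng)
    return (q * scale).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))
g = rng.normal(size=(16, 64))

w_q = fake_quant_fp4(w, block=(16, 16))                 # 2D-scaled weights
g_rtn = fake_quant_fp4(g)                               # nearest rounding
g_sr_mean = np.mean([fake_quant_fp4(g, rng=rng) for _ in range(500)], axis=0)

# Averaged over many draws, stochastic rounding tracks the true gradient
# far more closely than a single nearest-rounded copy.
print(np.abs(g_rtn - g).mean(), np.abs(g_sr_mean - g).mean())
```

The averaging is only there to make the unbiasedness visible; training uses a single stochastic draw per step and relies on the error averaging out across steps.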
Also This Week
Mamba-3: methodological deep dive lands on Tri Dao and Goomba Lab · ongoing · Tri Dao · Goomba Lab
→ Exponential-trapezoidal discretization plus a complex-valued SSM finally lets state-space models solve state-tracking tasks such as parity, closing the last expressivity gap practitioners cared about versus attention.
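Why complex-valued state matters for parity can be shown with a toy recurrence. This is a hedged illustration, not the Mamba-3 recurrence: it assumes a one-dimensional linear state update whose input-dependent transition is a rotation by π for a 1 bit and the identity for a 0 bit. A transition restricted to positive real decay factors can never flip the state's sign, so it cannot count modulo 2; a rotation can. The function name `parity_via_rotation` is hypothetical.

```python
import numpy as np

def parity_via_rotation(bits):
    """Track running parity with a 1-D complex linear recurrence h_t = a_t * h_{t-1}."""
    h = 1.0 + 0.0j
    out = []
    for b in bits:
        a = np.exp(1j * np.pi * b)    # rotate by pi on a 1, do nothing on a 0
        h = a * h
        out.append(int(h.real < 0))   # state near -1 means odd parity so far
    return out

bits = [1, 0, 1, 1]
print(parity_via_rotation(bits))      # → [1, 1, 0, 1]
```

The output matches the cumulative XOR of the input bits, which is the running parity.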
Vision Transformers Need More Than Registers (CVPR 2026) · GitHub
→ Register tokens dampen attention sinks but don't kill them; LAST-ViT adds learned local-aggregation tokens that clean up the residual artifacts in dense prediction.
Open-source Mamba-3 reportedly +4% on language modeling vs Transformer baselines · VentureBeat
→ The "cold GPU at decode" framing is becoming the dominant pitch — inference latency and cache footprint are now the benchmark, not just perplexity.
PyTorch/XLA 2.7 ships a JAX bridge plus Pallas paged-attention kernel · PyTorch blog
→ Calling JAX ops inside an XLA graph from PyTorch is the clearest sign yet that the two frameworks are merging at the compiler layer, not the API.
From the Lab
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers · arXiv 2605.13784 · arXiv · May 13
→ A persistent KV cache is advanced incrementally as tokens arrive, which moves prefill off the critical path. Query latency becomes O(|q|), independent of accumulated context, which is the right primitive for long-running agents that today re-prefill on every turn.
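The stateful-cache idea can be sketched with a single attention head in NumPy. This is a minimal sketch, not the paper's implementation: the class name `StatefulAttention`, its `advance` method, and the random projection weights are all hypothetical. Context chunks are fed to the cache as they arrive, so when a query shows up only its own |q| tokens need projecting.

```python
import numpy as np

class StatefulAttention:
    """Single-head causal attention with a persistent key/value cache."""

    def __init__(self, d_model, rng):
        self.wq, self.wk, self.wv = (
            rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
        )
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def advance(self, x):
        """Append K/V for new tokens x of shape (n, d); return their outputs."""
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        start = len(self.keys)                      # tokens already cached
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        scores = q @ self.keys.T / np.sqrt(x.shape[1])
        # Causal mask: new token i sits at absolute position start + i.
        pos = np.arange(len(self.keys))[None, :]
        mask = pos > (start + np.arange(len(x)))[:, None]
        scores = np.where(mask, -np.inf, scores)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ self.values

d = 8
context = np.random.default_rng(1).normal(size=(12, d))
query = np.random.default_rng(2).normal(size=(3, d))

# Stateful path: context is absorbed in chunks ahead of time.
stateful = StatefulAttention(d, np.random.default_rng(0))
for chunk in np.split(context, 3):
    stateful.advance(chunk)
out_stateful = stateful.advance(query)              # only 3 tokens projected here

# Stateless path: re-prefill the whole history on every turn.
stateless = StatefulAttention(d, np.random.default_rng(0))
out_stateless = stateless.advance(np.vstack([context, query]))[-len(query):]

print(np.allclose(out_stateful, out_stateless))
```

Both paths produce the same query outputs. In this sketch the projection work at query time scales with |q|, while the attention read still touches every cached key.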
Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks · arXiv 2605.25452 · arXiv · May 25
→ Reconciles uniform-convergence bounds with WL-style expressivity arguments under a single statistical frame. Practically, it explains why GNNs that should generalize from a learning-theory view stall on real graphs — the bottleneck is graph-isomorphism expressivity, not sample complexity.
A Generative Pretrained Transformer with Kerr-Soliton Attention · arXiv 2605.24124 · arXiv · May 22
→ Realizes the attention operation in driven-dissipative nonlinear photonic hardware. Not production-ready, but it's the first end-to-end physical realization of attention that validates against a software baseline — interesting if you've been tracking optical compute as the post-Blackwell exit.
Worth Reading
- NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit — the engineering writeup behind the 10T-token claim, with the actual scaling-block layout and gradient-quantization choices.
- Mamba-3 Part 1 — design philosophy — Tri Dao on why "inference-first" is the right north star and how the recurrence was re-derived from the underlying ODE.
- Princeton Language and Intelligence: Mamba-3 launch notes — the MIMO formulation explained clearly enough to implement against.
The week's quiet through-line: the frontier is no longer pretraining perplexity; it's how cheaply you can serve a token at decode. NVFP4, Mamba-3, and stateful streaming inference are all answering the same question.