deep-learning News: NVIDIA validates NVFP4 pretraining on a 12B Mamba-Transformer at 10T tokens — May 26, 2026

NVFP4 makes the leap from theory to 10T tokens, and Mamba-3 keeps eating into the attention budget.


The week's center of gravity sits in numerical precision: NVIDIA published a 4-bit pretraining recipe that holds at the multi-trillion-token horizon, which is the regime where every prior low-precision claim has cracked. Around it, the inference-first wave from Mamba-3 keeps spreading — and a fresh streaming-inference paper attacks prefill latency from another angle. Meanwhile the ViT community quietly admitted that "registers" are not enough.



Key Takeaways

  • NVFP4 closed the FP8 gap at 10T tokens. Stochastic rounding on grads, 2D 16×16 weight scaling, Hadamard rotations on Wgrad, and BF16 retention on ~16% of linears were all required — none optional.
  • Mamba-3 is an inference-first design. Half the state of Mamba-2 for matched perplexity; the short causal conv is gone, replaced by biases on B/C plus a new discretization-based recurrence.
  • Streaming inference is the new prefill story. Stateful KV-cache advancement makes query latency O(|q|), independent of accumulated context.
  • ViT representations are still leaking attention sinks. Registers helped — but didn't solve it.
  • The optimizer race isn't over. Muon variants (AdaMuon, NAMO) keep posting wall-clock wins over AdamW at the 100M–1B scale where most fine-tuning lives.
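The Hadamard rotations in the NVFP4 takeaway rest on a simple property: an orthonormal rotation spreads a single outlier across the whole vector while leaving dot products unchanged. The sketch below is a generic illustration in plain Python, not NVIDIA's kernel; the helper names (`hadamard`, `random_hadamard`, `rotate`) and the toy vectors are assumptions made for the example.

```python
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def random_hadamard(n, rng):
    """Orthonormal rotation: a random sign per row, then a Hadamard matrix scaled by 1/sqrt(n)."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    scale = n ** -0.5
    return [[signs[i] * v * scale for v in row] for i, row in enumerate(hadamard(n))]

def rotate(vec, rot):
    """Row vector times rotation matrix."""
    n = len(vec)
    return [sum(vec[i] * rot[i][j] for i in range(n)) for j in range(n)]

rng = random.Random(0)
rot = random_hadamard(16, rng)

x = [0.1] * 16
x[3] = 8.0                                   # one outlier dominates the dynamic range
w = [rng.uniform(-1.0, 1.0) for _ in range(16)]

xr, wr = rotate(x, rot), rotate(w, rot)

dot_before = sum(a * b for a, b in zip(x, w))
dot_after = sum(a * b for a, b in zip(xr, wr))   # unchanged, because rot is orthonormal
peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in xr)             # the outlier is now spread over 16 entries

print(f"dot product: {dot_before:.6f} vs {dot_after:.6f}")
print(f"largest magnitude: {peak_before:.3f} -> {peak_after:.3f}")
```

A flatter vector wastes fewer of the few levels a 4-bit format offers, which is the usual motivation for rotating before quantizing.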

The Big Story

NVIDIA validates NVFP4 pretraining on a 12B Mamba-Transformer at 10T tokens · May 18 · MarkTechPost · NVIDIA blog
The headline number is MMLU-Pro 62.58% in NVFP4 vs 62.62% in FP8, a 0.04-point gap that amounts to statistical noise, on a Nemotron-Nano-12B-v2-Base architecture (6 self-attention / 28 FFN / 28 Mamba-2 blocks). What matters isn't the score but the recipe: NVFP4 GEMMs hit 2× FP8 throughput on GB200 and 3× on GB300, and the convergence cocktail (16×16 random Hadamard transforms, or RHTs, on Wgrad inputs; 2D weight scaling; stochastic gradient rounding; BF16 islands) finally makes a "no-asterisk" 4-bit pretraining run reproducible. If you've been pricing your next training cluster against FP8 throughput, that math just changed.
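To see why stochastic rounding matters for gradients, here is a minimal sketch of block-scaled 4-bit quantization. It assumes the E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and a 16-element block, and it simplifies the scale to a plain Python float rather than a low-precision scale factor. The function names and the toy gradient block are hypothetical; this is not the recipe's actual implementation.

```python
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # magnitudes representable in E2M1

def round_to_grid(mag, rng=None):
    """Round a magnitude in [0, 6] onto the grid; stochastic when an rng is supplied."""
    mag = min(mag, FP4_GRID[-1])
    for lo, hi in zip(FP4_GRID, FP4_GRID[1:]):
        if lo <= mag <= hi:
            if rng is None:
                # Round to nearest; ties go to the lower level (a simplification).
                return lo if mag - lo <= hi - mag else hi
            # Round up with probability proportional to the distance from lo.
            p_up = (mag - lo) / (hi - lo)
            return hi if rng.random() < p_up else lo
    return FP4_GRID[-1]

def quantize_block(block, rng=None):
    """Scale so the largest magnitude maps to 6, round each entry, then rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_GRID[-1]
    out = []
    for v in block:
        q = round_to_grid(abs(v) / scale, rng)
        out.append((q if v >= 0 else -q) * scale)
    return out

# Toy gradient block: one large entry and fifteen small ones.
block = [6.0] + [0.2] * 15

nearest = quantize_block(block)
nearest_mean = sum(nearest[1:]) / 15          # every small entry collapses to zero

rng = random.Random(0)
trials = 2000
sr_total = 0.0
for _ in range(trials):
    sr_total += sum(quantize_block(block, rng)[1:])
sr_mean = sr_total / (trials * 15)            # averages back toward the true 0.2

print(f"true small value: 0.2, nearest mean: {nearest_mean}, stochastic mean: {sr_mean:.4f}")
```

Round-to-nearest erases the small entries on every step, while stochastic rounding is noisy per step but unbiased on average, which is the property that matters when gradients are accumulated over many updates.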


Also This Week

Mamba-3: methodological deep dive lands on Tri Dao and Goomba Lab · ongoing · Tri Dao · Goomba Lab
Exponential-trapezoidal discretization plus a complex-valued SSM finally lets state-space models track parity (whether a running count is odd or even), closing the last expressivity gap practitioners cared about versus attention.
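A toy scalar recurrence shows why complex-valued transitions help with parity. Multiplying the state by exp(iπ·x_t) flips its sign on every 1 and leaves it alone on every 0, so the sign of the state is the running parity. A transition restricted to positive real numbers can only shrink the state and never flips it. This is an illustration of the general idea under those assumptions, not Mamba-3's actual parametrization; both function names are hypothetical.

```python
import cmath

def parity_via_rotation(bits):
    """Scalar recurrence h_t = exp(i*pi*x_t) * h_{t-1}; the sign of Re(h_t) encodes parity."""
    h = 1 + 0j
    out = []
    for b in bits:
        h *= cmath.exp(1j * cmath.pi * b)   # rotate by pi when b == 1, identity when b == 0
        out.append(0 if h.real > 0 else 1)
    return out

def positive_decay_states(bits, decay=0.9):
    """Same recurrence with a positive real transition: the state shrinks but stays positive."""
    h = 1.0
    states = []
    for b in bits:
        h *= decay if b else 1.0
        states.append(h)
    return states

bits = [1, 0, 1, 1, 0, 1]
tracked = parity_via_rotation(bits)
decayed = positive_decay_states(bits)

print("bits:         ", bits)
print("parity:       ", tracked)
print("decay states: ", [round(s, 3) for s in decayed])
```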

Vision Transformers Need More Than Registers (CVPR 2026) · GitHub
Register tokens dampen attention sinks but don't kill them; LAST-ViT adds learned local-aggregation tokens that clean up the residual artifacts in dense prediction.

Open-source Mamba-3 reportedly +4% on language modeling vs Transformer baselines · VentureBeat
The "cold GPU at decode" framing is becoming the dominant pitch — inference latency and cache footprint are now the benchmark, not just perplexity.

PyTorch/XLA 2.7 ships a JAX bridge plus Pallas paged-attention kernel · PyTorch blog
Calling JAX ops inside an XLA graph from PyTorch is the clearest sign yet that the two frameworks are merging at the compiler layer, not the API.


From the Lab

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers · arXiv 2605.13784 · arXiv · May 13
A persistent KV cache is advanced incrementally as each token arrives, which moves prefill off the critical path. Query latency becomes O(|q|), independent of accumulated context, which is the right primitive for long-running agents that today re-prefill on every turn.
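A rough sketch of the stateful idea, assuming a toy single-head attention with identity "projections" in place of learned weights: context is encoded into the cache as it streams in, so at query time only the |q| new tokens need encoding. The class and counter names are hypothetical and this is not the paper's implementation. Note that the `attend` step in this toy still reads every cached entry; the sketch only illustrates which tokens are encoded on the query path.

```python
import math

class StatefulSession:
    """Toy single-head attention whose KV cache persists across turns."""

    def __init__(self):
        self.keys, self.values = [], []
        self.tokens_encoded = 0            # total tokens turned into K/V entries

    def advance(self, embeddings):
        """Append K/V entries for new tokens only; earlier entries are reused as-is."""
        for x in embeddings:
            self.keys.append(list(x))      # identity projection keeps the toy short
            self.values.append(list(x))
            self.tokens_encoded += 1

    def attend(self, query):
        """Softmax attention of one query vector over everything cached so far."""
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in self.keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        return [sum(w * v[d] for w, v in zip(weights, self.values)) / total
                for d in range(len(query))]

context = [[float(i % 3), 1.0] for i in range(1000)]
question = [[1.0, 0.0], [0.0, 1.0]]

# Stateful: context is ingested ahead of time, off the query path.
session = StatefulSession()
session.advance(context)
before = session.tokens_encoded
session.advance(question)                  # only |q| tokens are encoded at query time
query_time_work = session.tokens_encoded - before
answer_stateful = session.attend(question[-1])

# Stateless baseline: re-prefill the whole history on every turn.
fresh = StatefulSession()
fresh.advance(context + question)
reprefill_work = fresh.tokens_encoded
answer_stateless = fresh.attend(question[-1])

print(f"tokens encoded at query time: stateful={query_time_work}, re-prefill={reprefill_work}")
```

Both paths produce the same attention output; the difference is how much encoding sits between the user's question and the first generated token.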

Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks · arXiv 2605.25452 · arXiv · May 25
Reconciles uniform-convergence bounds with WL-style expressivity arguments under a single statistical frame. Practically, it explains why GNNs that should generalize from a learning-theory view stall on real graphs — the bottleneck is graph-isomorphism expressivity, not sample complexity.

A Generative Pretrained Transformer with Kerr-Soliton Attention · arXiv 2605.24124 · arXiv · May 22
Realizes the attention operation in driven-dissipative nonlinear photonic hardware. Not production-ready, but it's the first end-to-end physical realization of attention that validates against a software baseline — interesting if you've been tracking optical compute as the post-Blackwell exit.



The week's quiet through-line: the frontier is no longer pre-training perplexity — it's how cheaply you can serve a token at decode. NVFP4, Mamba-3, and stateful streaming inference are all answering the same question.