The architectural monoculture cracked this week. For three years the frontier ran on a single recipe — dense attention, real-valued states, autoregressive decoding, FP8 arithmetic — and this week every component of that recipe absorbed a credible challenger. The interesting question is no longer which axis will flex first, but how many axes a shop can renegotiate at once without stalling its pipeline. Treat the default stack as a negotiable set of priors, not a platform.

Key Takeaways

  • Revisit long-context serving economics this quarter. Linear-time sequence models just reached transformer-equivalent quality at half the hidden state, which rewrites the cost curve for 100k-token workloads before any procurement cycle closes.
  • Audit any attention-based forecaster against a squared-loss baseline. Under low-SNR, MSE-trained regimes, there is now a formal proof that more expressivity makes predictions worse — if you run anomaly detection or financial time-series on a transformer, assume the architecture is the bug.
  • Stop treating non-autoregressive decoding as a throughput trick. A diffusion LM just beat a 2x-larger autoregressive baseline on reasoning benchmarks, so the case for parallel decoding on agentic workloads is now quality-positive, not a tradeoff.
  • Move FP8 from "default" to "legacy" in your precision policy. 4-bit block-scaled formats are landing within 1% of FP8 on frontier evals with 2-3x throughput, and the teams still pinning FP8 for training are paying the tax twice — in arithmetic and in memory bandwidth.
  • Put a neuro-symbolic baseline on the eval sheet for any embodied project. A 100x energy cut and 95% task success against a 34% neural baseline means the "scale the network" default is no longer automatic for control or VLA stacks.

The Big Story

Mamba-3 Lands a Half-State SSM That Matches Transformer Perplexity · April 15, 2026 · OpenReview
The most important non-transformer result of the quarter. Lahoti, Li, Chen, Wang, Bick, Kolter, Dao, and Gu replace Mamba-2's real-valued diagonal state recurrence with a complex-valued update derived from a tighter discretization of the underlying continuous-time SSM. Complex states encode phase alongside magnitude, which lets a given hidden dimension carry roughly twice the information — Mamba-2-equivalent perplexity at half the state size, verified from 130M to 1.5B. The MIMO variant adds grouped multi-input/multi-output projections and gains +1.2 accuracy at 1.5B without added FLOPs. Linear-time inference at transformer-equivalent quality is now a deployable profile; serving-cost math at long context starts to look genuinely different from attention's quadratic floor.
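The mechanics are easiest to see in a toy recurrence. Below is a minimal NumPy sketch of a complex-valued diagonal SSM scan — an illustration of the idea that phase rides alongside magnitude in each state channel, not Mamba-3's actual implementation or parameterization:

```python
import numpy as np

def complex_ssm_scan(x, log_mag, phase, B, C):
    """Toy complex diagonal SSM recurrence (illustrative, not Mamba-3's code).

    x:       (T, d_in) input sequence
    log_mag: (n,) log-magnitudes of the diagonal transition (negative => stable)
    phase:   (n,) rotation angles; the state carries phase as well as magnitude
    B:       (n, d_in) input projection
    C:       (d_out, n) output projection (real part taken at readout)
    """
    A = np.exp(log_mag + 1j * phase)        # diagonal transition, |A| < 1
    h = np.zeros(A.shape, dtype=complex)
    ys = []
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]                # elementwise decay + rotation, plus input
        ys.append((C @ h).real)             # real-valued readout
    return np.stack(ys)

T, d, n = 16, 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
y = complex_ssm_scan(x,
                     log_mag=-rng.uniform(0.01, 0.5, n),
                     phase=rng.uniform(0, np.pi, n),
                     B=rng.normal(size=(n, d)) / np.sqrt(d),
                     C=rng.normal(size=(d, n)) / np.sqrt(n))
print(y.shape)  # (16, 4)
```

The rotation term is the point: a real diagonal state can only decay, while a complex one decays and oscillates, which is where the claimed doubling of information per hidden dimension comes from.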


Also This Week

Introspective Diffusion LMs Close the Autoregressive Gap · April 14, 2026 · project page
I-DLM-8B beats LLaDA-2.1-mini (16B) on AIME-24 and LiveCodeBench. Unlike prior diffusion LMs with a fixed masked-token schedule, I-DLM adds a learned introspective scheduler — a critic that predicts per-token uncertainty and adaptively reorders the denoising trajectory — trained with masked-diffusion loss plus a self-consistency objective over intermediate states. First credible non-autoregressive win on reasoning, not just throughput.
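To make "adaptively reorders the denoising trajectory" concrete, here is a toy version of uncertainty-ordered unmasking — the function name and chunked schedule are illustrative assumptions, not I-DLM's actual scheduler:

```python
import numpy as np

def adaptive_unmask_order(uncertainty, chunk=2):
    """Toy uncertainty-ordered denoising schedule (not I-DLM's scheduler).

    Instead of a fixed left-to-right or random order, reveal the tokens the
    critic is most confident about first, `chunk` per denoising step.
    """
    masked = set(range(len(uncertainty)))
    order = []
    while masked:
        # take the `chunk` lowest-uncertainty positions still masked
        step = sorted(masked, key=lambda i: uncertainty[i])[:chunk]
        order.append(step)
        masked -= set(step)
    return order

u = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3]   # pretend per-token critic scores
print(adaptive_unmask_order(u))
# → [[3, 1], [5, 2], [4, 0]] — easiest tokens first, hardest last
```

Committing easy tokens early lets later denoising steps condition on them, which is the plausible mechanism behind a quality win rather than just a speed win.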

Routing Mamba: Scaling SSMs with Mixture-of-Experts Projection · April 2026 · Microsoft Research
Sparse projection MoE grafted onto the SSM input/output projections — early evidence the MoE-vs-SSM choice is not binary, and that routing can be the scaling axis for both.
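A sparse projection layer of this general shape can be sketched in a few lines — this is an assumed top-k-routed form for illustration; Routing Mamba's exact router and placement may differ:

```python
import numpy as np

def moe_projection(x, experts, router_w, k=2):
    """Sketch of a sparse MoE projection (assumed form, not the paper's design).

    x:        (d,) one token's features
    experts:  (E, d_out, d) candidate projection matrices
    router_w: (E, d) router weights
    """
    logits = router_w @ x
    topk = np.argsort(logits)[-k:]                       # indices of top-k experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                                 # softmax over selected experts
    return sum(g * (experts[e] @ x) for g, e in zip(gates, topk))

rng = np.random.default_rng(1)
d, d_out, E = 8, 4, 4
y = moe_projection(rng.normal(size=d),
                   experts=rng.normal(size=(E, d_out, d)),
                   router_w=rng.normal(size=(E, d)))
print(y.shape)  # (4,)
```

Only k of the E expert matrices touch each token, so parameter count scales with E while per-token FLOPs scale with k — the same lever MoE pulls in transformers, applied to the SSM's projections.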

NVFP4 Post-Training Quantization Hits Near-FP8 Accuracy · April 2026 · NVIDIA Research
With a single quantization-aware distillation pass, NVFP4 lands within 1% of FP8 on frontier eval suites. 16-element block size (vs. 32 for MXFP4), FP8 block scales, and a second-level FP32 tensor scale — the precision floor just dropped again, and inference costs with it.
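The two-level scaling scheme is easy to simulate numerically. The sketch below rounds to the E2M1 (FP4) value grid with a per-16-element block scale and a tensor-level scale — a simulation of the arithmetic, not NVIDIA's kernels or exact rounding rules (in hardware the block scale is itself stored in FP8):

```python
import numpy as np

# E2M1 (FP4) representable magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def nvfp4_sim(x, block=16):
    """Simulate two-level block-scaled FP4 quantize/dequantize (illustrative)."""
    x = np.asarray(x, dtype=np.float32)
    tensor_scale = max(float(np.abs(x).max()) / 6.0, 1e-12)   # FP32 second level
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        blk = x[i:i + block] / tensor_scale
        bscale = max(float(np.abs(blk).max()) / 6.0, 1e-12)   # per-block scale
        # snap each magnitude to the nearest representable FP4 value
        idx = np.abs(np.abs(blk[:, None]) / bscale - FP4_GRID).argmin(axis=1)
        out[i:i + block] = np.sign(blk) * FP4_GRID[idx] * bscale * tensor_scale
    return out

x = np.random.default_rng(2).normal(size=64).astype(np.float32)
err = float(np.abs(nvfp4_sim(x) - x).max())
print(err)  # small reconstruction error relative to unit-scale inputs
```

The smaller 16-element block (vs. MXFP4's 32) means each scale covers a narrower dynamic range, which is where the tighter error bound comes from.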


From the Lab

Mamba-3: Improved Sequence Modeling Using State Space Principles · Lahoti, Li, Chen, Wang, Bick, Kolter, Dao, Gu, 2026 · OpenReview
Complex state updates, tighter discretization, and a MIMO variant combine for Mamba-2 perplexity at half the state, plus +1.2 accuracy at 1.5B. Read alongside Routing Mamba — the SSM stack is now compositional.

Forecast Collapse of Transformer-Based Models Under Squared Loss in Financial Time Series · Andreoletti, 2026 · arXiv:2604.00064
Formal inverse-scaling proof: in low-SNR regimes under MSE, where the conditional expectation of future trajectories is effectively degenerate, increasing transformer expressivity strictly increases forecast error — extra capacity spends itself on spurious fluctuations around the Bayes-optimal predictor, raising variance without reducing bias. Teams using attention for financial time-series or heavy-tailed anomaly detection need to re-examine both loss and architecture.
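The bias-variance core of the claim can be demonstrated in a few lines — a toy illustration of the mechanism, not the paper's proof: when the conditional mean is constant and the target is pure noise, the Bayes-optimal MSE predictor is the mean, and a maximally expressive predictor that memorizes the training noise roughly doubles expected test error:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
y_train = rng.normal(size=n)    # "past" observations: noise around a flat signal
y_test = rng.normal(size=n)     # fresh noise at the same inputs

# Low capacity: predict the training mean (approximates the Bayes predictor).
mse_mean = np.mean((y_train.mean() - y_test) ** 2)     # ~ sigma^2
# High capacity: memorize the training values exactly.
mse_memo = np.mean((y_train - y_test) ** 2)            # ~ 2 * sigma^2

print(round(mse_mean, 2), round(mse_memo, 2))
```

Every unit of capacity past the mean predictor is spent modeling noise, so test error rises monotonically with expressivity — the transformer-specific result formalizes this for attention under MSE.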

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery · NVIDIA Research, 2026 · NVIDIA
Reference for large-batch inference on Blackwell-class hardware: sub-1% gap to FP8 after one distillation pass, with a two-level scaling scheme that keeps block-wise error bounded without FP32 accumulation overhead.


The architecture debate has three live fronts — attention versus SSMs, autoregressive versus diffusion, neural versus hybrid — and each landed a real result in the same seven days. Anyone still training the January model is compounding a quiet efficiency deficit.