Two blows landed on quadratic attention this week, from opposite directions. A Miami startup shipped a production model running 52x faster than dense attention at 1M tokens via a linear-time mechanism, and Tri Dao published a CuTeDSL rewrite that pushes Blackwell B200 BF16 attention to 71% of peak for the workloads still wedded to softmax. Add a Bayes-optimal scaling-law result and a diffusion-LM reasoning paper, and the quiet week at the model layer is masking a very loud week underneath it.
Get more from AI Weekly
More signal, less noise — pick your channels.
You're reading the weekly brief. Below are the other ways to follow the story — every channel free, easy to leave.
- → Explore 16 deep dives — weekly topic-specific newsletters: Generative AI, Machine Learning, AI in Business, Robotics, Frontier Research, Geopolitics, Healthcare, and more. Browse all 16 deep dives →
- → Breaking AI alerts — when something major breaks (a $60B acquisition, a regulator's emergency meeting, a frontier model leak), alert subscribers know within hours. Typically 0-2 emails per day. Get breaking alerts →
- → AI News Today (live) — live dashboard updated as the scanner finds news: scored stories from the last 48 hours, weekly entity movers, and quarterly trend lines across 113 AI companies, people, and topics. Open AI News Today →
Watch & Listen First
- Machine Learning Street Talk — The AI Progress Chart Everyone Is Misreading (Beth Barnes & David Rein) — METR's authors on why the 2x error bars on the time-horizon graph matter more than the headline doubling rate
- Latent Space — Training Transformers to Solve the 95% Failure Rate of Cancer Trials (Ron Alfa & Daniel Bear, Noetik) — autoregressive transformer pretraining on tumor spatial transcriptomics; GSK's $50M licensing deal
- Latent Space — METR's Joel Becker on Exponential Time-Horizon Evals — companion listen on how the eval is built
Key Takeaways
- Treat the quadratic-attention floor as negotiable. SSA at 12M tokens and FA4 at 71% B200 utilization push the cost curve down at different points on the context axis. Pure dense softmax above 128k now has competitors on both flanks.
- Audit FlexAttention deployments for the FA4 backend. Automatic CuTeDSL generation for score/mask mods gives 1.2–3.2x over Triton on compute-bound workloads — free if you switch backends.
- Re-fit scaling-law extrapolations against Bayes-optimal. Adam-trained students at effective width hit theoretically optimal rates up to a small algorithmic gap; the curves you're fitting may be closer to a ceiling than assumed.
- Stop pinning diffusion-LM block sizes. RL on dynamic-size reasoning blocks just beat fixed-size masked diffusion baselines — block schedule is now a learnable, not a hyperparameter.
- Mechanistic interpretability claims need disclosed identification assumptions. Circuit and monosemanticity work uses causal vocabulary without the structural assumptions that make those claims identifiable. If a safety call rests on a circuit, write down what you assumed.
The Big Story
Subquadratic Launches SSA With a 12M-Token Context Window · May 5, 2026 · Subquadratic
→ The first venture-backed shop to put a fully subquadratic frontier-grade model into production. Subquadratic Selective Attention scales linearly in compute and memory with context, hits 92.1% on needle-in-a-haystack at 12M tokens, scores 83 on MRCR v2 (nine points above OpenAI's million-token baseline), and runs ~52x faster than dense attention at 1M tokens. Researchers are demanding independent replication of the 1000x efficiency claims, but a shipping coding agent and deep-research tool at this context profile make for a credible non-incumbent benchmark for serving-cost math above 1M tokens.
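The serving-cost math turns on asymptotics: dense softmax attention does O(n²·d) work per head, a linear-time mechanism does O(n·d²). A back-of-envelope sketch of that gap (the head dimension and cost constants are illustrative, not Subquadratic's actual kernel accounting; the raw FLOP ratio grows with n and overstates realized wall-clock speedups like the reported ~52x, since dense kernels are heavily optimized):

```python
def dense_attention_flops(n: int, d: int = 128) -> float:
    """Approximate FLOPs for one dense softmax attention head:
    QK^T (n*n*d multiply-adds) plus the attention-weighted V mix (n*n*d)."""
    return 2.0 * n * n * d

def linear_attention_flops(n: int, d: int = 128) -> float:
    """Approximate FLOPs for a linear-time mechanism: a per-token
    state update and readout, each costing roughly d*d work."""
    return 2.0 * n * d * d

# Ratio reduces to n/d, so it grows linearly with context length.
for n in (128_000, 1_000_000, 12_000_000):
    ratio = dense_attention_flops(n) / linear_attention_flops(n)
    print(f"n={n:>10,}  dense/linear FLOP ratio = {ratio:,.0f}x")
```

At 1M tokens the asymptotic ratio is orders of magnitude above 52x, which is why the realized number is a kernel-quality and memory-bandwidth story, not just a big-O story.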
Also This Week
FlashAttention-4 Publishes: 1605 TFLOPs/s on Blackwell BF16 · May 13, 2026 · Tri Dao blog
→ Pure CuTeDSL rewrite, asymmetric pipelining for matmul/softmax overlap, 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton — and on B200 the bottleneck is now SFU exp units and shared-memory traffic, not tensor cores. Open at Dao-AILab/flash-attention.
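The 71%-of-peak framing is simple division against the card's dense-math ceiling. A sanity-check sketch (the 2,250 TFLOP/s figure is an assumed dense BF16 peak for one B200 without sparsity, not a number from the post):

```python
ACHIEVED_TFLOPS = 1605.0   # FA4 BF16 attention throughput reported in the post
B200_BF16_PEAK = 2250.0    # assumed dense BF16 peak for one B200, no sparsity

# Model FLOPs utilization: achieved throughput over theoretical peak.
mfu = ACHIEVED_TFLOPS / B200_BF16_PEAK
print(f"utilization = {mfu:.1%}")  # prints "utilization = 71.3%"
```

When utilization sits this close to the tensor-core ceiling, the remaining headroom is exactly where the post points: SFU exp throughput and shared-memory traffic.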
FlexAttention Adds an FA4 Backend · PyTorch blog
→ Custom attention variants — ALiBi, sliding window, document masking, soft-capping — JIT-compile to FA4 kernels on Hopper and Blackwell with 1.2–3.2x gains over Triton. The 1000+ FlexAttention repos cleared their ceiling without code changes.
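FlexAttention's programming model is a user-supplied score modification applied to each attention logit before softmax, which the compiler then fuses into the kernel. A dependency-free sketch of those semantics, not the torch API: the callback mirrors the shape of PyTorch's `score_mod` but drops the batch/head indices, and the ALiBi-style slope is illustrative:

```python
import math

def attention_with_score_mod(q, k, v, score_mod):
    """Unfused O(n^2) reference semantics: score each query/key pair,
    apply score_mod to the logit, softmax over keys, mix values."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            s = sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            scores.append(score_mod(s, i, j))   # (score, q_idx, kv_idx)
        m = max(scores)                          # stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][t] for j in range(n)) / z
                    for t in range(d)])
    return out

# ALiBi-flavored mod: linear distance penalty (bidirectional here,
# and the 0.5 slope is illustrative, not a per-head ALiBi slope).
alibi = lambda score, q_idx, kv_idx: score - 0.5 * abs(q_idx - kv_idx)

q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention_with_score_mod(q, k, v, alibi)
```

The point of the FA4 backend is that a callback like this no longer pins you to the Triton-generated kernel's throughput; the same mod JIT-compiles to the faster CuTeDSL path.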
Break the Block: RL on Dynamic-Size Reasoning Blocks for Diffusion LMs · May 4, 2026 · arXiv:2605.02263
→ RL with monotonic entropy descent lets a masked diffusion LM choose its own block size at each step — the first credible recipe for closing the autoregressive gap on reasoning, not just decoding speed.
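The mechanics being learned can be caricatured as an entropy-gated block choice: decode a big block where the model is confident, a small one where it is not. This is a hypothetical greedy surrogate for the learned RL policy; the entropy budget, block cap, and toy distributions are all illustrative:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pick_block_size(token_dists, max_block=8, budget=1.5):
    """Greedy stand-in for a dynamic-block policy: grow the block
    while cumulative predictive entropy stays under a budget, so
    confident stretches decode in large blocks and uncertain ones
    fall back to small steps. The paper's RL recipe learns this
    decision instead of hard-coding it."""
    total, size = 0.0, 0
    for dist in token_dists[:max_block]:
        total += entropy(dist)
        if total > budget and size > 0:
            break
        size += 1
    return size

confident = [[0.97, 0.01, 0.01, 0.01]] * 8   # low entropy per token
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 8   # max entropy per token
print(pick_block_size(confident), pick_block_size(uncertain))  # prints "8 1"
```

A fixed block size forces both sequences above through the same schedule; making the size a decision variable is what turns it from a hyperparameter into a learnable.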
Sharp Feature-Learning Transitions and Bayes-Optimal Scaling Laws · May 11, 2026 · arXiv:2605.10395
→ Two regimes — feature-learning n^(1/(2β)-1) and refinement n^(-1) — with empirical confirmation that Adam-trained students at effective width hit optimal rates. Calibrate your scaling fits against this ceiling before specifying the next run.
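What a two-regime law looks like when you fit it: the loss falls at the feature-learning rate n^(1/(2β)-1) until a sharp transition, then at the Bayes-optimal refinement rate n^(-1). A sketch with illustrative constants (β and the transition point n_star are chosen for the demo, not taken from the paper's experiments):

```python
import math

def loss(n, beta=1.0, n_star=1e4):
    """Illustrative two-regime curve: feature-learning rate
    n^(1/(2*beta) - 1) before a sharp transition at n_star,
    refinement rate n^(-1) after, with the prefactor chosen so
    the curve is continuous at the transition."""
    feat_exp = 1.0 / (2.0 * beta) - 1.0      # -1/2 for beta = 1
    if n < n_star:
        return n ** feat_exp
    c = n_star ** feat_exp / n_star ** -1.0  # match value at n_star
    return c * n ** -1.0

def local_slope(n):
    """Log-log slope over a doubling of n: the exponent a fit would see."""
    return (math.log(loss(2 * n)) - math.log(loss(n))) / math.log(2)

print(local_slope(1e3), local_slope(1e6))  # -0.5 in regime 1, -1.0 in regime 2
```

The practical warning in the takeaway follows directly: a single power law fit across the transition averages the two exponents and mis-specifies the next run in either direction.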
From the Lab
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling · May 8, 2026 · arXiv:2605.08083
→ AutoTTS reframes test-time-compute research from hand-designing TTS heuristics to designing environments in which agents discover them automatically. If you're still hand-tuning verifier+sampler combos, the search space just became something you instrument.
Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims · May 8, 2026 · arXiv:2605.08012
→ Argues that circuit, mediator, and monosemanticity work uses causal vocabulary without the structural assumptions that make those claims identifiable. Internal interpretability reports grounding a safety decision now need an assumptions section, not just a methods section.
Worth Reading
- FlashAttention-4 gives the NVIDIA Blackwell platform its most optimized attention kernel yet — Lambda's practitioner write-up, with the utilization numbers and pipelining diagrams Tri Dao's blog leaves implicit
- VentureBeat: Miami startup Subquadratic claims 1,000x AI efficiency, researchers demand proof — Independent reporting with researcher pushback on the 1000x efficiency claim
Two attention rewrites in seven days — one architectural, one kernel-level. The slice of the field still treating dense softmax as the immovable middle of the stack is shrinking.