Two blows landed on quadratic attention this week, from opposite directions. A Miami startup shipped a production model running 52x faster than dense attention at 1M tokens via a linear-time mechanism, and Tri Dao published a CuTeDSL rewrite that pushes Blackwell B200 BF16 attention to 71% of peak for the workloads still wedded to softmax. Add a Bayes-optimal scaling-law result and a diffusion-LM reasoning paper, and the quiet week at the model layer is masking a very loud week underneath it.
Get more from AI Weekly
More signal, less noise — pick your channels.
You're reading the weekly brief. Below are the other ways to follow the story — every channel free, easy to leave.
- → Explore 16 deep dives — weekly topic-specific newsletters: Generative AI, Machine Learning, AI in Business, Robotics, Frontier Research, Geopolitics, Healthcare, and more. Browse all 16 deep dives →
- → Breaking AI alerts — when something major breaks (a $60B acquisition, a regulator's emergency meeting, a frontier model leak), alert subscribers know within hours. Typically 0-2 emails per day. Get breaking alerts →
- → AI News Today (live) — live dashboard updated as the scanner finds news: scored stories from the last 48 hours, weekly entity movers, and quarterly trend lines across 113 AI companies, people, and topics. Open AI News Today →
Watch & Listen First
- Machine Learning Street Talk — The AI Progress Chart Everyone Is Misreading (Beth Barnes & David Rein) — METR's authors on why the 2x error bars on the time-horizon graph matter more than the headline doubling rate
- Latent Space — Training Transformers to Solve the 95% Failure Rate of Cancer Trials (Ron Alfa & Daniel Bear, Noetik) — autoregressive transformer pretraining on tumor spatial transcriptomics; GSK's $50M licensing deal
- Latent Space — METR's Joel Becker on Exponential Time-Horizon Evals — companion listen on how the eval is built
Key Takeaways
- Treat the quadratic-attention floor as negotiable. SSA at 12M tokens and FA4 at 71% B200 utilization push the cost curve down at different points on the context axis. Pure dense softmax above 128k now has competitors on both flanks.
- Audit FlexAttention deployments for the FA4 backend. Automatic CuTeDSL generation for score/mask mods gives 1.2–3.2x over Triton on compute-bound workloads — free if you switch backends.
- Re-fit scaling-law extrapolations against Bayes-optimal. Adam-trained students at effective width hit theoretically optimal rates up to a small algorithmic gap; the curves you're fitting may be closer to a ceiling than assumed.
- Stop pinning diffusion-LM block sizes. RL on dynamic-size reasoning blocks just beat fixed-size masked diffusion baselines — block schedule is now a learnable, not a hyperparameter.
- Mechanistic interpretability claims need disclosed identification assumptions. Circuit and monosemanticity work uses causal vocabulary without the structural assumptions that make those claims identifiable. If a safety call rests on a circuit, write down what you assumed.
The Big Story
Subquadratic Launches SSA With a 12M-Token Context Window · May 5, 2026 · Subquadratic
→ The first venture-backed shop to put a fully subquadratic frontier-grade model into production. Subquadratic Selective Attention scales linearly in compute and memory with context, hits 92.1% on needle-in-a-haystack at 12M tokens, scores 83 on MRCR v2 (nine points above OpenAI's million-token baseline), and runs ~52x faster than dense attention at 1M tokens. Researchers are demanding independent replication of the 1000x efficiency claims, but a shipping coding agent and deep-research tool at this context profile make for a credible non-incumbent benchmark for serving-cost math above 1M tokens.
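The serving-cost math turns on asymptotics: dense softmax attention does O(n²·d) work per head, a linear-time mechanism does O(n·d²). A back-of-envelope sketch of that gap (the head dimension and cost constants are illustrative, not Subquadratic's actual kernel accounting; the raw FLOP ratio grows with n and overstates realized wall-clock speedups like the reported ~52x, since dense kernels are heavily optimized):

```python
def dense_attention_flops(n: int, d: int = 128) -> float:
    """Approximate FLOPs for one dense softmax attention head:
    QK^T (n*n*d multiply-adds) plus the attention-weighted V mix (n*n*d)."""
    return 2.0 * n * n * d

def linear_attention_flops(n: int, d: int = 128) -> float:
    """Approximate FLOPs for a linear-time mechanism: a per-token
    state update and readout, each costing roughly d*d work."""
    return 2.0 * n * d * d

# Ratio reduces to n/d, so it grows linearly with context length.
for n in (128_000, 1_000_000, 12_000_000):
    ratio = dense_attention_flops(n) / linear_attention_flops(n)
    print(f"n={n:>10,}  dense/linear FLOP ratio = {ratio:,.0f}x")
```

At 1M tokens the asymptotic ratio is orders of magnitude above 52x, which is why the realized number is a kernel-quality and memory-bandwidth story, not just a big-O story.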
Also This Week
FlashAttention-4 Publishes: 1605 TFLOPs/s on Blackwell BF16 · May 13, 2026 · Tri Dao blog
→ Pure CuTeDSL rewrite, asymmetric pipelining for matmul/softmax overlap, 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton — and on B200 the bottleneck is now SFU exp units and shared-memory traffic, not tensor cores. Open at Dao-AILab/flash-attention.
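The 71%-of-peak framing is simple division against the card's dense-math ceiling. A sanity-check sketch (the 2,250 TFLOP/s figure is an assumed dense BF16 peak for one B200 without sparsity, not a number from the post):

```python
ACHIEVED_TFLOPS = 1605.0   # FA4 BF16 attention throughput reported in the post
B200_BF16_PEAK = 2250.0    # assumed dense BF16 peak for one B200, no sparsity

# Model FLOPs utilization: achieved throughput over theoretical peak.
mfu = ACHIEVED_TFLOPS / B200_BF16_PEAK
print(f"utilization = {mfu:.1%}")  # prints "utilization = 71.3%"
```

When utilization sits this close to the tensor-core ceiling, the remaining headroom is exactly where the post points: SFU exp throughput and shared-memory traffic.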
FlexAttention Adds an FA4 Backend · PyTorch blog
→ Custom attention variants — ALiBi, sliding window, document masking, soft-capping — JIT-compile to FA4 kernels on Hopper and Blackwell with 1.2–3.2x gains over Triton. The 1000+ FlexAttention repos cleared their ceiling without code changes.
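FlexAttention's programming model is a user-supplied score modification applied to each attention logit before softmax, which the compiler then fuses into the kernel. A dependency-free sketch of those semantics, not the torch API: the callback mirrors the shape of PyTorch's `score_mod` but drops the batch/head indices, and the ALiBi-style slope is illustrative:

```python
import math

def attention_with_score_mod(q, k, v, score_mod):
    """Unfused O(n^2) reference semantics: score each query/key pair,
    apply score_mod to the logit, softmax over keys, mix values."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            s = sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            scores.append(score_mod(s, i, j))   # (score, q_idx, kv_idx)
        m = max(scores)                          # stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][t] for j in range(n)) / z
                    for t in range(d)])
    return out

# ALiBi-flavored mod: linear distance penalty (bidirectional here,
# and the 0.5 slope is illustrative, not a per-head ALiBi slope).
alibi = lambda score, q_idx, kv_idx: score - 0.5 * abs(q_idx - kv_idx)

q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention_with_score_mod(q, k, v, alibi)
```

The point of the FA4 backend is that a callback like this no longer pins you to the Triton-generated kernel's throughput; the same mod JIT-compiles to the faster CuTeDSL path.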
Break the Block: RL on Dynamic-Size Reasoning Blocks for Diffusion LMs · May 4, 2026 · arXiv:2605.02263
→ RL with monotonic entropy descent lets a masked diffusion LM choose its own block size at each step — the first credible recipe for closing the autoregressive gap on reasoning, not just decoding speed.
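The mechanics being learned can be caricatured as an entropy-gated block choice: decode a big block where the model is confident, a small one where it is not. This is a hypothetical greedy surrogate for the learned RL policy; the entropy budget, block cap, and toy distributions are all illustrative:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pick_block_size(token_dists, max_block=8, budget=1.5):
    """Greedy stand-in for a dynamic-block policy: grow the block
    while cumulative predictive entropy stays under a budget, so
    confident stretches decode in large blocks and uncertain ones
    fall back to small steps. The paper's RL recipe learns this
    decision instead of hard-coding it."""
    total, size = 0.0, 0
    for dist in token_dists[:max_block]:
        total += entropy(dist)
        if total > budget and size > 0:
            break
        size += 1
    return size

confident = [[0.97, 0.01, 0.01, 0.01]] * 8   # low entropy per token
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 8   # max entropy per token
print(pick_block_size(confident), pick_block_size(uncertain))  # prints "8 1"
```

A fixed block size forces both sequences above through the same schedule; making the size a decision variable is what turns it from a hyperparameter into a learnable.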
Sharp Feature-Learning Transitions and Bayes-Optimal Scaling Laws · May 11, 2026 · arXiv:2605.10395
→ Two regimes — feature-learning n^(1/(2β)-1) and refinement n^(-1) — with empirical confirmation that Adam-trained students at effective width hit optimal rates. Calibrate your scaling fits against this ceiling before specifying the next run.
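What a two-regime law looks like when you fit it: the loss falls at the feature-learning rate n^(1/(2β)-1) until a sharp transition, then at the Bayes-optimal refinement rate n^(-1). A sketch with illustrative constants (β and the transition point n_star are chosen for the demo, not taken from the paper's experiments):

```python
import math

def loss(n, beta=1.0, n_star=1e4):
    """Illustrative two-regime curve: feature-learning rate
    n^(1/(2*beta) - 1) before a sharp transition at n_star,
    refinement rate n^(-1) after, with the prefactor chosen so
    the curve is continuous at the transition."""
    feat_exp = 1.0 / (2.0 * beta) - 1.0      # -1/2 for beta = 1
    if n < n_star:
        return n ** feat_exp
    c = n_star ** feat_exp / n_star ** -1.0  # match value at n_star
    return c * n ** -1.0

def local_slope(n):
    """Log-log slope over a doubling of n: the exponent a fit would see."""
    return (math.log(loss(2 * n)) - math.log(loss(n))) / math.log(2)

print(local_slope(1e3), local_slope(1e6))  # -0.5 in regime 1, -1.0 in regime 2
```

The practical warning in the takeaway follows directly: a single power law fit across the transition averages the two exponents and mis-specifies the next run in either direction.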
From the Lab
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling · May 8, 2026 · arXiv:2605.08083
→ AutoTTS reframes test-time-compute research from hand-designing TTS heuristics to designing environments in which agents discover them automatically. If you're still hand-tuning verifier+sampler combos, the search space just became something you instrument.
Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims · May 8, 2026 · arXiv:2605.08012
→ Argues that circuit, mediator, and monosemanticity work uses causal vocabulary without the structural assumptions that make those claims identifiable. Internal interpretability reports grounding a safety decision now need an assumptions section, not just a methods section.
Worth Reading
- FlashAttention-4 gives the NVIDIA Blackwell platform its most optimized attention kernel yet — Lambda's practitioner write-up, with the utilization numbers and pipelining diagrams Tri Dao's blog leaves implicit
- VentureBeat: Miami startup Subquadratic claims 1,000x AI efficiency, researchers demand proof — Independent reporting with researcher pushback on the 1000x efficiency claim
Two attention rewrites in seven days — one architectural, one kernel-level. The slice of the field still treating dense softmax as the immovable middle of the stack is shrinking.