This week hit quadratic attention from two directions at once. A Miami startup shipped a production model running ~52x faster than dense attention at 1M tokens via a linear-time mechanism, and Tri Dao published a CuTeDSL rewrite that pushes Blackwell B200 BF16 attention to 71% of peak for the workloads still wedded to softmax. Add a Bayes-optimal scaling-law result and a diffusion-LM reasoning paper, and a quiet week at the model layer is masking a very loud week underneath it.

Get more from AI Weekly

More signal, less noise — pick your channels.

You're reading the weekly brief. Below are the other ways to follow the story — every channel free, easy to leave.

  • → Explore 16 deep dives
    Weekly topic-specific newsletters: Generative AI, Machine Learning, AI in Business, Robotics, Frontier Research, Geopolitics, Healthcare, and more.
    Browse all 16 deep dives →
  • → Breaking AI alerts
    When something major breaks (a $60B acquisition, a regulator's emergency meeting, a frontier model leak), alert subscribers know within hours. Typically 0-2 emails per day.
    Get breaking alerts →
  • → AI News Today (live)
    Live dashboard updated as the scanner finds news: scored stories from the last 48 hours, weekly entity movers, and quarterly trend lines across 113 AI companies, people, and topics.
    Open AI News Today →

Key Takeaways

  • Treat the quadratic-attention floor as negotiable. SSA at 12M tokens and FA4 at 71% B200 utilization push the cost curve down at different points on the context axis. Pure dense softmax above 128k now has competitors on both flanks.
  • Audit FlexAttention deployments for the FA4 backend. Automatic CuTeDSL generation for score/mask mods gives 1.2–3.2x over Triton on compute-bound workloads — free if you switch backends.
  • Re-fit scaling-law extrapolations against the Bayes-optimal rates. Adam-trained students at effective width hit theoretically optimal rates up to a small algorithmic gap; the curves you're fitting may be closer to a ceiling than assumed.
  • Stop pinning diffusion-LM block sizes. RL on dynamic-size reasoning blocks just beat fixed-size masked diffusion baselines — block schedule is now a learnable, not a hyperparameter.
  • Mechanistic interpretability claims need disclosed identification assumptions. Circuit and monosemanticity work uses causal vocabulary without the structural assumptions that make those claims identifiable. If a safety call rests on a circuit, write down what you assumed.

The Big Story

Subquadratic Launches SSA With a 12M-Token Context Window · May 5, 2026 · Subquadratic
Subquadratic is the first venture-backed shop to put a fully subquadratic, frontier-grade model into production. Subquadratic Selective Attention scales linearly in compute and memory with context, hits 92.1% on needle-in-a-haystack at 12M tokens, scores 83 on MRCR v2 (nine points above OpenAI's million-token baseline), and runs ~52x faster than dense attention at 1M tokens. Researchers are demanding independent replication of the 1000x efficiency claims, but a shipping coding agent and deep-research tool at this context profile give serving-cost math above 1M tokens a credible non-incumbent benchmark.
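
Why the serving-cost math only breaks above ~1M tokens is the quadratic term itself. A back-of-the-envelope sketch, not a model of SSA (whose mechanism isn't public): dense softmax does attention work proportional to N^2, a linear-time mechanism proportional to N, so the attention-only gap grows with context. The constants below are illustrative, and end-to-end speedups like the reported ~52x are much smaller than the attention-only ratio because projections, MLPs, and memory traffic don't shrink.

```python
# Back-of-the-envelope attention cost vs. context length.
# Illustrative constants only; this is not SSA's actual architecture.
d = 128       # per-head dimension (assumed)
heads = 64    # head count (assumed)

def dense_attn_flops(n):
    # QK^T plus attn @ V: ~4 * N^2 * d FLOPs per head per layer
    return 4 * n * n * d * heads

def linear_attn_flops(n, c=8):
    # Generic linear-time mechanism: ~c * N * d FLOPs per head per layer
    return c * n * d * heads

for n in (128_000, 1_000_000, 12_000_000):
    ratio = dense_attn_flops(n) / linear_attn_flops(n)
    print(f"{n:>12,} tokens: attention-only dense/linear ratio ~ {ratio:,.0f}x")

# The ratio grows linearly in N, so whatever the true constants are, the gap
# at 12M tokens is ~12x what it is at 1M; contexts beyond 1M are where dense
# softmax pricing falls apart first.
```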


Also This Week

FlashAttention-4 Publishes: 1605 TFLOPS on Blackwell BF16 · May 13, 2026 · Tri Dao blog
Pure CuTeDSL rewrite, asymmetric pipelining for matmul/softmax overlap, 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton — and on B200 the bottleneck is now SFU exp units and shared-memory traffic, not tensor cores. Open at Dao-AILab/flash-attention.
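
That throughput squares with the 71%-of-peak figure in the lead. Quick sanity check, assuming NVIDIA's commonly quoted ~2.25 PFLOPS dense BF16 tensor-core peak per B200 (the 4.5 PFLOPS spec-sheet number is with 2:4 sparsity):

```python
# Utilization implied by the reported kernel throughput.
# Assumption: B200 dense BF16 tensor-core peak ~ 2250 TFLOPS.
peak_tflops = 2250
achieved_tflops = 1605
print(f"{achieved_tflops / peak_tflops:.1%}")   # ~71.3% of dense BF16 peak
```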

FlexAttention Adds an FA4 Backend · PyTorch blog
Custom attention variants (ALiBi, sliding window, document masking, soft-capping) JIT-compile to FA4 kernels on Hopper and Blackwell with 1.2–3.2x gains over Triton. The 1000+ repos already built on FlexAttention clear their old performance ceiling without code changes.
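
The programming model itself doesn't change, which is what makes the audit cheap: the same score_mod/mask_mod code compiles as before, and per the post the FA4 kernels are picked up at compile time on supported hardware. A minimal sliding-window-causal plus ALiBi sketch (shapes and slopes illustrative; backend selection is the post's claim, not something this snippet controls):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D, WINDOW = 2, 8, 4096, 64, 512
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))
slopes = torch.exp2(-torch.arange(1, H + 1, device="cuda").float())  # per-head ALiBi slopes

def alibi(score, b, h, q_idx, kv_idx):
    # ALiBi: penalize scores by query-key distance, scaled per head
    return score - slopes[h] * (q_idx - kv_idx)

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Causal attention restricted to a local window of WINDOW tokens
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
attn = torch.compile(flex_attention)   # kernel selection happens under torch.compile
out = attn(q, k, v, score_mod=alibi, block_mask=block_mask)
```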

Break the Block: RL on Dynamic-Size Reasoning Blocks for Diffusion LMs · May 4, 2026 · arXiv:2605.02263
RL with monotonic entropy descent lets a masked diffusion LM choose its own block size at each step — the first credible recipe for closing the autoregressive gap on reasoning, not just decoding speed.
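
The change is in the decode loop rather than the architecture: fixed-block masked diffusion commits a hyperparameter-sized block per outer step, while here a learned policy picks the size from the current state. A schematic sketch of that difference only, with `policy` and `denoise_block` as hypothetical stand-ins; the paper's RL recipe and its monotonic entropy-descent constraint are not reproduced here.

```python
# Schematic decode loops for a block-wise masked diffusion LM.
# `model.denoise_block(seq, start, size)` and `policy(seq, start)` are
# hypothetical stand-ins, not APIs from the paper.

def decode_fixed(model, seq, block_size):
    pos = 0
    while pos < len(seq):
        model.denoise_block(seq, pos, block_size)  # block_size is a tuned hyperparameter
        pos += block_size
    return seq

def decode_dynamic(model, policy, seq):
    pos = 0
    while pos < len(seq):
        size = policy(seq, pos)               # learned: small blocks where reasoning is hard,
        model.denoise_block(seq, pos, size)   # larger blocks where the text is easy
        pos += size
    return seq
```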

Sharp Feature-Learning Transitions and Bayes-Optimal Scaling Laws · May 11, 2026 · arXiv:2605.10395
Two regimes — feature-learning n^(1/(2β)-1) and refinement n^(-1) — with empirical confirmation that Adam-trained students at effective width hit optimal rates. Calibrate your scaling fits against this ceiling before specifying the next run.
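
"Calibrate against this ceiling" is a short exercise: extract your empirical loss-vs-samples exponent and put it next to the two quoted rates. The snippet assumes you already have (n, loss) pairs and that β is the regularity parameter the paper's setup assigns to your task; the data points and β value below are illustrative.

```python
import numpy as np

# Empirical scaling exponent from (num_samples, loss) pairs via a log-log fit.
n = np.array([1e6, 4e6, 1.6e7, 6.4e7, 2.56e8])
loss = np.array([2.31, 1.94, 1.66, 1.45, 1.29])   # illustrative numbers
slope, _ = np.polyfit(np.log(n), np.log(loss), 1)

# Theoretical rates quoted above (beta is the task's regularity parameter in
# the paper's setup; the value here is illustrative).
beta = 1.5
feature_learning_exp = 1 / (2 * beta) - 1   # n^(1/(2*beta) - 1)
refinement_exp = -1.0                       # n^(-1)

print(f"empirical exponent      : {slope:+.3f}")
print(f"feature-learning ceiling: {feature_learning_exp:+.3f}")
print(f"refinement ceiling      : {refinement_exp:+.3f}")
# An empirical exponent already near the relevant ceiling means the next run
# buys less than a naive single-power-law extrapolation would suggest.
```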


From the Lab

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling · May 8, 2026 · arXiv:2605.08083
AutoTTS reframes test-time-compute research from hand-designing TTS heuristics to designing environments in which agents discover them automatically. If you're still hand-tuning verifier+sampler combos, the search space just became something you instrument.
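
For scale, the kind of hand-designed heuristic AutoTTS targets is usually a few lines of glue: sample N candidates, score them with a verifier, return the best, with N, temperature, and the aggregation rule all hand-picked. A generic best-of-N sketch; `generate` and `verifier` are placeholder callables, not components from the paper.

```python
# Hand-tuned test-time scaling baseline: best-of-N with a verifier.
# `generate(prompt, temperature)` and `verifier(prompt, answer)` are
# placeholder callables, not AutoTTS components.

def best_of_n(prompt, generate, verifier, n=16, temperature=0.8):
    candidates = [generate(prompt, temperature) for _ in range(n)]
    best = max(candidates, key=lambda c: verifier(prompt, c))
    return best   # n, temperature, and "max vs. majority vote" are exactly
                  # the knobs an agentic search would sweep instead of you
```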

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims · May 8, 2026 · arXiv:2605.08012
Argues that circuit, mediator, and monosemanticity work uses causal vocabulary without the structural assumptions that make those claims identifiable. Internal interpretability reports grounding a safety decision now need an assumptions section, not just a methods section.
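
In practice the disclosure can sit next to the experiment. A schematic activation-patching run with its identification assumptions written inline; the hook mechanics are standard PyTorch, while the model, layer, and metric are placeholders.

```python
import torch

# Activation patching: cache one layer's activation on a clean prompt, splice
# it into a corrupted run, and measure the change in an output metric.
# Reading the result causally leans on assumptions worth stating, e.g.:
#   A1: the patch changes only the targeted activation (surgical intervention)
#   A2: no unmeasured path outside the patched site mediates the effect
#   A3: the clean/corrupted pair differs only in the feature under study

def patched_metric(model, layer, clean_inputs, corrupted_inputs, metric):
    cache = {}
    handle = layer.register_forward_hook(lambda m, i, out: cache.update(act=out))
    with torch.no_grad():
        model(**clean_inputs)              # first pass: cache the clean activation
    handle.remove()

    handle = layer.register_forward_hook(lambda m, i, out: cache["act"])
    with torch.no_grad():
        patched = metric(model(**corrupted_inputs))   # second pass: patch it in
    handle.remove()
    return patched
```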


Two attention rewrites in seven days — one architectural, one kernel-level. The slice of the field still treating dense softmax as the unmovable middle of the stack is shrinking.