Three results this week argue the frontier is moving up a level: agentic systems doing original math, alignment delivered through documents not demonstrations, and inference-time strategies discovered automatically. None are scale stories — all three are about what you wrap around the model.
Watch & Listen First
- Machine Learning Street Talk — Beth Barnes and David Rein on the graph that ate the AI timelines discourse (YouTube, May 4)
- Latent Space — A Physicist on how AI is changing the way they do physics (Substack, May 5)
- Dwarkesh Podcast — David Reich on the Bronze Age inflection point (May 8)
Key Takeaways
- Math SOTA decoupled from the base model. Co-Mathematician's 48% on FrontierMath Tier 4 is a workflow result on Gemini 3 Deep Think — treat hard-benchmark SOTA as a system-level claim from here on.
- Document-based alignment is credible. Anthropic's SDF-on-constitution (96% → 0% blackmail, Opus 4 to Haiku 4.5+) shifts alignment R&D from demonstration-heavy SFT toward principles + character. Re-cost your safety pipelines.
- Test-time scaling strategies are search problems. AutoTTS discovered TTS controllers that beat hand-designed baselines for $39.90. The hand-tuned width/depth-schedule era is over.
- Reasoning failures are incoherent, not adversarial. Hot Mess keeps holding: as trajectories lengthen, error shifts systematic → random. Plan evals around incoherence, not goal-misalignment red-teaming.
- Latent diffusion LMs cleared the joint-training bar. Two papers in seven days beat discrete diffusion at 2–13× decode speed. The architecture race below the apex is no longer hypothetical.
The Big Story
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI · May 7, 2026 · arXiv:2605.06651 · DeepMind blog
→ 48% on FrontierMath Tier 4 — vs 22.9% for Claude Opus 4.7 — from a workbench wrapping Gemini 3 Deep Think, not a new base model. A project-coordinator agent dispatches specialists across parallel workstreams, keeps a failure-hypothesis log, emits LaTeX with provenance. Interesting pathology: "reviewer-pleasing bias" — critic and prover share a prior, so self-review converges on subtly wrong arguments. A named failure mode adjacent to sycophancy, worth its own evals. Oxford's Marc Lackenby closed an open Kourovka Notebook problem with it after the reviewer flagged a gap.
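The coordinator pattern described above can be sketched in a few lines. Everything here (the `FailureLog` shape, `run_workstreams`, the toy specialists) is our illustrative guess at the architecture, not DeepMind's code:

```python
# Illustrative sketch of a coordinator dispatching specialist agents and
# keeping a failure-hypothesis log. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class FailureLog:
    """The failure-hypothesis log the coordinator keeps per workstream."""
    entries: list = field(default_factory=list)

    def record(self, workstream: str, hypothesis: str) -> None:
        self.entries.append((workstream, hypothesis))

def run_workstreams(specialists, problem, log):
    """Dispatch each specialist on the problem; successes return results,
    failures deposit a hypothesis in the shared log."""
    results = {}
    for name, solve in specialists.items():
        ok, output = solve(problem)
        if ok:
            results[name] = output
        else:
            log.record(name, output)  # output carries the failure hypothesis
    return results

# Toy specialists standing in for prover/critic agents.
specialists = {
    "prover": lambda p: (True, f"candidate proof of {p}"),
    "numerics": lambda p: (False, "bound too weak near the boundary"),
}
log = FailureLog()
results = run_workstreams(specialists, "P1", log)
```

The parallel dispatch and LaTeX provenance pieces are omitted; the point is a failure log that outlives any single specialist run, which is what makes the "reviewer-pleasing bias" auditable at all.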
Also This Week
Teaching Claude Why — SDF on constitution + fiction collapses agentic blackmail · May 11 · Anthropic
→ SDF on a constitutional corpus plus fictional aligned-AI stories cut held-out agentic-blackmail rates by 3×, and cumulatively from 96% (Opus 4) to 0% (Haiku 4.5+), with a 3M-token "difficult advice" set giving ~28× sample efficiency over demonstrations.
The Hot Mess of AI: misalignment decomposed into bias and variance · May 2026 · alignment.anthropic.com · arXiv:2601.23045
→ Longer reasoning chains push errors from systematic toward stochastic; scale improves accuracy but not incoherence — safety cases built around coherent goal-pursuit are calibrating against the wrong distribution.
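One way to make the systematic-vs-stochastic split concrete: sample each question several times and charge the modal answer's error to bias, the scatter around the mode to variance. This estimator is our illustrative reading, not the paper's exact method:

```python
# Hypothetical bias/variance split over repeated samples per question.
from collections import Counter

def decompose_error(samples_per_question, gold):
    """Split error into a systematic part (the modal answer is wrong) and a
    stochastic part (samples disagree with their own mode)."""
    bias_err = var_err = 0.0
    for samples, truth in zip(samples_per_question, gold):
        mode, _ = Counter(samples).most_common(1)[0]
        bias_err += float(mode != truth)                   # consistent-wrong
        var_err += sum(s != mode for s in samples) / len(samples)  # scatter
    n = len(gold)
    return bias_err / n, var_err / n

# Q1: consistently right. Q2: wrong mode plus some scatter.
bias, var = decompose_error([[1, 1, 1], [0, 1, 0]], gold=[1, 1])
```

Under the paper's claim, lengthening trajectories moves mass from the first return value to the second, which is exactly the regime goal-misalignment red-teaming does not probe.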
AI for Math Initiative — DeepMind + five institutions · May 12 · Google
→ Co-launched with Imperial, IAS, IHÉS, Simons, and Tata — the institutional scaffolding under Co-Mathematician; math-AI is getting its own venue stack.
World Models named to MIT Tech Review's 10 Things That Matter in AI · May 12 · MIT Tech Review
→ Analyst-consensus moment for the post-LLM track — V-JEPA, Genie, Cosmos, and LeCun's AMI Labs as a research program with its own data, evals, and scaling curve.
From the Lab
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling (AutoTTS) · Zheng et al., May 8 · arXiv:2605.08083
→ Formulates width-depth TTS as controller synthesis over pre-collected reasoning trajectories — an agent picks branch/continue/probe/prune/stop without repeated LLM calls, with beta parameterization making search tractable. Discovered controllers beat hand-designed baselines on math benchmarks and generalize to held-out tasks and model scales. Total discovery cost: $39.90 and 160 minutes. The hand-tuned TTS heuristic just became a regression line.
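A stripped-down version of the idea, using only a continue/prune/stop subset of the action space and a threshold controller of our own invention (the paper's controllers and reward are richer than this):

```python
# Hypothetical sketch: search for a TTS controller over pre-scored
# trajectories, so evaluation needs no new LLM calls.
def controller_action(theta, score, depth, max_depth=8):
    """Threshold controller: stop when confident, prune when weak."""
    low, high = theta
    if depth >= max_depth or score >= high:
        return "stop"
    if score < low:
        return "prune"
    return "continue"

def evaluate(theta, trajectories):
    """Offline reward: 1 if the controller stops on a correct step."""
    reward = 0
    for traj in trajectories:               # traj: [(score, is_correct), ...]
        for depth, (score, ok) in enumerate(traj):
            act = controller_action(theta, score, depth)
            if act == "stop":
                reward += ok
                break
            if act == "prune":
                break
    return reward / len(trajectories)

trajectories = [[(0.3, 0), (0.9, 1)], [(0.1, 0)]]
grid = [(0.2, 0.8), (0.05, 0.95)]
best = max(grid, key=lambda t: evaluate(t, trajectories))
```

The economics follow directly: once trajectories are cached, each candidate controller costs microseconds, so a $39.90 search budget buys an enormous number of evaluations.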
How to Train Your Latent Diffusion Language Model Jointly With the Latent Space (LDLM) · May 8 · arXiv:2605.07933
→ Trains the latent encoder, diffusion model, and decoder jointly — the image-diffusion recipe that text models had not previously pulled off cleanly. Beats discrete and continuous diffusion baselines at 2–13× decode speed. With contemporaneous Cola DLM (arXiv:2605.06548), latent diffusion is now the most credible non-autoregressive track.
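What "jointly" means mechanically, in a toy scalar sketch with notation of our own choosing (the paper's parameterization will differ): both loss terms see `enc(x)`, so gradients from the denoising objective reach the encoder rather than training the latent space in a frozen first stage.

```python
# Hypothetical joint objective: autoencoder reconstruction plus a
# latent-space denoising term, sharing the encoder.
import math

def joint_loss(x, enc, dec, eps_pred, t, noise, lam=1.0):
    """Reconstruction term shapes the latent; denoising term trains the
    diffusion model in that same, still-moving latent space."""
    z = enc(x)
    z_t = math.sqrt(1 - t) * z + math.sqrt(t) * noise   # noised latent
    recon = (dec(z) - x) ** 2                           # decoder term
    denoise = (eps_pred(z_t, t) - noise) ** 2           # diffusion term
    return recon + lam * denoise

# A perfectly invertible toy autoencoder drives the reconstruction term
# to zero, leaving only the denoising term.
loss = joint_loss(1.0, enc=lambda x: 2 * x, dec=lambda z: z / 2,
                  eps_pred=lambda z, t: 0.0, t=0.5, noise=0.7)
```

The instability the two-stage recipe avoids is visible even here: as `enc` updates, the distribution `eps_pred` must denoise moves underneath it, which is the balancing act joint training has to win.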
Superposition Is Not Necessary: Mechanistic Interpretability of Transformer Representations for Time Series Forecasting · May 6 · arXiv:2605.05151
→ SAEs on PatchTST find the forecasting transformer doesn't rely on superposition — a counterexample to the assumption that SAE decomposition only makes sense for densely polysemantic LLM features. The "superposition is fundamental" framing needs caveats outside language.
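For readers new to the tooling, the probe itself is small. A minimal SAE forward pass of the kind applied to transformer activations, as a generic sketch rather than the paper's exact configuration:

```python
# Generic sparse-autoencoder forward pass (illustrative, not the paper's
# hyperparameters): ReLU features, linear decoder, MSE + L1 loss.
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, l1=1e-3):
    """Encode activations into (hopefully) sparse features, reconstruct,
    and score with MSE plus an L1 penalty that rewards sparsity."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # feature activations
    x_hat = f @ W_dec                        # linear reconstruction
    loss = np.mean((x_hat - x) ** 2) + l1 * np.abs(f).sum()
    return f, x_hat, loss

# Identity weights: perfect reconstruction, so loss reduces to the L1 term.
x = np.array([[1.0, 0.0]])
I = np.eye(2)
f, x_hat, loss = sae_forward(x, I, np.zeros(2), I)
```

The paper's finding, in these terms: for PatchTST the learned `W_dec` directions line up with individual activation dimensions rather than tightly packed overlapping features, so the superposition story that motivates SAEs in LLMs simply is not needed there.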
Worth Reading
- Navigating by Old Maps: Pitfalls of Static Mechanistic Localization in LLM Post-Training — locate-then-update fails when post-training shifts parameters under the located circuits; read before shipping interpretability-guided edits
- Anthropic's "Teaching Claude Why" research note — public-facing companion to the alignment blog; cleaner on the SDF recipe
- AutoTTS project page + code — cheapest way to see if your benchmark gives up new TTS controllers under the same search
Frontier results aren't arriving as model-card numbers anymore — they arrive as workflows, supervision recipes, and discovered controllers wrapped around a base that hasn't moved in three months. Whoever masters the wrapper sets the next benchmark.