Three results this week argue the frontier is moving up a level: agentic systems doing original math, alignment delivered through documents not demonstrations, and inference-time strategies discovered automatically. None are scale stories — all three are about what you wrap around the model.
Watch & Listen First
- Machine Learning Street Talk — Beth Barnes and David Rein on the graph that ate the AI timelines discourse (YouTube, May 4)
- Latent Space — A Physicist on how AI is changing the way they do physics (Substack, May 5)
- Dwarkesh Podcast — David Reich on the Bronze Age inflection point (May 8)
Key Takeaways
- Math SOTA decoupled from the base model. Co-Mathematician's 48% on FrontierMath Tier 4 is a workflow result on Gemini 3 Deep Think — treat hard-benchmark SOTA as a system-level claim from here on.
- Document-based alignment is credible. Anthropic's SDF-on-constitution (96% → 0% blackmail, Opus 4 to Haiku 4.5+) shifts alignment R&D from demonstration-heavy SFT toward principles + character. Re-cost your safety pipelines.
- Test-time scaling strategies are search problems. AutoTTS discovered TTS controllers that beat hand-designed baselines for $39.90. The hand-tuned width/depth-schedule era is over.
- Reasoning failures are incoherent, not adversarial. Hot Mess keeps holding: as trajectories lengthen, error shifts systematic → random. Plan evals around incoherence, not goal-misalignment red-teaming.
- Latent diffusion LMs cleared the joint-training bar. Two papers in seven days beat discrete diffusion at 2–13× decode speed. The architecture race below the apex is no longer hypothetical.
The Big Story
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI · May 7, 2026 · arXiv:2605.06651 · DeepMind blog
→ 48% on FrontierMath Tier 4 — vs 22.9% for Claude Opus 4.7 — from a workbench wrapping Gemini 3 Deep Think, not a new base model. A project-coordinator agent dispatches specialists across parallel workstreams, keeps a failure-hypothesis log, emits LaTeX with provenance. Interesting pathology: "reviewer-pleasing bias" — critic and prover share a prior, so self-review converges on subtly wrong arguments. A named failure mode adjacent to sycophancy, worth its own evals. Oxford's Marc Lackenby closed an open Kourovka Notebook problem with it after the reviewer flagged a gap.
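The coordinator pattern described above can be sketched in a few lines. Everything here (the `FailureLog` shape, `run_workstreams`, the toy specialists) is our illustrative guess at the architecture, not DeepMind's code:

```python
# Illustrative sketch of a coordinator dispatching specialist agents and
# keeping a failure-hypothesis log. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class FailureLog:
    """The failure-hypothesis log the coordinator keeps per workstream."""
    entries: list = field(default_factory=list)

    def record(self, workstream: str, hypothesis: str) -> None:
        self.entries.append((workstream, hypothesis))

def run_workstreams(specialists, problem, log):
    """Dispatch each specialist on the problem; successes return results,
    failures deposit a hypothesis in the shared log."""
    results = {}
    for name, solve in specialists.items():
        ok, output = solve(problem)
        if ok:
            results[name] = output
        else:
            log.record(name, output)  # output carries the failure hypothesis
    return results

# Toy specialists standing in for prover/critic agents.
specialists = {
    "prover": lambda p: (True, f"candidate proof of {p}"),
    "numerics": lambda p: (False, "bound too weak near the boundary"),
}
log = FailureLog()
results = run_workstreams(specialists, "P1", log)
```

The parallel dispatch and LaTeX provenance pieces are omitted; the point is a failure log that outlives any single specialist run, which is what makes the "reviewer-pleasing bias" auditable at all.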
Also This Week
Teaching Claude Why — SDF on constitution + fiction collapses agentic blackmail · May 11 · Anthropic
→ SDF on a constitutional corpus plus fictional aligned-AI stories cut held-out agentic-blackmail rates by 3×, and cumulatively from 96% (Opus 4) to 0% (Haiku 4.5+), with a 3M-token "difficult advice" set giving ~28× sample efficiency over demonstrations.
The Hot Mess of AI: misalignment decomposed into bias and variance · May 2026 · alignment.anthropic.com · arXiv:2601.23045
→ Longer reasoning chains push errors from systematic toward stochastic; scale improves accuracy but not incoherence — safety cases built around coherent goal-pursuit are calibrating against the wrong distribution.
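One way to make the systematic-vs-stochastic split concrete: sample each question several times and charge the modal answer's error to bias, the scatter around the mode to variance. This estimator is our illustrative reading, not the paper's exact method:

```python
# Hypothetical bias/variance split over repeated samples per question.
from collections import Counter

def decompose_error(samples_per_question, gold):
    """Split error into a systematic part (the modal answer is wrong) and a
    stochastic part (samples disagree with their own mode)."""
    bias_err = var_err = 0.0
    for samples, truth in zip(samples_per_question, gold):
        mode, _ = Counter(samples).most_common(1)[0]
        bias_err += float(mode != truth)                   # consistent-wrong
        var_err += sum(s != mode for s in samples) / len(samples)  # scatter
    n = len(gold)
    return bias_err / n, var_err / n

# Q1: consistently right. Q2: wrong mode plus some scatter.
bias, var = decompose_error([[1, 1, 1], [0, 1, 0]], gold=[1, 1])
```

Under the paper's claim, lengthening trajectories moves mass from the first return value to the second, which is exactly the regime goal-misalignment red-teaming does not probe.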
AI for Math Initiative — DeepMind + five institutions · May 12 · Google
→ Co-launched with Imperial, IAS, IHÉS, Simons, and Tata — the institutional scaffolding under Co-Mathematician; math-AI is getting its own venue stack.
World Models named to MIT Tech Review's 10 Things That Matter in AI · May 12 · MIT Tech Review
→ Analyst-consensus moment for the post-LLM track — V-JEPA, Genie, Cosmos, and LeCun's AMI Labs as a research program with its own data, evals, and scaling curve.
From the Lab
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling (AutoTTS) · Zheng et al., May 8 · arXiv:2605.08083
→ Formulates width-depth TTS as controller synthesis over pre-collected reasoning trajectories — an agent picks branch/continue/probe/prune/stop without repeated LLM calls, with beta parameterization making search tractable. Discovered controllers beat hand-designed baselines on math benchmarks and generalize to held-out tasks and model scales. Total discovery cost: $39.90 and 160 minutes. The hand-tuned TTS heuristic just became a regression line.
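A stripped-down version of the idea, using only a continue/prune/stop subset of the action space and a threshold controller of our own invention (the paper's controllers and reward are richer than this):

```python
# Hypothetical sketch: search for a TTS controller over pre-scored
# trajectories, so evaluation needs no new LLM calls.
def controller_action(theta, score, depth, max_depth=8):
    """Threshold controller: stop when confident, prune when weak."""
    low, high = theta
    if depth >= max_depth or score >= high:
        return "stop"
    if score < low:
        return "prune"
    return "continue"

def evaluate(theta, trajectories):
    """Offline reward: 1 if the controller stops on a correct step."""
    reward = 0
    for traj in trajectories:               # traj: [(score, is_correct), ...]
        for depth, (score, ok) in enumerate(traj):
            act = controller_action(theta, score, depth)
            if act == "stop":
                reward += ok
                break
            if act == "prune":
                break
    return reward / len(trajectories)

trajectories = [[(0.3, 0), (0.9, 1)], [(0.1, 0)]]
grid = [(0.2, 0.8), (0.05, 0.95)]
best = max(grid, key=lambda t: evaluate(t, trajectories))
```

The economics follow directly: once trajectories are cached, each candidate controller costs microseconds, so a $39.90 search budget buys an enormous number of evaluations.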
How to Train Your Latent Diffusion Language Model Jointly With the Latent Space (LDLM) · May 8 · arXiv:2605.07933
→ Trains the latent encoder, diffusion model, and decoder jointly — the image-diffusion recipe that text models had not previously pulled off cleanly. Beats discrete and continuous diffusion baselines at 2–13× decode speed. With contemporaneous Cola DLM (arXiv:2605.06548), latent diffusion is now the most credible non-autoregressive track.
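What "jointly" means mechanically, in a toy scalar sketch with notation of our own choosing (the paper's parameterization will differ): both loss terms see `enc(x)`, so gradients from the denoising objective reach the encoder rather than training the latent space in a frozen first stage.

```python
# Hypothetical joint objective: autoencoder reconstruction plus a
# latent-space denoising term, sharing the encoder.
import math

def joint_loss(x, enc, dec, eps_pred, t, noise, lam=1.0):
    """Reconstruction term shapes the latent; denoising term trains the
    diffusion model in that same, still-moving latent space."""
    z = enc(x)
    z_t = math.sqrt(1 - t) * z + math.sqrt(t) * noise   # noised latent
    recon = (dec(z) - x) ** 2                           # decoder term
    denoise = (eps_pred(z_t, t) - noise) ** 2           # diffusion term
    return recon + lam * denoise

# A perfectly invertible toy autoencoder drives the reconstruction term
# to zero, leaving only the denoising term.
loss = joint_loss(1.0, enc=lambda x: 2 * x, dec=lambda z: z / 2,
                  eps_pred=lambda z, t: 0.0, t=0.5, noise=0.7)
```

The instability the two-stage recipe avoids is visible even here: as `enc` updates, the distribution `eps_pred` must denoise moves underneath it, which is the balancing act joint training has to win.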
Superposition Is Not Necessary: Mechanistic Interpretability of Transformer Representations for Time Series Forecasting · May 6 · arXiv:2605.05151
→ SAEs on PatchTST find the forecasting transformer doesn't rely on superposition — a counterexample to the assumption that SAE decomposition only makes sense for densely polysemantic LLM features. The "superposition is fundamental" framing needs caveats outside language.
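For readers new to the tooling, the probe itself is small. A minimal SAE forward pass of the kind applied to transformer activations, as a generic sketch rather than the paper's exact configuration:

```python
# Generic sparse-autoencoder forward pass (illustrative, not the paper's
# hyperparameters): ReLU features, linear decoder, MSE + L1 loss.
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, l1=1e-3):
    """Encode activations into (hopefully) sparse features, reconstruct,
    and score with MSE plus an L1 penalty that rewards sparsity."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # feature activations
    x_hat = f @ W_dec                        # linear reconstruction
    loss = np.mean((x_hat - x) ** 2) + l1 * np.abs(f).sum()
    return f, x_hat, loss

# Identity weights: perfect reconstruction, so loss reduces to the L1 term.
x = np.array([[1.0, 0.0]])
I = np.eye(2)
f, x_hat, loss = sae_forward(x, I, np.zeros(2), I)
```

The paper's finding, in these terms: for PatchTST the learned `W_dec` directions line up with individual activation dimensions rather than tightly packed overlapping features, so the superposition story that motivates SAEs in LLMs simply is not needed there.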
Worth Reading
- Navigating by Old Maps: Pitfalls of Static Mechanistic Localization in LLM Post-Training — locate-then-update fails when post-training shifts parameters under the located circuits; read before shipping interpretability-guided edits
- Anthropic's "Teaching Claude Why" research note — public-facing companion to the alignment blog; cleaner on the SDF recipe
- AutoTTS project page + code — cheapest way to see if your benchmark gives up new TTS controllers under the same search
Frontier results aren't arriving as model-card numbers anymore — they arrive as workflows, supervision recipes, and discovered controllers wrapped around a base that hasn't moved in three months. Whoever masters the wrapper sets the next benchmark.