Three results this week argue the frontier is moving up a level: agentic systems doing original math, alignment delivered through documents not demonstrations, and inference-time strategies discovered automatically. None are scale stories — all three are about what you wrap around the model.

Key Takeaways

  • Math SOTA decoupled from the base model. Co-Mathematician's 48% on FrontierMath Tier 4 is a workflow result on Gemini 3 Deep Think — treat hard-benchmark SOTA as a system-level claim from here on.
  • Document-based alignment is credible. Anthropic's synthetic-document finetuning (SDF) on a constitution (96% → 0% blackmail, Opus 4 to Haiku 4.5+) shifts alignment R&D from demonstration-heavy SFT toward principles + character. Re-cost your safety pipelines.
  • Test-time scaling strategies are search problems. AutoTTS discovered TTS controllers that beat hand-designed baselines for $39.90. The hand-tuned width/depth-schedule era is over.
  • Reasoning failures are incoherent, not adversarial. Hot Mess keeps holding: as trajectories lengthen, error shifts systematic → random. Plan evals around incoherence, not goal-misalignment red-teaming.
  • Latent diffusion LMs cleared the joint-training bar. Two papers in seven days beat discrete diffusion at 2–13× decode speed. The architecture race below the apex is no longer hypothetical.

The Big Story

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI · May 7, 2026 · arXiv:2605.06651 · DeepMind blog
48% on FrontierMath Tier 4 — vs 22.9% for Claude Opus 4.7 — from a workbench wrapping Gemini 3 Deep Think, not a new base model. A project-coordinator agent dispatches specialists across parallel workstreams, keeps a failure-hypothesis log, emits LaTeX with provenance. Interesting pathology: "reviewer-pleasing bias" — critic and prover share a prior, so self-review converges on subtly wrong arguments. A named failure mode adjacent to sycophancy, worth its own evals. Oxford's Marc Lackenby closed an open Kourovka Notebook problem with it after the reviewer flagged a gap.
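The coordinator pattern described above fits in a few lines. Everything below (the function names, the dispatch loop, the failure-hypothesis log format) is an illustrative assumption, not DeepMind's implementation:

```python
def run_workbench(problem, specialists, reviewer, max_rounds=3):
    """Toy coordinator: dispatch specialists, keep a failure-hypothesis log,
    and only accept a draft the reviewer cannot fault."""
    failure_log = []
    for _ in range(max_rounds):
        # Parallel workstreams in the real system; sequential here for clarity.
        drafts = [s(problem, failure_log) for s in specialists]
        for draft in drafts:
            gap = reviewer(draft)
            if gap is None:                 # reviewer found no gap: accept
                return draft, failure_log
            failure_log.append(gap)         # flagged gap feeds the next round
    return None, failure_log

# Toy agents: the specialist improves as failures accumulate; the reviewer
# rejects the first two attempts.
prover = lambda p, log: f"{p}:attempt{len(log)}"
reviewer = lambda d: None if d.endswith("attempt2") else f"gap in {d}"

result, log = run_workbench("kourovka", [prover], reviewer)
print(result)   # kourovka:attempt2
```

Note that this toy reviewer is independent of the prover; the "reviewer-pleasing bias" pathology arises precisely when reviewer and prover share the same underlying model and prior.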


Also This Week

Teaching Claude Why — SDF on constitution + fiction collapses agentic blackmail · May 11 · Anthropic
SDF (synthetic-document finetuning) on a constitutional corpus plus fictional aligned-AI stories cut held-out agentic-blackmail rates 3×; cumulative interventions took blackmail from 96% (Opus 4) to 0% (Haiku 4.5+), with a 3M-token "difficult advice" set giving ~28× the data efficiency of demonstrations.

The Hot Mess of AI: misalignment decomposed into bias and variance · May 2026 · alignment.anthropic.com · arXiv:2601.23045
Longer reasoning chains push errors from systematic toward stochastic, and scale improves accuracy without reducing incoherence; safety cases built around coherent goal-pursuit are calibrating against the wrong distribution.
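The decomposition is easy to operationalize: sample several trajectories per question and split 0/1 loss, Domingos-style, into a bias term (the modal answer is wrong) and a variance term (samples disagree with the modal answer). A minimal sketch, with toy data standing in for real trajectories:

```python
import statistics
from collections import Counter

def bias_variance_01(samples_per_question):
    """0/1-loss decomposition over repeated sampled answers.

    samples_per_question: list of (true_answer, [sampled_answers]) pairs.
    Bias: the modal prediction is wrong (systematic error).
    Variance: how often a sample disagrees with the modal prediction
    (stochastic error, i.e. incoherence).
    """
    bias_terms, var_terms = [], []
    for truth, samples in samples_per_question:
        main = Counter(samples).most_common(1)[0][0]
        bias_terms.append(1.0 if main != truth else 0.0)
        var_terms.append(sum(s != main for s in samples) / len(samples))
    return statistics.mean(bias_terms), statistics.mean(var_terms)

# Toy data: a short-chain regime (consistent wrong answers) versus a
# long-chain regime (scattered answers around the right one).
short_chain = [("4", ["6", "6", "6", "6"]), ("9", ["9", "9", "9", "9"])]
long_chain  = [("4", ["4", "7", "2", "4"]), ("9", ["9", "3", "9", "8"])]

print(bias_variance_01(short_chain))  # (0.5, 0.0): all bias, no variance
print(bias_variance_01(long_chain))   # (0.0, 0.5): no bias, all variance
```

In these terms, the paper's claim is that lengthening chains moves error mass from the first number to the second, and that scale mostly shrinks the first.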

AI for Math Initiative — DeepMind + five institutions · May 12 · Google
Co-launched with Imperial, IAS, IHÉS, Simons, and Tata — the institutional scaffolding under Co-Mathematician; math-AI is getting its own venue stack.

World Models named to MIT Tech Review's 10 Things That Matter in AI · May 12 · MIT Tech Review
Analyst-consensus moment for the post-LLM track — V-JEPA, Genie, Cosmos, and LeCun's AMI Labs as a research program with its own data, evals, and scaling curve.


From the Lab

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling (AutoTTS) · Zheng et al., May 8 · arXiv:2605.08083
Formulates width-depth TTS as controller synthesis over pre-collected reasoning trajectories: an agent picks branch/continue/probe/prune/stop without repeated LLM calls, with a beta parameterization making the search tractable. Discovered controllers beat hand-designed baselines on math benchmarks and generalize to held-out tasks and model scales. Total discovery cost: $39.90 and 160 minutes. The hand-tuned TTS heuristic just became a regression line.
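The core trick, evaluating candidate controllers against cached trajectories so the search never pays for fresh model calls, fits in a short sketch. The reduced action set, features, policy, and replay format below are all illustrative assumptions:

```python
def hand_policy(state):
    """A hand-written baseline controller over cheap per-step features."""
    if state["step"] >= state["max_steps"] or state["confidence"] > 0.9:
        return "stop"
    if state["confidence"] < 0.2:
        return "prune"      # abandon a hopeless branch
    return "continue"       # ("branch"/"probe" omitted to keep the loop tiny)

def replay(policy, cached):
    """Score a controller on cached trajectories, where each trajectory is a
    list of (confidence, is_correct) steps: no new LLM calls are needed."""
    hits, steps = 0, 0
    for traj in cached:
        for i, (conf, correct) in enumerate(traj, start=1):
            steps += 1
            action = policy({"step": i, "confidence": conf,
                             "max_steps": len(traj)})
            if action == "stop":
                hits += correct
                break
            if action == "prune":
                break
    return hits / len(cached), steps / len(cached)

cached = [
    [(0.5, 0), (0.95, 1)],            # confidence climbs, answer is right
    [(0.1, 0), (0.1, 0), (0.1, 1)],   # hopeless-looking branch, pruned early
]
print(replay(hand_policy, cached))    # (0.5, 1.5): accuracy, mean steps used
```

Controller discovery then reduces to proposing variants of `hand_policy` (AutoTTS uses an agent plus the beta parameterization for this) and keeping whichever replays best, which is why the whole search cost tens of dollars.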

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space (LDLM) · May 8 · arXiv:2605.07933
Trains the latent encoder, diffusion model, and decoder jointly — the image-diffusion recipe that text had not cleanly pulled off before. Beats discrete and continuous diffusion baselines at 2–13× decode speed. With the contemporaneous Cola DLM (arXiv:2605.06548), latent diffusion is now the most credible non-autoregressive track.
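"Jointly" means one objective updates all three components, so the diffusion loss shapes the latent space instead of fitting to a frozen encoder. A toy numpy sketch of that coupling, using linear maps and a single noise level (nothing like the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_loss(x, enc, dec, denoiser, t=0.5):
    """One objective for encoder, denoiser, and decoder: in a real trainer,
    gradients from the diffusion term would flow into `enc` too, shaping the
    latent space itself."""
    z = x @ enc                                        # encode features to latents
    eps = rng.standard_normal(z.shape)
    z_noisy = np.sqrt(1 - t) * z + np.sqrt(t) * eps    # forward noising at level t
    diffusion = np.mean((z_noisy @ denoiser - eps) ** 2)  # denoiser predicts noise
    recon = np.mean((z @ dec - x) ** 2)                # decoder reconstructs input
    return recon + diffusion                           # single joint objective

d, k = 8, 4
x = rng.standard_normal((16, d))
enc, dec = rng.standard_normal((d, k)) * 0.1, rng.standard_normal((k, d)) * 0.1
den = rng.standard_normal((k, k)) * 0.1
loss = joint_loss(x, enc, dec, den)
print(loss)
```

The decoupled alternative trains `enc`/`dec` on the reconstruction term first, freezes them, and only then fits the denoiser; the joint version avoids that two-stage mismatch.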

Superposition Is Not Necessary: Mechanistic Interpretability of Transformer Representations for Time Series Forecasting · May 6 · arXiv:2605.05151
SAEs on PatchTST find the forecasting transformer doesn't rely on superposition: features decompose cleanly without the dense polysemanticity assumed for LLMs, a counterexample for anyone treating SAE decomposition as meaningful only in that regime. The "superposition is fundamental" framing needs caveats outside language.
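One cheap proxy for "relies on superposition" is interference between learned feature directions: storing more features than dimensions forces nonzero overlap, while a near-orthogonal dictionary doesn't need it. A toy numpy check (the metric and data are illustrative, not the paper's SAE pipeline):

```python
import numpy as np

def max_interference(features):
    """Max absolute cosine similarity between distinct feature directions
    (rows of `features`). Zero means a perfectly orthogonal dictionary."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    gram = f @ f.T                 # pairwise cosine similarities
    np.fill_diagonal(gram, 0.0)    # ignore each feature's self-similarity
    return np.abs(gram).max()

# A d-dimensional basis stores d features with zero interference...
basis = np.eye(4)
# ...while packing 8 features into 4 dims forces overlap (superposition).
rng = np.random.default_rng(0)
packed = rng.standard_normal((8, 4))

print(max_interference(basis))   # 0.0
print(max_interference(packed))  # clearly above zero
```

The paper's finding, in this picture: the forecasting model sits closer to the `basis` regime than the `packed` one.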


Frontier results aren't arriving as model-card numbers anymore — they arrive as workflows, supervision recipes, and discovered controllers wrapped around a base that hasn't moved in three months. Whoever masters the wrapper sets the next benchmark.