Three inflection points converged this week. Chinchilla's foundational assumption — that unique tokens are effectively infinite — now has a formal refutation in the form of prescriptive scaling laws for data-constrained regimes, published straight to arXiv mid-week. Simultaneously, Anthropic's attribution-graph toolchain cleared the lab-moat and became community infrastructure, and state space models earned an ICLR 2026 oral with results making a genuine case against the vanilla transformer on standard NLP workloads. The field is not diverging; it is consolidating around two axes: extracting more signal from finite data, and understanding what trained models are actually computing.

Get more from AI Weekly

More signal, less noise — pick your channels.

You're reading the weekly brief. Below are the other ways to follow the story — every channel free, easy to leave.

  • → Explore 16 deep dives
    Weekly topic-specific newsletters: Generative AI, Machine Learning, AI in Business, Robotics, Frontier Research, Geopolitics, Healthcare, and more.
    Browse all 16 deep dives →
  • → Breaking AI alerts
    When something major breaks (a $60B acquisition, a regulator's emergency meeting, a frontier model leak), alert subscribers know within hours. Typically 0-2 emails per day.
    Get breaking alerts →
  • → AI News Today (live)
    Live dashboard updated as the scanner finds news: scored stories from the last 48 hours, weekly entity movers, and quarterly trend lines across 113 AI companies, people, and topics.
    Open AI News Today →

Watch & Listen First

State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, AGI — Lex Fridman Podcast #490 with Sebastian Raschka and Nathan Lambert (Allen Institute for AI). A 4.5-hour technical audit of the LLM landscape by two practitioners who understand training pipelines, not just benchmark dashboards.
YouTube · Spotify

The Utility of Interpretability — Emmanuel Amiesen, Anthropic · Direct companion to this week's circuit tracer open-source release; Amiesen walks through what attribution graphs reveal and where they structurally break down.
Latent Space


Key Takeaways

  • The Chinchilla regime is over for frontier labs. New prescriptive scaling laws show that past the unique-token threshold, data repetition strictly hurts — compute should buy model capacity instead.
  • Mamba-3 MIMO posts +1.8pp accuracy over Gated DeltaNet at 1.5B scale, with half Mamba-2's state size: the competitive case for SSMs is no longer theoretical.
  • Attribution graphs are now community infrastructure. Anthropic's circuit tracer runs on Gemma-2-2b and Llama-3.2-1b; the lab-moat on mechanistic interpretability tooling is gone.
  • Four Chinese open-weights coding models hit Western frontier parity at ≤1/3 inference cost in 12 days — commoditization is present-tense, not a trend.
  • ClawBench clocks the best frontier model at 33.3% on 144 live production websites — the most honest agent capability measure yet published.

The Big Story

Prescriptive Scaling Laws Formalize When Data Repetition Becomes Counterproductive · May 2, 2026 · arXiv 2605.01640

Chinchilla assumed infinite unique tokens; this paper replaces that fiction with a closed-form prescription. Lovelace, Belardi, Kundurthy et al. model excess loss under token repetition as an additive overfitting penalty and derive that, beyond a dataset-dependent threshold, repeating tokens is strictly dominated by spending the same FLOP budget on model capacity. The decisive empirical lever is weight decay: λ=1.0 reduces the overfitting coefficient by ~70%, the first scaling-law-grounded justification for the much larger weight decay values (an order of magnitude above standard practice) that labs have been quietly using in data-constrained runs. For any team whose unique token count is below compute-optimal, this paper defines the rational budget allocation.


Also This Week

Anthropic Open-Sources Circuit Tracer, Bringing Attribution Graphs to Community Models · May 2026 · Anthropic Research
→ Researchers can now generate Anthropic-style attribution graphs on Gemma-2-2b and Llama-3.2-1b and probe hypotheses by modifying feature activations directly — mechanistic interpretability can now scale with community contributors rather than being gated by model access.

Four Chinese Open-Weight Coding Models Hit Western Frontier Parity in 12 Days · April 2026 · Source
→ GLM-5.1, MiniMax M2.7, Kimi K2.6, and DeepSeek V4 match Western-frontier agentic-engineering benchmarks at ≤1/3 the inference cost of Claude Opus 4.7 — the inference-cost curve just changed structurally for anyone building on frontier-class models.

ClawBench Runs 153 Agent Tasks on 144 Live Production Websites, Best Score: 33.3% · April 2026 · UBC / Vector Institute
→ Unlike sandboxed evals, ClawBench measures against real production state; Claude Sonnet 4.6's 33.3% top score should replace optimistic sandbox numbers as the calibration point for any web-agent deployment decision.


From the Lab

Prescriptive Scaling Laws for Data Constrained Training · arXiv 2605.01640
→ The authors fit penalty term ε(n_rep, N) to empirical loss curves spanning 70M–7B model sizes and 1×–16× repetition counts, finding the penalty grows super-linearly with repetitions but sub-linearly with model capacity — exactly why larger models tolerate modest data repetition better. The closed-form transition point prescribes when scaling parameters dominates scaling data, and the weight decay result (λ=1.0 cuts the overfitting coefficient ~70%) reframes the "more data is always better" heuristic as a special case valid only in token-abundant regimes. Every frontier training run planned for H2 2026 under data constraints should be recalculated against this.
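The qualitative shape of the fit can be sketched in a few lines. The toy below assumes a Chinchilla-style base loss plus a hypothetical penalty with super-linear growth in repetitions (p > 1) and sub-linear damping in model capacity (0 < q < 1); every coefficient and exponent here is illustrative, not the paper's fitted values.

```python
# Illustrative sketch only: coefficients and exponents are hypothetical,
# chosen to match the qualitative behavior described above.

def base_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-form loss: N = parameters, D = total tokens seen."""
    return E + A / N**alpha + B / D**beta

def overfit_penalty(n_rep, N, c=0.02, p=1.5, q=0.3, wd_factor=1.0):
    """Additive penalty for repeated data: super-linear in repetitions,
    sub-linear in capacity. wd_factor ~0.3 stands in for the ~70%
    coefficient reduction reported for weight decay lambda = 1.0."""
    extra = max(n_rep - 1.0, 0.0)  # no penalty for a single epoch
    return wd_factor * c * extra**p / N**q

def data_constrained_loss(N, unique_tokens, flops, wd_factor=1.0):
    """Tokens seen is fixed by the FLOP budget (D ~ flops / 6N);
    repetitions follow from the unique-token supply."""
    D = flops / (6 * N)
    n_rep = max(D / unique_tokens, 1.0)
    return base_loss(N, D) + overfit_penalty(n_rep, N, wd_factor=wd_factor)

# Sweep model size under a fixed FLOP budget with limited unique tokens:
# past the threshold, the optimum shifts toward larger N (fewer repeats).
flops, unique = 1e21, 5e9
best = min((data_constrained_loss(N, unique, flops), N)
           for N in [1e8, 3e8, 1e9, 3e9, 1e10])
```

The point of the sketch is the allocation logic, not the numbers: once repetitions dominate the penalty term, the same FLOPs buy more loss reduction through N than through another pass over the data.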

Mamba-3: Improved Sequence Modeling using State Space Principles · arXiv 2603.15569 · ICLR 2026 Oral
→ Three compounding innovations: Generalized Trapezoidal Rule discretization replaces first-order Euler approximation with a second-order recurrence; complex-valued states are reframed as real-valued SSM + data-dependent RoPE (preserving decode latency while capturing oscillatory dynamics); and a MIMO formulation enables multi-channel state mixing absent from Mamba-1/2. At 1.5B scale, Mamba-3 MIMO posts +1.8pp average downstream accuracy versus Gated DeltaNet at half Mamba-2's state size — the first SSM result that competes on standard downstream NLP benchmarks, not just long-context or streaming-inference applications. ICLR oral status signals community consensus on architecture-level significance.
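The discretization upgrade is the easiest piece to see concretely. A minimal sketch, using a plain scalar SSM dh/dt = a·h + b·x rather than Mamba-3's actual parameterization, contrasts the first-order Euler step with a trapezoidal recurrence of the kind the paper generalizes:

```python
import math

# Conceptual sketch (not Mamba-3's parameterization): discretize the
# scalar SSM  dh/dt = a*h + b*x  two ways and compare against the exact
# solution for a constant input.

def euler_step(h, x, a, b, dt):
    # h_{t+1} = h_t + dt * (a h_t + b x_t)            -- first order
    return h + dt * (a * h + b * x)

def trapezoidal_step(h, x_prev, x_next, a, b, dt):
    # (1 - dt a/2) h_{t+1} = (1 + dt a/2) h_t + dt b (x_t + x_{t+1}) / 2
    num = (1 + dt * a / 2) * h + dt * b * (x_prev + x_next) / 2
    return num / (1 - dt * a / 2)

# Constant input x = 1, decay a = -1: exact h(t) = (b/|a|) * (1 - e^{a t}).
a, b, dt, steps = -1.0, 1.0, 0.1, 10
h_e = h_t = 0.0
for _ in range(steps):
    h_e = euler_step(h_e, 1.0, a, b, dt)
    h_t = trapezoidal_step(h_t, 1.0, 1.0, a, b, dt)
exact = (b / -a) * (1 - math.exp(a * dt * steps))
# The trapezoidal state tracks the exact integral markedly more closely.
```

Averaging the endpoint derivatives is what buys the second-order accuracy; the same recurrence structure keeps the update a single linear step, which is why it slots into an SSM scan without changing the decode path.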

Open-Source Circuit Tracing: Attribution Graphs on Open-Weights Models · Anthropic Research
→ The library generates attribution graphs on Gemma-2-2b and Llama-3.2-1b and supports interactive visualization via Neuronpedia; crucially, hypothesis testing through direct feature-activation modification converts attribution graphs from passive read artifacts into active experimental instruments. This infrastructure shift positions mechanistic interpretability as an empirical discipline with shared tooling rather than a collection of one-off Anthropic case studies — the downstream implication is that circuit-tracing results will now compound across labs.
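The read-intervene-reread loop that makes graphs active experimental instruments can be sketched without the library itself. The two-feature toy "model" and all names below are invented for illustration; this is the intervention pattern, not the circuit-tracer API:

```python
# Toy illustration of feature-activation intervention, not the
# circuit-tracer API: run a model, record a feature, re-run with that
# feature clamped, and read off the causal effect on the output.

def forward(x, clamp=None):
    """Tiny 2-feature 'model': hidden features then a linear readout.
    `clamp` maps feature index -> forced activation value."""
    hidden = [max(0.0, 2.0 * x - 1.0),   # feature 0
              max(0.0, -x + 1.5)]        # feature 1
    if clamp:
        for idx, val in clamp.items():
            hidden[idx] = val
    return 3.0 * hidden[0] - 1.0 * hidden[1], hidden

# 1. Read: which features fire on this input, and how strongly?
out_base, hidden = forward(1.0)
# 2. Intervene: zero the dominant feature and re-run the same input.
out_ablate, _ = forward(1.0, clamp={0: 0.0})
# 3. Attribute: the output shift estimates that feature's causal effect.
effect = out_base - out_ablate
```

Swapping the toy for a real model and the clamp for a learned-feature edit is the whole workflow shift: attribution graphs propose a hypothesis, and a clamped re-run tests it.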


Worth Reading


Data-constrained scaling laws are the new Chinchilla — and every 2026 training run budget should be recalculated against them before the first token flows.
