The dominant theme this week is not a new frontier model. It is the stack underneath catching up. Anthropic shipped a method to read activations as English. PyTorch shipped the first stable Blackwell driver matrix. And a diffusion LM from Kaiming He's lab cleared discrete diffusion baselines with roughly 10x fewer training tokens.
Watch & Listen First
- Latent Space — Doing Vibe Physics, Alex Lupsasca, OpenAI (Latent.Space, May 5) — how GPT-5.x derived a new single-minus gluon amplitude, posted as an IAS/Vanderbilt/Cambridge/Harvard preprint
- TWIML #767 — How to Find the Agent Failures Your Evals Miss, with Scott Clark (TWIML, May 7) — trace-to-vector fingerprints for surfacing unknown-unknowns in prod LLM systems
- No Priors — Baseten CEO Tuhin Srivastava on the AI Inference Wars (Apple Podcasts, May 1) — 30x growth, 18 clouds, why inference is the strategic last market
Key Takeaways
- Add NLAs to your interpretability stack. Anthropic's natural language autoencoders turn residual-stream activations into English with 0.6–0.8 FVE (fraction of variance explained), catching unverbalized evaluation awareness that chain-of-thought audits miss.
- Plan your CUDA 13.2 trial now. PyTorch 2.12 ships CUDA 13.2 experimental for Blackwell. CUDA 12.8 is removed from CI, CUDA 13.0 stays the PyPI stable build. Pin before the next release decides for you.
- Re-evaluate diffusion LMs for non-coding workloads. ELF beats discrete diffusion baselines on translation and summarization with roughly 10x fewer training tokens. The autoregressive default is no longer obvious for non-code generation.
- Audit your alignment training data, not just your prompts. Anthropic traced 96% blackmail rates on Opus 4 to internet evil-AI fiction. Constitution training plus admirable-behavior stories cut agentic misalignment more than 3x and zeroed the rate on Haiku 4.5.
- Evaluation awareness is measurable now. NLAs flag eval awareness on 26% of SWE-bench Verified problems and 16% of destructive-action tests, versus under 1% of opted-in claude.ai traffic.
The Big Story
Anthropic Publishes Natural Language Autoencoders, Reads Claude's Internal State in English · May 7 · Transformer Circuits
→ NLAs are two LLM modules trained jointly with RL: an activation verbalizer that maps a residual-stream vector to a text description, and an activation reconstructor that maps the description back to activation space. The pair hits 0.6–0.8 FVE across Haiku 3.5, Haiku 4.5, and Opus 4.6. NLA-equipped auditors found root-cause misalignment in 12–15% of probe cases versus under 3% without NLAs, and surfaced unverbalized evaluation awareness on 26% of SWE-bench Verified problems.
→ This is the most consequential interpretability result since sparse autoencoders, because explanations land in the language practitioners debug in. Training code and pretrained NLAs ship on GitHub. The companion "Teaching Claude Why" paper drove Opus 4's 96% blackmail rate to zero by combining constitution training with admirable-AI stories, after tracing the original behavior to evil-AI tropes in the pretraining corpus (TechCrunch).
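The 0.6–0.8 FVE figure is easy to ground by computing one yourself. A minimal sketch, assuming FVE here means the standard fraction of variance explained between original activations and their round-trip reconstructions (the paper's exact estimator may differ, and the synthetic "reconstruction" below is purely illustrative):

```python
import numpy as np

def fraction_of_variance_explained(acts, recon):
    """FVE = 1 - residual sum of squares / total variance, computed
    over a batch of activation vectors and their reconstructions."""
    acts = np.asarray(acts, dtype=float)
    recon = np.asarray(recon, dtype=float)
    residual = np.sum((acts - recon) ** 2)
    total = np.sum((acts - acts.mean(axis=0)) ** 2)
    return 1.0 - residual / total

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))                     # stand-in residual-stream activations
x_hat = 0.8 * x + 0.2 * rng.normal(size=x.shape)   # imperfect synthetic "reconstruction"
print(round(fraction_of_variance_explained(x, x_hat), 2))
```

A perfect reconstructor scores 1.0; predicting only the batch mean scores 0.0, which is what makes 0.6–0.8 a strong number for a text bottleneck.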
Also This Week
PyTorch 2.12 Lands With CUDA 13.2 Experimental, CUDA 12.8 Removed · May 13 · PyPI
→ CUDA 13.0 remains the PyPI default, CUDA 13.2 ships experimentally with the expanded Blackwell sm_120 path, CUDA 12.6 stays for Maxwell/Pascal/Volta (Dev Discuss). First release where the Blackwell Ultra driver matrix is officially supported rather than nightly.
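If you want to decide when the CUDA 13.2 trial happens rather than letting a default upgrade decide for you, pin the wheel index explicitly. A hedged config sketch, assuming the usual download.pytorch.org index naming (cu130 / cu132) carries over to this release; check the release notes before copying:

```shell
# Stay on the stable CUDA 13.0 build (the PyPI default for 2.12):
pip install "torch==2.12.*" --index-url https://download.pytorch.org/whl/cu130

# Trial the experimental CUDA 13.2 / Blackwell sm_120 build in a
# separate environment, so a bad driver pairing can't reach prod:
pip install "torch==2.12.*" --index-url https://download.pytorch.org/whl/cu132

# Confirm which toolkit the installed wheel was built against:
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```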
DeepMind Reframes the Mouse Pointer as an AI Interaction Primitive · May 12 · Google DeepMind
→ A research preview turning the cursor into a context-aware agent for visual grounding and cross-app flow. The underlying assumption (per-pixel VLM reasoning at interaction latency) is what's quietly forcing inference economics across every consumer surface.
Hugging Face Trending Papers Skew Toward Visual Agent Harnesses · May 11 · HF Papers Week 20
→ Top of trending is the HKUST visual-native agent harness with image-bank reference protocol, producing reusable intermediate visual evidence for closed-loop multimodal search. Vision-language eval pipelines are converging on persistent visual scratchpads.
SemiAnalysis: Frontier Lab Margins Are Expanding Even as Token Prices Fall · May 1 · SemiAnalysis
→ Opus 4.5 shipped at one-third the price of prior Opus tiers, yet Opus-token margins are up via software and hardware co-design. Self-hosted open-weight economics now have to be benchmarked against that gap, not last year's API price.
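"Benchmark against that gap" is ultimately a break-even calculation. A toy sketch with hypothetical numbers (none of these figures come from SemiAnalysis): flat monthly GPU spend for self-hosting versus a per-million-token API price.

```python
def breakeven_mtok_per_month(gpu_monthly_usd, api_usd_per_mtok, selfhost_usd_per_mtok):
    """Monthly volume (in millions of tokens) above which self-hosting wins.

    Self-hosting pays a flat GPU bill plus a marginal per-Mtok cost
    (power, ops); the API path pays only its per-Mtok price.
    """
    margin = api_usd_per_mtok - selfhost_usd_per_mtok
    if margin <= 0:
        return float("inf")  # API is cheaper at any volume
    return gpu_monthly_usd / margin

# Hypothetical: $8k/month of GPUs, API at $5/Mtok, $1/Mtok marginal self-host cost
print(breakeven_mtok_per_month(8000, 5.0, 1.0))  # -> 2000.0 Mtok/month
```

The article's point is that the API price in this formula keeps falling while lab margins expand, which pushes the break-even volume up for everyone self-hosting open weights.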
From the Lab
"ELF: Embedded Language Flows" · arXiv 2605.10938
→ Hu, Qiu, Lu, Li, Kim, Andreas, and Kaiming He propose a continuous-time Flow Matching diffusion LM that stays in embedding space until the final step, where a shared-weight head maps to discrete tokens. The 105M-parameter ELF beats leading discrete and continuous DLMs on machine translation and summarization with roughly 10x fewer training tokens and fewer inference steps. Classifier-free guidance transfers cleanly from image diffusion. If you wrote off diffusion LMs after the 2024 wave, this re-opens the question.
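ELF's training signal is conditional flow matching: interpolate between noise and a data embedding, and regress a velocity model onto the straight-line target. A toy numpy sketch of that objective on Gaussian stand-ins for token embeddings (nothing here is the paper's architecture; the linear velocity model and every constant are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                         # toy embedding dimension
data = rng.normal(loc=2.0, size=(512, D))     # stand-in token embeddings

def features(x, t):
    # Linear velocity model v(x, t) = W @ [x, t, 1]
    return np.concatenate([x, t, np.ones_like(t)], axis=1)

W = np.zeros((D, D + 2))
lr = 0.05
for _ in range(2000):
    x1 = data[rng.integers(0, len(data), 64)]  # data endpoint
    x0 = rng.normal(size=x1.shape)             # noise endpoint
    t = rng.uniform(size=(64, 1))
    xt = (1 - t) * x0 + t * x1                 # straight-line interpolant
    target = x1 - x0                           # flow-matching target velocity
    pred = features(xt, t) @ W.T
    grad = (pred - target).T @ features(xt, t) / len(x1)  # MSE gradient wrt W
    W -= lr * grad

# Sample by integrating dx/dt = v(x, t) from noise with 10 Euler steps.
x = rng.normal(size=(256, D))
for step in range(10):
    t = np.full((256, 1), step / 10)
    x = x + 0.1 * (features(x, t) @ W.T)

print(round(float(x.mean()), 1))  # drifts from ~0.0 toward the data mean of 2.0
```

ELF's distinguishing move is doing exactly this kind of continuous flow in embedding space and only snapping to discrete tokens at the final step, via a shared-weight output head.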
Worth Reading
- Anthropic — Natural Language Autoencoders research page — cleanest worked examples of evaluation awareness firing without CoT traces
- Hugging Face Trending Papers — fastest single-page view of what the open community is converging on this week
- PyTorch 2.12 / CUDA 13.2 thread on Dev Discuss — driver-matrix decisions and Blackwell timeline from the release engineers
The week's signal: interpretability became deployable, the inference-cost frontier moved into compiler and driver work, and the next architecture upset for language modeling is shaping up as a diffusion model from a vision lab.