transformer-circuits.pub web signal

Anthropic traces Claude 3.5 Haiku's inner reasoning circuits

TL;DR

  • Anthropic researchers apply attribution graphs, built on a cross-layer transcoder with 30 million features, to trace how Claude 3.5 Haiku arrives at answers.
  • For a Dallas capital query, the model activates an intermediate 'Texas' representation before selecting 'Austin', evidence of genuine two-hop reasoning.
  • The authors say their methods produce useful insight on about a quarter of the prompts they tried, and they report forward planning in roughly half of the poems examined.

Anthropic has published a long case-study report on how one of its own models actually computes an answer, and it is worth reading as much for the caveats as for the flashy findings. A team led by Jack Lindsey posted 'On the Biology of a Large Language Model' on the transformer-circuits.pub research site. The report walks through a series of experiments that use attribution graphs, built on a cross-layer transcoder with roughly 30 million features, to trace how Claude 3.5 Haiku arrives at specific outputs.

The fun findings read like small mechanistic surprises. When the researchers ask the model for the capital of the state containing Dallas, the internal circuit activates an intermediate 'Texas' representation before selecting 'Austin', which looks like genuine two-hop reasoning rather than a memorized shortcut. In roughly half of the rhyming couplets they examined, the model appears to pick candidate end-of-line words early and then steer the intermediate words toward that target, a small kind of forward planning. They also describe shared 'multilingual features' that become more prominent with scale, hallucination circuits with a default 'can't answer' state that gets suppressed by a 'known entity' signal, and a letter-acrostic jailbreak that slips through because, as the authors put it, 'the model doesn't know what it plans to say until it actually says it'.
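The 'default can't answer, unless suppressed' idea is easier to hold onto with a toy in hand. The sketch below is only an illustration of that wiring, not the circuit the authors found: the function names, the single scalar signal, and the weights are all assumptions made up for the example.

```python
def refusal_activation(known_entity_signal: float,
                       default_drive: float = 1.0,
                       inhibition_weight: float = 2.0) -> float:
    """Hypothetical 'can't answer' unit: on by default, inhibited by recognition.

    All three parameters are illustrative assumptions, not values from the report.
    """
    # The unit starts at default_drive and is pushed down by the
    # 'known entity' signal; it cannot go below zero.
    return max(0.0, default_drive - inhibition_weight * known_entity_signal)


def respond(known_entity_signal: float) -> str:
    """Decline while the refusal unit is still active, otherwise answer."""
    return "decline" if refusal_activation(known_entity_signal) > 0.0 else "answer"


if __name__ == "__main__":
    print(respond(0.0))  # no recognition: the default state wins
    print(respond(0.9))  # strong recognition: the default is suppressed
```

In this toy, anything that raises the recognition signal switches the output from declining to answering, whether or not any facts are actually available downstream, which is why a circuit shaped like this is a natural place to look for hallucinations.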

For practitioners, the interesting shift is that attribution graphs try to string interpretability features into causal chains you can perturb and test, rather than leaving you to inspect isolated neurons one at a time. The circuits behind refusal and hallucination in particular are the ones a safety team would most want a handle on for audit purposes.
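What 'perturb and test' buys you can be shown with the Dallas example and a pair of lookup tables. This is a minimal sketch under loud assumptions: the dictionaries, the two toy 'models', and the `patch_state` argument are hypothetical stand-ins, and nothing here touches a real network or the authors' tooling. The point is only the logic of the test: if an intermediate step is causal, overwriting it should change the answer.

```python
from typing import Callable, Optional

# Hypothetical toy facts, chosen only for illustration.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}
CITY_TO_CAPITAL = {"Dallas": "Austin", "Oakland": "Sacramento"}

Model = Callable[[str, Optional[str]], str]


def two_hop(city: str, patch_state: Optional[str] = None) -> str:
    """Goes city -> state -> capital; the state can be overwritten."""
    state = patch_state if patch_state is not None else CITY_TO_STATE[city]
    return STATE_TO_CAPITAL[state]


def shortcut(city: str, patch_state: Optional[str] = None) -> str:
    """Memorized city -> capital; there is no state step for a patch to act on."""
    return CITY_TO_CAPITAL[city]


def depends_on_intermediate(model: Model, city: str, patch_state: str) -> bool:
    """True if overwriting the intermediate state changes the model's output."""
    return model(city, None) != model(city, patch_state)


if __name__ == "__main__":
    for name, model in [("two_hop", two_hop), ("shortcut", shortcut)]:
        changed = depends_on_intermediate(model, "Dallas", "California")
        print(name, "output changes under patch:", changed)
```

Both toy models give the same answer for Dallas when left alone, so output alone cannot tell them apart. Only the intervention does, which is the sense in which a perturbable chain is more informative than a list of features that happen to be active.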

The authors are unusually candid about the limits. They say their tools yield useful insight on 'about a quarter of the prompts' they tried, and everything runs through a 'replacement model' that 'incompletely and imperfectly captures the original'. The paper does not claim that this method scales cleanly to bigger frontier systems, or that any specific mechanism will look the same in Anthropic's larger models. The case studies are best read as existence proofs, not a general theory of the model.

Still, this is the kind of interpretability work most likely to shape how safety cases get written for foundation models over the coming year. If attribution graphs can be pushed to larger systems without losing resolution, the audit story for frontier models gets meaningfully more concrete.
