arxiv.org web signal

Andreas group reproduces attention heads as Python code

TL;DR

  • Fewer than 1,000 LM-generated programs reproduce attention patterns in GPT-2, TinyLlama-1.1B and Llama-3B with an average IoU above 75% on TinyStories.
  • Replacing 25% of attention heads with the synthesized programs raised average perplexity by only 16% while preserving performance on downstream question-answering benchmarks.
  • The pipeline computes a head's attention matrices, prompts a pretrained LM to write Python that reproduces them, then re-ranks by held-out accuracy.

A new paper on arXiv chips away at the assumption that mechanistic interpretability has to mean either hand-built circuits or sparse autoencoders. In "Explaining Attention with Program Synthesis", Amiri Hayes, Belinda Z Li and Jacob Andreas take the attention head out of vector-land and try to rewrite it as Python.

The pipeline is concrete. For a given head, they compute its attention matrices on randomly selected training examples, prompt a pretrained language model with a summary of those matrices, and instruct it to generate Python programs that reproduce the patterns from the input sentence alone. The programs are then re-ranked by how well they predict behavior on held-out inputs. The headline number is that fewer than 1,000 such programs are enough to cover attention heads across GPT-2, TinyLlama-1.1B and Llama-3B, with an average Intersection-over-Union similarity above 75% on TinyStories.
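To make the scoring and re-ranking step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the two candidate programs, the `iou` and `rerank` helpers, and the choice to binarize attention at a 0.5 threshold before computing Intersection-over-Union are not taken from the paper, which may define the metric differently.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]
HeadProgram = Callable[[Sequence[str]], Matrix]


def previous_token_head(tokens: Sequence[str]) -> Matrix:
    """Hypothetical candidate: each token attends to the token before it."""
    n = len(tokens)
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        attn[i][max(i - 1, 0)] = 1.0  # the first token attends to itself
    return attn


def first_token_head(tokens: Sequence[str]) -> Matrix:
    """Hypothetical candidate: every token attends to the first token."""
    n = len(tokens)
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        attn[i][0] = 1.0
    return attn


def iou(pred: Matrix, target: Matrix, threshold: float = 0.5) -> float:
    """IoU over (query, key) cells whose weight reaches the threshold.

    Binarizing at a threshold is an assumption made for this sketch.
    """
    pred_on = {(i, j) for i, row in enumerate(pred)
               for j, v in enumerate(row) if v >= threshold}
    target_on = {(i, j) for i, row in enumerate(target)
                 for j, v in enumerate(row) if v >= threshold}
    union = pred_on | target_on
    if not union:
        return 1.0  # both matrices are empty, so they agree trivially
    return len(pred_on & target_on) / len(union)


def rerank(candidates: Dict[str, HeadProgram],
           held_out: List[Tuple[Sequence[str], Matrix]]) -> List[Tuple[float, str]]:
    """Sort candidate programs by mean IoU on held-out (tokens, attention) pairs."""
    scored = []
    for name, program in candidates.items():
        scores = [iou(program(tokens), attn) for tokens, attn in held_out]
        scored.append((sum(scores) / len(scores), name))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    tokens = ["the", "cat", "sat"]
    # Stand-in for a real head's attention matrix on a held-out sentence.
    observed = previous_token_head(tokens)
    candidates = {"previous_token": previous_token_head,
                  "first_token": first_token_head}
    print(rerank(candidates, [(tokens, observed)]))
```

In this toy setup the stand-in "observed" matrix is itself a previous-token pattern, so the matching candidate ranks first. With a real model, the observed matrices would come from the head's forward pass on held-out sentences.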

The sterner test is substitution. Swap 25% of the attention heads in each model for their programmatic surrogates and you pay a 16% average perplexity increase while keeping performance on a variety of downstream question-answering benchmarks. That is not a free swap, but it is a measurable one, and it suggests the synthesized code is doing something closer to a head's actual job than a mere curve fit would.
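For a sense of scale, a 16% perplexity increase can be translated into per-token loss. The sketch below assumes the standard definition of perplexity as the exponential of the mean negative log-likelihood; the helper names and the baseline loss values are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(base_ppl: float, new_ppl: float) -> float:
    """Fractional change in perplexity, e.g. 0.16 for a 16% increase."""
    return new_ppl / base_ppl - 1.0


def extra_nats_per_token(rel_increase: float) -> float:
    """Mean extra loss per token implied by a relative perplexity increase."""
    return math.log(1.0 + rel_increase)


if __name__ == "__main__":
    # Illustrative per-token losses, not measurements from the paper.
    baseline_nlls = [2.0, 1.5, 2.5]
    shift = extra_nats_per_token(0.16)
    substituted_nlls = [x + shift for x in baseline_nlls]
    rel = relative_increase(perplexity(baseline_nlls),
                            perplexity(substituted_nlls))
    print(f"extra loss per token: {shift:.4f} nats, perplexity change: {rel:.2%}")
```

Under that definition, a 16% rise corresponds to roughly 0.15 additional nats of loss per token on average, whatever the baseline perplexity is.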

The honest caveat is that an IoU above 75% still leaves a real chunk of behavior unmodeled, and the experiments here live on relatively small open models. There is no claim in the paper that the same pipeline holds at frontier scale, and using a pretrained language model to explain another language model is a circularity an external auditor would reasonably want to probe.

What makes this worth tracking is the artifact more than the benchmark. A Python program describing what an attention head does is something you can grep, diff between checkpoints, share with reviewers and reason about without touching a GPU. If the small-model results extend, the open-model community gets a credible path toward shipping human-readable documentation alongside weights, which is a shift in what "interpretable" even means in practice.
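As a rough illustration of the "diff between checkpoints" idea, the sketch below compares two program texts with Python's standard `difflib`. The two head programs, the file names and the `diff_programs` helper are hypothetical and only stand in for whatever a real synthesis run would produce.

```python
import difflib
from typing import List

# Hypothetical synthesized programs for the same head at two checkpoints.
PROGRAM_CKPT_A = '''def head(tokens):
    n = len(tokens)
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)] for i in range(n)]
'''

PROGRAM_CKPT_B = '''def head(tokens):
    n = len(tokens)
    return [[1.0 if j == 0 else 0.0 for j in range(n)] for i in range(n)]
'''


def diff_programs(old: str, new: str,
                  old_name: str = "checkpoint_a/head.py",
                  new_name: str = "checkpoint_b/head.py") -> List[str]:
    """Return a unified diff between two program texts as a list of lines."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                     fromfile=old_name, tofile=new_name,
                                     lineterm=""))


if __name__ == "__main__":
    for line in diff_programs(PROGRAM_CKPT_A, PROGRAM_CKPT_B):
        print(line)
```

Here the diff isolates a single changed line: the head's description moves from "attend to the previous token" to "attend to the first token", a change that would be much harder to spot by comparing weight matrices.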
