
LOCOS Finds LLM Attention Heads That Synthesize Meaning

TL;DR

  • LOCOS scores each attention head by projecting its OV-circuit output onto the answer-token unembedding direction in a single forward pass.
  • On Qwen3-8B, ablating the top 50 LOCOS heads drove NoLiMa ROUGE-L from 0.401 to 0.000; the strongest baseline still retained 0.292.
  • The same ablation dropped MuSiQue from 0.55 to 0.08 and BABI-Long from 0.62 to 0.20 while parametric recall and arithmetic stayed at baseline.

Interpretability work on large language models has spent the last couple of years trying to pin down which attention heads actually pull an answer out of a long context. The received recipe is to look at where each head attends and check whether the attended token matches the token the model generates. That works when the model is literally copy-pasting a fact from the prompt, but it misses the case that matters most in practice, where the model reads a span and writes an answer that shares no tokens with the source.
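To make the blind spot concrete, here is a minimal sketch of a read-side detector of the kind described above. It is an illustration, not any specific published baseline: the function name `copy_match_rate`, the toy sentence, and the hand-written attention rows are all assumptions made for the example.

```python
import numpy as np

def copy_match_rate(attn, context_tokens, generated_tokens):
    """Fraction of generation steps where the most-attended context token
    equals the token the model generated at that step.

    attn: (n_steps, n_ctx) attention weights of one head (hypothetical values)
    """
    top_src = attn.argmax(axis=1)  # most-attended context index per step
    hits = [context_tokens[s] == tok for s, tok in zip(top_src, generated_tokens)]
    return sum(hits) / len(hits)

# Toy context, invented for illustration.
context = ["held", "beside", "the", "Eiffel", "Tower"]

# Literal case: the model copies "Eiffel Tower" and the head attends to each copied token.
literal_attn = np.array([[0.0, 0.0, 0.1, 0.8, 0.1],
                         [0.0, 0.0, 0.1, 0.1, 0.8]])
literal_rate = copy_match_rate(literal_attn, context, ["Eiffel", "Tower"])

# Non-literal case: the head attends to the same span, but the answer is "Paris".
nonliteral_attn = np.array([[0.0, 0.0, 0.1, 0.8, 0.1]])
nonliteral_rate = copy_match_rate(nonliteral_attn, context, ["Paris"])

print(literal_rate, nonliteral_rate)
```

In the second case the head looks at exactly the right span, yet a token-match criterion gives it no credit, because the answer token never appears in the context.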

A new arXiv paper, "Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads" by Aryo Pradipta Gema, Beatrice Alex, and Pasquale Minervini, argues that standard attention-based detectors miss the heads behind this non-literal retrieval by construction. Their proposed method, LOCOS, scores each head by projecting its OV-circuit output onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. In plainer words, it grades a head by what it writes into the residual stream, not just where it reads from.
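A rough sketch of that write-side idea, under stated assumptions, might look like the following. It is not the authors' implementation: the exact form of the needle versus off-needle contrast is not given in the abstract, so the difference of means used here is an assumption, as are the function names, the omission of layer normalization, and the toy activations.

```python
import numpy as np

def head_logit_contributions(attn, resid, w_ov, unembed_dir):
    """Per-source-position contribution of one head to the answer-token logit.

    attn:        (n_src,) attention from the answer position to each source position
    resid:       (n_src, d_model) residual-stream inputs at the source positions
    w_ov:        (d_model, d_model) the head's combined value-output (OV) matrix
    unembed_dir: (d_model,) unembedding vector of the answer token
    """
    ov_out = resid @ w_ov                 # what the head would write per source position
    return attn * (ov_out @ unembed_dir)  # projection onto the answer direction

def contrast_score(attn, resid, w_ov, unembed_dir, needle_mask):
    """Assumed contrast: mean contribution on needle positions minus off-needle."""
    contrib = head_logit_contributions(attn, resid, w_ov, unembed_dir)
    return float(contrib[needle_mask].mean() - contrib[~needle_mask].mean())

# Toy setup: 5 source positions, 4-dimensional residual stream.
d_model, n_src = 4, 5
e = np.eye(d_model)
resid = np.tile(e[2], (n_src, 1))  # filler positions carry feature 2
resid[2] = e[0]                    # the needle at position 2 carries feature 0
needle_mask = np.zeros(n_src, dtype=bool)
needle_mask[2] = True
answer_dir = e[1]                  # answer-token unembedding direction

# "Writer" head: spreads attention evenly but maps the needle feature onto the answer direction.
w_writer = np.zeros((d_model, d_model))
w_writer[0, 1] = 1.0
writer_score = contrast_score(np.full(n_src, 0.2), resid, w_writer, answer_dir, needle_mask)

# "Reader" head: attends only to the needle but writes into an unrelated direction.
w_reader = np.zeros((d_model, d_model))
w_reader[0, 3] = 1.0
reader_score = contrast_score(needle_mask.astype(float), resid, w_reader, answer_dir, needle_mask)

print(writer_score, reader_score)
```

The toy reader head would look ideal to an attention-only detector, since all of its attention sits on the needle, yet it scores zero here because nothing it writes points toward the answer token.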

The evidence that this measures something real is the ablation. On the NoLiMa non-literal retrieval benchmark, mean-ablating the top 50 LOCOS-selected heads in Qwen3-8B drives ROUGE-L from 0.401 to 0.000, while the strongest attention-based baseline still leaves the model at 0.292 after cutting the same number of its own top-ranked heads. The specificity claim holds up in the same run: the ablation drops MuSiQue from 0.55 to 0.08 and BABI-Long from 0.62 to 0.20, but parametric recall and arithmetic reasoning stay at baseline, and a random-heads control stays within 0.05 of baseline. The authors report the pattern replicating across Qwen3, Gemma-3, and OLMo-3.1.
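For readers unfamiliar with the procedure, mean ablation can be sketched as below: pick the top-ranked heads, then overwrite each one's output with its average over a reference set. This is a schematic on synthetic arrays, not the paper's code. The helper names `top_k_heads` and `mean_ablate`, the array shapes, and the choice of reference set are assumptions; in a real model the replacement would happen inside the forward pass.

```python
import numpy as np

def top_k_heads(scores, k):
    """Return (layer, head) pairs for the k highest-scoring heads.

    scores: (n_layers, n_heads) array of per-head scores
    """
    flat = np.argsort(scores, axis=None)[::-1][:k]  # flat indices, descending by score
    return [tuple(int(i) for i in np.unravel_index(f, scores.shape)) for f in flat]

def mean_ablate(head_out, mean_out, heads):
    """Replace the selected heads' outputs with their reference-set means.

    head_out, mean_out: (n_layers, n_heads, d_model)
    """
    out = head_out.copy()
    for layer, head in heads:
        out[layer, head] = mean_out[layer, head]
    return out

# Synthetic data: 2 examples, 2 layers, 2 heads, 3-dimensional head outputs.
batch = np.arange(2 * 2 * 2 * 3, dtype=float).reshape(2, 2, 2, 3)
mean_out = batch.mean(axis=0)      # per-head mean over the reference examples

scores = np.array([[0.1, 0.9],
                   [0.5, 0.2]])    # made-up per-head scores
selected = top_k_heads(scores, k=2)
ablated = mean_ablate(batch[0], mean_out, selected)

print(selected)
```

Replacing a head's output with its mean, instead of zeroing it, keeps the downstream activations closer to their usual range, so a drop in task score is easier to attribute to the lost information than to an out-of-distribution input.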

The caveat is that this summary rests on the arXiv abstract alone, which leaves three questions open: how uniformly the top heads overlap across the three model families, how the method behaves on models much larger than 8B, and how sensitive the scoring is to NoLiMa's specific construction. Take the numbers as reported on Qwen3-8B, not as settled facts about every long-context model.

For anyone auditing long-context behavior or considering pruning heads before distilling a smaller model, this is the kind of probe worth having in the kit: a write-side complement to read-side detectors, one that separates a head that finds the right token from a head that carries the meaning.