Wiegreffe and Pinter push back on 'attention is not explanation'

TL;DR

  • Wiegreffe and Pinter respond to Jain and Wallace (2019), arguing that whether attention explains a model depends on how explanation is defined.
  • The paper proposes four tests: a uniform-weights baseline, variance calibration over random seeds, a frozen-weights diagnostic, and adversarial attention training.
  • Even when reliable adversarial attention distributions can be found, the authors report that those distributions perform poorly on the simple diagnostic. In their reading, prior work has therefore not ruled out attention's usefulness for explanation.

Wiegreffe and Pinter's 2019 EMNLP paper, 'Attention is not not Explanation', is still cited whenever someone argues about whether attention weights mean anything. It is a rebuttal to Jain and Wallace's earlier 'Attention is not Explanation'. The double negative in the title is deliberately awkward, and the argument is narrower than the slogan. The authors are not claiming attention always explains a model. They are claiming the earlier paper did not show it never can.

Their move is methodological. Whether attention is explanation, they argue, depends on how you define explanation and on whether your experiment actually controls for the rest of the model. To make that concrete they propose four tests for use on RNN models: a uniform-weights baseline that shows what 'no attention signal' looks like, a variance calibration across multiple random seeds that separates signal from run-to-run noise, a diagnostic that uses frozen weights from pretrained models, and an end-to-end adversarial attention training protocol that asks whether a different attention distribution could produce the same prediction. The headline result is that even when reliable adversarial distributions can be found, the authors report that those distributions do not perform well on the simple diagnostic. In their reading, that means prior work did not disprove attention's usefulness for explainability.
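To make the first of those tests concrete, here is a minimal Python sketch of what replacing learned attention with uniform weights does at the pooling step. It is an illustration under stated assumptions, not the authors' code: the function names, the toy shapes, and the random inputs are all hypothetical. The test itself compares task performance between a model with learned attention and one with uniform weights, which this sketch does not attempt.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def uniform_weights(num_tokens):
    """The 'no attention signal' baseline: every token gets weight 1/T."""
    return np.full(num_tokens, 1.0 / num_tokens)

def attention_pool(hidden, weights):
    """Weighted sum of per-token states: shapes (T, D) and (T,) give (D,)."""
    return weights @ hidden

# Hypothetical toy inputs: 6 tokens, hidden size 4.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))   # stand-in for RNN encoder states
scores = rng.normal(size=6)        # stand-in for learned attention scores

learned = softmax(scores)
uniform = uniform_weights(6)

context_learned = attention_pool(hidden, learned)
context_uniform = attention_pool(hidden, uniform)  # equals the mean hidden state
```

With uniform weights the context vector collapses to a plain average of the hidden states. If a model built that way scores about as well on a task as the learned-attention model, attention is contributing little there, and that task is a weak place to argue about what its weights explain.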

Why this matters for anyone shipping models: 'the model attended to these tokens' is one of the most common explanations surfaced in product dashboards and audit reports. If that claim is doing real work in a regulated decision, it should survive the kind of baselines and seed-variance checks this paper spells out, not just a screenshot of a heatmap.
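One way to operationalize a seed-variance check is sketched below in Python. The assumptions are loud ones: Jensen-Shannon divergence is used as the distance between attention distributions, the maximum over seeds is used as the noise floor, and every name and number in the example is hypothetical. The idea is only that a difference between two attention maps is worth discussing when it is larger than what retraining with a different random seed already produces.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps guards log(0); zero-probability entries contribute nothing.
        return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def seed_noise_floor(base, seed_runs):
    """Largest divergence from the base run seen across reseeded runs."""
    return max(js_divergence(base, run) for run in seed_runs)

def exceeds_seed_noise(candidate, base, seed_runs):
    """True if the candidate differs from base by more than seed noise does."""
    return js_divergence(candidate, base) > seed_noise_floor(base, seed_runs)

# Hypothetical attention over three tokens for a single input.
base = [0.70, 0.20, 0.10]                     # reference training run
seed_runs = [[0.65, 0.25, 0.10],              # same model, different seeds
             [0.72, 0.18, 0.10]]
near_copy = [0.69, 0.21, 0.10]                # within run-to-run variation
adversarial = [0.10, 0.20, 0.70]              # deliberately different map

is_near_copy_signal = exceeds_seed_noise(near_copy, base, seed_runs)
is_adversarial_signal = exceeds_seed_noise(adversarial, base, seed_runs)
```

In this toy case only the adversarial map clears the noise floor. A real check would aggregate over many inputs and also track how much the model's predictions move, since a large attention shift with an unchanged prediction is exactly the case the debate is about.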

The honest caveat is that the paper is scoped to RNN-era architectures and to a specific definition of explanation, and it is a rebuttal, not a closing argument. What the abstract does not give you is a clean answer on transformers, on which tasks adversarial attention reliably fails the diagnostic, or on a definition of 'faithful' explanation that both sides accept. The contribution that holds up is the test battery. Treat it as the minimum bar interpretability work should clear before anyone calls an attention map an explanation.
