Jain and Wallace: Attention Weights Don't Explain Predictions

TL;DR

  • Sarthak Jain and Byron C. Wallace argue that standard attention modules in NLP do not provide meaningful explanations and should not be treated as though they do.
  • Across a variety of NLP tasks, learned attention weights were frequently uncorrelated with gradient-based measures of feature importance.
  • The authors could construct very different attention distributions that nonetheless yielded equivalent model predictions.

'Attention is not Explanation', a 2019 NAACL paper from Sarthak Jain and Byron C. Wallace, quietly reshaped how careful people in NLP talk about model interpretability. It remains the right starting point any time someone reaches for an attention heatmap as proof of what a model 'looked at'.

The setup is simple. Attention mechanisms in neural NLP models produce a distribution over input tokens, and that distribution is often presented, at least implicitly, as telling you which inputs mattered to the prediction. Jain and Wallace ran extensive experiments across a variety of NLP tasks to test whether that interpretation actually holds. Their headline finding, in their own words, is that learned attention weights are 'frequently uncorrelated with gradient-based measures of feature importance', and that you can identify 'very different attention distributions that nonetheless yield equivalent predictions'. If two different attention maps produce the same answer, neither map can honestly be called the explanation for that answer.
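
To make the first of those checks concrete, here is a minimal sketch in plain Python. It is not the authors' code or experimental setup: the tiny model (`toy_model`), its random weights, the finite-difference gradient, and the choice of Kendall's tau as the rank correlation are all assumptions made for illustration. It compares each token's attention weight with a gradient-based importance score and reports how well the two rankings agree.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(tokens, W, q, w):
    """Hypothetical tiny attention model: returns (prediction, attention weights)."""
    # Encode each token vector with a tanh layer.
    hidden = [[math.tanh(sum(row[j] * x[j] for j in range(len(x)))) for row in W]
              for x in tokens]
    # Dot-product attention scores against a fixed query vector q.
    alpha = softmax([sum(qk * hk for qk, hk in zip(q, h)) for h in hidden])
    # Attention-weighted sum of hidden states, then a sigmoid output.
    context = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(len(w))]
    logit = sum(wk * ck for wk, ck in zip(w, context))
    return 1.0 / (1.0 + math.exp(-logit)), alpha


def gradient_importance(tokens, W, q, w, eps=1e-5):
    """Per-token importance: L1 norm of a central-difference gradient of the output."""
    scores = []
    for i, x in enumerate(tokens):
        total = 0.0
        for j in range(len(x)):
            plus = [list(t) for t in tokens]
            minus = [list(t) for t in tokens]
            plus[i][j] += eps
            minus[i][j] -= eps
            y_plus, _ = toy_model(plus, W, q, w)
            y_minus, _ = toy_model(minus, W, q, w)
            total += abs(y_plus - y_minus) / (2 * eps)
        scores.append(total)
    return scores


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Random toy inputs and weights (illustrative only, not real data).
rng = random.Random(0)
T, d_in, d_h = 6, 3, 4
tokens = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_h)]
q = [rng.gauss(0, 1) for _ in range(d_h)]
w = [rng.gauss(0, 1) for _ in range(d_h)]

prediction, attention = toy_model(tokens, W, q, w)
importance = gradient_importance(tokens, W, q, w)
tau = kendall_tau(attention, importance)
print(f"prediction={prediction:.3f}  kendall_tau={tau:.3f}")
```

A tau near 1 would mean the attention ranking and the gradient ranking agree; a value near 0 is the 'uncorrelated' case the paper describes. A single random toy instance proves nothing on its own; the point is only to show what is being compared.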

The practical reason this matters: attention visualizations are still a default UI for showing users what a model is doing, especially in domains where transparency matters legally or ethically. If the heatmap a clinician or auditor is looking at can be swapped for a very different heatmap without changing the model's output, then the heatmap is decorative, not evidentiary. The paper's caution against treating attention 'as though' it provides meaningful explanations sets a floor for what counts as a real interpretability claim.

The honest caveat is that the paper was written before transformer-era models dominated the field, and it tested attention as it existed in the NLP work of its moment. What the abstract does not tell you is how the result extends to multi-head attention in modern large language models, or which alternative interpretability methods clear the bar the authors imply. Those are open questions that follow-up work has been chewing on ever since.

The value of the original is the discipline it imposes. If you are shipping an attention view to users and calling it an explanation, the burden is on you to show it survives the kind of swap-test Jain and Wallace ran.
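
One hedged way to picture such a swap test, again in plain Python: hold the encoded tokens fixed, shuffle the attention weights across positions, and measure how far the prediction moves. This is a sketch in the spirit of the idea, not the authors' protocol. The function names, the toy `hidden` vectors, the output weights `w`, and the number of permutations are all hypothetical.

```python
import math
import random
import statistics


def predict_with_attention(hidden, alpha, w):
    """Toy output layer: attention-weighted sum of hidden vectors, then a sigmoid."""
    context = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(len(w))]
    logit = sum(wk * ck for wk, ck in zip(w, context))
    return 1.0 / (1.0 + math.exp(-logit))


def permutation_swap_test(hidden, alpha, w, n_permutations=100, seed=0):
    """Shuffle attention weights across tokens and record the change in output.

    Returns (original prediction, median absolute change, maximum absolute change).
    """
    rng = random.Random(seed)
    base = predict_with_attention(hidden, alpha, w)
    deltas = []
    for _ in range(n_permutations):
        shuffled = list(alpha)
        rng.shuffle(shuffled)
        deltas.append(abs(predict_with_attention(hidden, shuffled, w) - base))
    return base, statistics.median(deltas), max(deltas)


# Illustrative values only: four encoded tokens and one attention distribution.
hidden = [[0.9, 0.1], [0.8, 0.2], [-0.7, 0.4], [0.85, 0.15]]
alpha = [0.6, 0.2, 0.1, 0.1]
w = [2.0, -1.0]

base, median_delta, max_delta = permutation_swap_test(hidden, alpha, w)
print(f"base={base:.3f}  median_change={median_delta:.3f}  max_change={max_delta:.3f}")
```

If the median change stays close to zero, very different heatmaps are producing essentially the same prediction, which is the situation in which the heatmap should not be presented as an explanation. What counts as 'close to zero' is a judgment call this sketch does not settle.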
