ML Paper Argues Interpretability Should Count as Evaluation

TL;DR

  • The paper argues interpretability methods that are falsifiable, reproducible, and predictive can serve as model evaluation, not just diagnostics.
  • Of four methods assessed in Table 1, attention mechanisms fail all three criteria; sparse autoencoders fail reproducibility.
  • An SAE refusal-detection feature trained on chat data failed to generalize when the target model received webtext input instead.

Benchmark scores and win rates are how we typically decide whether a model is ready, but a paper on arXiv by Isabelle Lee, Emmy Liu, Cathy Jiao, Brihi Joshi, Dani Yogatama, Fazl Barez, and Michael Saxon argues that something is missing from that picture. Current evaluation, the authors write, relies on "behavioral snapshots, with benchmark accuracies, win rates and outcome-based metrics," but two models may achieve identical behavior while relying on radically different internal mechanisms. Understanding why a model produces a behavior, they contend, "can be as important as measuring what it produces."

The paper's proposal is that interpretability, done rigorously, should count as model evaluation in its own right: not a post-hoc explanation layer, but a principled assessment tool. For that to hold, interpretability methods must meet three scientific standards: their claims must be falsifiable, reproducible, and predictive. Each standard maps to an evaluative function. Falsifiability enables debugging by identifying root causes of unwanted behavior. Reproducibility makes it possible to detect subtle failures that output metrics miss, such as a model that assumes masculine gender for "El doctor" based on profession correlations in its training data. Predictivity allows teams to anticipate failures before they occur.

Against those three criteria, the paper evaluates four methods: sparse autoencoders, concept bottleneck models, attention mechanisms, and probing. The scorecard is sobering. Attention mechanisms fail all three. Sparse autoencoders fail reproducibility: a guardrail "refusal" feature identified from activations on chat-formatted data failed to generalize when the target model was given webtext input instead. Concept bottleneck models pass falsifiability but fail predictivity. Only probing consistently passes reproducibility.
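The reproducibility failure described above can be pictured as a simple cross-distribution check: calibrate a detector on one input format, then score it on another. The sketch below is only an illustration of that idea, not the paper's procedure. The activation values are hand-made toy numbers rather than results from the paper, and the function names (fit_threshold, accuracy) are hypothetical.

```python
from typing import Sequence


def fit_threshold(activations: Sequence[float], labels: Sequence[int]) -> float:
    """Midpoint between the mean activation of positive and negative examples."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(activations: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of examples where (activation > threshold) matches the label."""
    preds = [1 if a > threshold else 0 for a in activations]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy activations of a single hypothetical "refusal" feature.
# Label 1 = the model refuses, 0 = it does not. Values are invented for illustration.
chat_acts = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15]   # chat-formatted inputs
chat_labels = [1, 1, 1, 0, 0, 0]
web_acts = [0.3, 0.6, 0.2, 0.4, 0.7, 0.1]      # webtext inputs
web_labels = [1, 1, 1, 0, 0, 0]

# Calibrate on chat data only, then evaluate on both distributions.
threshold = fit_threshold(chat_acts, chat_labels)
chat_acc = accuracy(chat_acts, chat_labels, threshold)
web_acc = accuracy(web_acts, web_labels, threshold)

print(f"threshold={threshold:.2f} chat_acc={chat_acc:.2f} web_acc={web_acc:.2f}")
```

In this toy setup the feature separates the chat examples perfectly but is no better than chance on the webtext examples, which is the kind of gap a reproducibility check is meant to surface.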

The honest conclusion is that the authors are describing a destination the field has not reached. The paper offers no roadmap for getting there, only the argument that the destination is worth pursuing. The groups most likely to benefit if the field responds are those building in safety-critical domains, where knowing why a model succeeded matters as much as knowing that it did.