Isabelle Lee

Benchmarks can be superficial, but model explanations and evaluations are fundamentally intertwined. What if we used interpretability as principled, scientific evaluation? If it met scientific standards? arxiv.org/abs/2605.05508 coming to EvalEval at ACL as oral 🧵 1/6

Rigorous Interpretation Is a Form of Evaluation arxiv.org

AI Weekly's analysis →

The paper argues that interpretability methods meeting three criteria (falsifiability, reproducibility, and predictiveness) can serve as model evaluation, not just as diagnostics.
Of four methods assessed in Table 1, attention mechanisms fail all three criteria; sparse autoencoders fail reproducibility.
An SAE refusal-detection feature trained on chat data failed to generalize when the target model received webtext input instead.


