
'Introspective coupling': LM rationales track live behavior

TL;DR

  • LMs trained on fixed counterfactual explanations from earlier checkpoints often produce rationales more faithful to current behavior than to training targets.
  • The effect, dubbed 'introspective coupling,' appears across sycophancy and refusal tasks and is reported to be robust to label noise.
  • When explanation training runs alongside other post-training objectives, rationales track behavioral shifts without any updated supervision.

A new arXiv paper by Zifan Carl Guo, Laura Ruis, Jacob Andreas, and Belinda Z. Li reports an unusual result. When you train a language model to explain its own decisions using a fixed dataset of counterfactual explanations, including explanations lifted from an earlier checkpoint of itself or from a behaviorally similar model in a different family, the resulting explanations are often more faithful to the model's current behavior than to the training targets they were supervised on.

The authors call this 'introspective coupling.' The mechanism they propose is that the training explanations stay sufficiently correlated with the model's actual behavior as that behavior shifts during training, and this correlation is enough for the rationales to keep tracking it. They report the effect across multiple tasks, including sycophancy and refusal, and say it holds under label noise. They also show that when explanation training runs concurrently with other post-training objectives, the explanations track the resulting behavioral shifts without any updated supervision.
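To make the comparison concrete, here is a minimal sketch of how one might score whether rationales sit closer to current behavior or to the frozen targets. It is not the paper's metric: the `Record` fields, the `agreement` and `coupling_gap` helpers, and the toy data are all hypothetical, and it assumes each counterfactual explanation can be reduced to a single predicted behavior label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One counterfactual probe. All field names are hypothetical."""
    explained: str  # behavior the model's rationale claims it would show
    current: str    # behavior the current model actually shows on the probe
    target: str     # behavior claimed by the frozen training explanation


def agreement(records, field):
    """Fraction of records whose rationale matches the given reference field."""
    if not records:
        raise ValueError("need at least one record")
    hits = sum(r.explained == getattr(r, field) for r in records)
    return hits / len(records)


def coupling_gap(records):
    """Agreement with current behavior minus agreement with frozen targets.

    A positive value means the rationales sit closer to what the model
    does now than to what they were supervised on (an assumed proxy only).
    """
    return agreement(records, "current") - agreement(records, "target")


# Made-up toy data for illustration, not results from the paper.
records = [
    Record(explained="refuse", current="refuse", target="comply"),
    Record(explained="refuse", current="refuse", target="refuse"),
    Record(explained="comply", current="comply", target="refuse"),
    Record(explained="comply", current="refuse", target="comply"),
]

print(agreement(records, "current"))  # 0.75
print(agreement(records, "target"))   # 0.5
print(coupling_gap(records))          # 0.25
```

A positive gap on a proxy like this only says the rationales move with the behavior; it does not show that they are correct.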

This matters if you rely on model-generated rationales as evidence about a system's actual behavior. A lot of interpretability and auditing work assumes the failure mode to worry about is 'the model just imitates the fine-tuning target and lies about its real reasoning.' The paper describes a different regime, where rationales quietly track the model even when the supervision is frozen. That is both a hopeful result, since the authors argue that fixed datasets of counterfactual explanations can provide a 'scalable and generalizable post-training signal for introspection,' and a caution, because an explanation that drifts with the behavior is not necessarily a correct one.

The honest caveat is that this is a preprint tested on a narrow set of tasks. What the reporting doesn't give you is a stress test against sharp behavioral shifts, adversarial probes, or high-stakes reasoning; sycophancy and refusal are useful setups but not a stand-in for those. For anyone building on self-explanations as an alignment or auditing signal, the mechanism the paper describes is worth reading carefully, and worth testing on your own model before trusting it.
