
Reasoning Models Decide Safety Before Any Thinking Begins

TL;DR

  • A classifier on the first token's hidden state predicts refusal/compliance at 0.84–0.95 AUROC, before any visible thinking occurs.
  • Final safety outcomes rarely change after the first ~20% of thinking tokens across tested open-weight model families.
  • Current safety interventions shift models toward over-refusal while suppressing already-scarce genuine deliberation signals.

A core promise of extended-thinking reasoning models is that the visible deliberation improves safety: the model works through the request before deciding whether to comply. A new paper, "Do Thinking Tokens Help with Safety?" by Narutatsu Ri, Abhishek Panigrahi, and Sanjeev Arora, tests that assumption directly and finds it largely does not hold.

The researchers trained a small classifier on the hidden representation of the very first generated token, before any reasoning text is produced. It predicted whether a model would ultimately refuse or comply, reaching AUROC scores between 0.84 and 0.95 and balanced accuracy of around 88%. This held across frontier open-weight models from the GPT-OSS, Qwen, Olmo, and Phi families. The outcome is effectively settled before the thinking starts.
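To make the probing setup concrete, here is a minimal sketch of how such a classifier and its two metrics could be computed. Everything in it is an assumption for illustration: the "hidden states" are synthetic Gaussian vectors rather than real model activations, the probe is a plain logistic regression (the paper is described only as using "a small classifier"), and the function names are hypothetical. It is not the authors' code and does not reproduce their numbers.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def balanced_accuracy(pred, labels):
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = (pred[labels == 1] == 1).mean()
    tnr = (pred[labels == 0] == 0).mean()
    return (tpr + tnr) / 2

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Fit a linear probe with plain gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-in for first-token hidden states (assumption: one linear
# "refusal direction" separates the two classes, plus isotropic noise).
rng = np.random.default_rng(0)
n, d = 2000, 64
y = rng.integers(0, 2, size=n)            # 1 = refuse, 0 = comply
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 1.2 * np.outer(2 * y - 1, direction)

# Hold out the last 500 examples for evaluation.
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

w, b = fit_logistic_probe(X_train, y_train)
test_logits = X_test @ w + b
test_auc = auroc(test_logits, y_test)
test_bacc = balanced_accuracy((test_logits >= 0).astype(int), y_test)
print(f"AUROC={test_auc:.3f}  balanced accuracy={test_bacc:.3f}")
```

AUROC measures how well the probe ranks refusals above compliances regardless of threshold, while balanced accuracy scores one fixed threshold and weights both classes equally, which matters when refusals and compliances are not evenly split.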

The paper is pointed about what is actually happening. The thinking process, the researchers write, is "more akin to prefix completion than to deliberative revision": the model generates text that looks like deliberation, but the outcome distribution has already converged. Around 74% of text-level deliberations occur after the response distribution is already locked to one side, and final outcomes rarely change after the first roughly 20% of thinking tokens.
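One way to picture the "locked to one side" idea is to track a per-position estimate of the refusal probability along a thinking trace and ask how early the predicted outcome stops flipping. The sketch below does that under stated assumptions: the function name `lock_in_fraction`, the 0.5 threshold, and the example trace are all hypothetical, and the paper may define lock-in differently.

```python
from typing import Sequence

def lock_in_fraction(p_refuse: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of the trace consumed before the predicted outcome stops changing.

    p_refuse[i] is an assumed estimate of the refusal probability after
    thinking token i. The outcome counts as locked once every later position
    falls on the same side of the threshold as the final position.
    """
    if not p_refuse:
        raise ValueError("trace must contain at least one position")
    final_side = p_refuse[-1] >= threshold
    lock_index = 0
    for i, p in enumerate(p_refuse):
        if (p >= threshold) != final_side:
            # The prediction still disagreed with the final outcome here,
            # so lock-in cannot happen before the next position.
            lock_index = i + 1
    return lock_index / len(p_refuse)

# Made-up trace: the estimate wavers twice early on, then stays on "refuse".
trace = [0.4, 0.6, 0.3, 0.7, 0.8, 0.9, 0.9, 0.95, 0.9, 0.97]
print(lock_in_fraction(trace))
```

A trace that never crosses the threshold returns 0.0, meaning the outcome was already decided at the first position, which is the pattern the article describes as typical.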

Safety interventions add a further complication. According to the paper, existing inference-time and training-based approaches designed to induce deliberation largely shift models toward over-refusal while suppressing the already-scarce genuine deliberation signals. One caveat: the study covers open-weight models only, and the paper does not address whether the findings generalize to closed frontier systems. It also leaves open what training approach would actually produce genuine deliberation.

The constructive thread is that the paper clarifies where the real target is: if safety behavior is encoded in first-token hidden representations, then interpretability and intervention research can aim at that layer directly, rather than at the generated deliberation text. The authors close by calling for methods that induce "real safety deliberation" — a more precise research agenda than the field had before.