arXiv paper finds token-level signatures of reasoning failure
TL;DR
- A new arXiv paper splits LLM reasoning failures into two distinct modes: committed failure and persistent uncertainty.
- The framework's falsifiable predictions held in 20 of 23 model-dataset configurations the authors tested.
- The authors propose using token-level signals to decide when self-consistency sampling can be skipped, cutting inference cost.
A new arXiv preprint argues that when a language model gets a reasoning problem wrong, the failure often isn't confined to the final step. In one of the two modes the paper describes, the model fails early, and the failure leaves a token-level fingerprint you can read off the trace. The paper, "How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures" by Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, and Mykel J. Kochenderfer, splits reasoning breakdowns into two distinct modes and asks which signals in the generation actually identify them.
The first mode the authors call committed failure: the model "locks onto an incorrect reasoning path early in its trace." Their diagnostic for it is what they name a commitment point, "beyond which considering additional tokens hurt rather than help failure detection." The second mode is persistent uncertainty, where the signal accumulates across the whole trace and you need the full thing to tell what is going wrong. They report the framework's falsifiable predictions holding in 20 of 23 model-dataset configurations they tested, which they describe as well above chance.
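One way to picture the commitment point is as the prefix length at which a failure detector performs best. The sketch below is illustrative only and is not the authors' procedure: it assumes you already have a hypothetical list of detection scores (for example, a detector's AUROC when it sees only the first k chunks of each trace), and the names `commitment_point`, `failure_mode`, and the `tolerance` parameter are invented for this example.

```python
from typing import Sequence


def commitment_point(detection_scores: Sequence[float]) -> int:
    """Return the 1-based prefix length at which detection quality peaks.

    detection_scores[i] is an assumed, precomputed score for a failure
    detector that only sees the first i+1 chunks of the trace.
    """
    if not detection_scores:
        raise ValueError("need at least one detection score")
    # max() returns the first index on ties, i.e. the earliest peak.
    best_index = max(range(len(detection_scores)),
                     key=lambda i: detection_scores[i])
    return best_index + 1


def failure_mode(detection_scores: Sequence[float],
                 tolerance: float = 0.01) -> str:
    """Label a score curve as 'committed' or 'persistent' (hypothetical rule).

    If reading the full trace scores meaningfully worse than the peak,
    extra tokens hurt detection, which matches the committed-failure
    description. Otherwise the signal keeps accumulating to the end.
    """
    peak = detection_scores[commitment_point(detection_scores) - 1]
    final = detection_scores[-1]
    return "committed" if peak - final > tolerance else "persistent"


if __name__ == "__main__":
    early_peak = [0.60, 0.72, 0.81, 0.78, 0.70]   # made-up numbers
    late_peak = [0.55, 0.60, 0.66, 0.71, 0.74]    # made-up numbers
    print(commitment_point(early_peak), failure_mode(early_peak))
    print(commitment_point(late_peak), failure_mode(late_peak))
```

With the made-up curves above, the first peaks at the third prefix and then degrades, while the second keeps improving through the full trace. How the paper actually locates the commitment point is not described in the abstract.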
The practical hook is self-consistency, the standard trick of sampling several completions and majority-voting them, which is reliable but expensive on long reasoning traces. The authors claim their framework can identify "when uncertainty signals complement it and when it can be selectively skipped." If that survives contact with workloads outside their test set, it's a route to cheaper inference on reasoning tasks, because operators would only pay for the extra samples when the token-level signature says they actually need them.
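A rough sketch of what "selectively skipping" self-consistency could look like in practice follows. It assumes a hypothetical `sample` callable that returns one answer plus a scalar uncertainty score derived from token-level signals; the threshold, the sample count, and the gating rule are assumptions made for illustration, not values or logic taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Plain self-consistency: return the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def answer_with_selective_sc(
    sample: Callable[[], Tuple[str, float]],
    threshold: float = 0.5,   # assumed cutoff, would need tuning
    n_samples: int = 5,       # assumed self-consistency budget
) -> Tuple[str, int]:
    """Return (answer, number_of_samples_drawn).

    Draw one completion first. If its uncertainty score is below the
    threshold, keep it and skip the extra samples. Otherwise fall back
    to majority voting over n_samples completions.
    """
    first_answer, uncertainty = sample()
    if uncertainty < threshold:
        return first_answer, 1
    answers = [first_answer]
    answers += [sample()[0] for _ in range(n_samples - 1)]
    return majority_vote(answers), n_samples
```

The cost saving comes entirely from the early return: every trace that clears the gate costs one sample instead of `n_samples`. Whether a token-level score is trustworthy enough to gate on is exactly what the paper's framework is meant to answer.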
The caveat is what the arXiv abstract leaves out. It doesn't name the 23 configurations, so we can't see how the result distributes across model families, sizes, or task types, and three misses out of 23 leave room for category-specific blind spots in the framework itself. It also doesn't say whether spotting a committed failure at the commitment point gives an operator any path to recover the trace, or only a signal to throw it away.
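For a rough sense of scale on the 20-of-23 figure, here is a back-of-the-envelope binomial calculation. It rests on two assumptions that the abstract does not confirm: that each configuration is an independent trial, and that "chance" means a 50% hit rate per configuration. The authors may define chance differently.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Probability of at least `successes` hits in `trials` independent
    trials, each succeeding with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Assumed 50% chance rate; 20 or more hits out of 23.
    print(f"{binomial_tail(20, 23):.6f}")  # prints 0.000244
```

Under those assumptions the tail probability is exactly 1/4096, about 0.02%. That says the headline count is unlikely to be luck, but nothing about which three configurations missed or why.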
If the signatures generalize, the people who benefit first are teams running large reasoning workloads where self-consistency is the main inference cost line, and evaluation tooling vendors who finally get a trace-level diagnostic to plug into their harnesses.
Originally reported by arxiv.org
Original headline: How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures