Audit finds no LLM satisfies all four thought-axiom tests
TL;DR
- The paper tests latent thought representations against four formal axioms (Causality, Minimality, Separability, Stability) independently of downstream benchmark scores.
- Across Llama-3.1 8B, Llama-3.3 70B, DeepSeek-R1-Distill-Qwen-32B, Skywork-OR1-32B and GPT-OSS-20B, no candidate satisfies all four axioms simultaneously.
- Representations distinguish task type reliably but cannot tell apart two questions within the same task, encoding little beyond the input embedding.
A new audit of 'thinking' inside large language models should make you suspicious of the leaderboard story. Fahd Seddik and Fatemeh Fard propose four formal axioms (Causality, Minimality, Separability and Stability) that a latent thought representation ought to satisfy if it is really doing reasoning work, and then test five open-weight models against them. None of the five satisfies all four.
The setup matters because the axioms are deliberately defined to be benchmark-independent. Rather than scoring chain-of-thought outputs on a reasoning task, the metrics are computed on the representation itself and ask four questions. Does the internal 'thought' actually substitute for the prefix tokens in the computational graph (Causality)? Does it compress the input while retaining what matters (Minimality)? Can a bounded projection separate semantically different outputs (Separability)? And does it encode a distribution over semantically equivalent answers rather than collapsing to one (Stability)? Two finer-grained findings sharpen the picture. The representations distinguish task type reliably but cannot tell apart two questions within the same task, and they 'encode little information beyond what is already present in the input embedding.'
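To make the task-versus-question finding concrete, here is a minimal sketch on synthetic vectors. It is not the paper's metric. The nearest-centroid probe, the function names (`make_synthetic`, `run_demo`) and the `question_scale` parameter are all hypothetical choices made for illustration. With `question_scale=0.0` each vector carries a task signal but no question-specific signal, which is the pattern the article describes: a probe recovers the task easily, while identifying the question within a task stays near chance.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit one centroid per label on the train split, classify test points by nearest centroid."""
    labels = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in labels])
    # Squared Euclidean distance from every test point to every centroid.
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = labels[dists.argmin(axis=1)]
    return float((pred == test_y).mean())


def make_synthetic(n_tasks=4, n_questions=20, n_samples=10, dim=32,
                   question_scale=0.0, seed=0):
    """Hypothetical stand-in for 'thought' vectors: task signal + optional question signal + noise."""
    rng = np.random.default_rng(seed)
    task_dirs = rng.normal(size=(n_tasks, dim)) * 3.0
    question_dirs = rng.normal(size=(n_tasks, n_questions, dim)) * question_scale
    xs, tasks, questions = [], [], []
    for t in range(n_tasks):
        for q in range(n_questions):
            noise = rng.normal(size=(n_samples, dim))
            xs.append(task_dirs[t] + question_dirs[t, q] + noise)
            tasks += [t] * n_samples
            questions += [t * n_questions + q] * n_samples
    return np.concatenate(xs), np.array(tasks), np.array(questions)


def run_demo(question_scale=0.0):
    """Return (task-level accuracy, mean within-task question-level accuracy)."""
    x, tasks, questions = make_synthetic(question_scale=question_scale)
    train = np.arange(len(x)) % 2 == 0  # even rows train, odd rows test
    task_acc = nearest_centroid_accuracy(x[train], tasks[train], x[~train], tasks[~train])
    per_task = []
    for t in np.unique(tasks):
        m = tasks == t  # restrict the probe to questions from a single task
        per_task.append(nearest_centroid_accuracy(
            x[m & train], questions[m & train], x[m & ~train], questions[m & ~train]))
    return task_acc, float(np.mean(per_task))


if __name__ == "__main__":
    for scale in (0.0, 3.0):
        task_acc, question_acc = run_demo(question_scale=scale)
        print(f"question_scale={scale}: task acc={task_acc:.2f}, "
              f"within-task question acc={question_acc:.2f} (chance=0.05)")
```

Raising `question_scale` injects question-specific signal, and the within-task accuracy climbs accordingly. That contrast is the point of the sketch: a representation can look well separated at the task level while carrying nothing that identifies the individual question.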
The audit spans Llama-3.1 8B, Llama-3.3 70B, DeepSeek-R1-Distill-Qwen-32B, Skywork-OR1-32B and GPT-OSS-20B across 23 reasoning tasks. That mix is the point. Dense base models, a reasoning-distilled variant and RL-trained families all share the same gap, which the authors read as a structural rather than a training-specific problem.
The honest caveat is that this is a single recent paper on open-weight models, so take the universality framing as a working hypothesis. The audit does not reach frontier closed models, and a failed axiom is not the same as a demonstrably wrong answer. The reporting also leaves two questions open: whether labs can train against these axioms without losing benchmark points, and whether axiom failure causally explains specific reasoning errors or is merely correlated with them.
If the framework holds up, the people who benefit fastest are evaluators and red-teamers who want a probe that does not saturate the way standard reasoning benchmarks have, and lab teams who can target the specific axiom their architecture leaks on before the next training run.
Originally reported by paper
Original headline: No LLM Satisfies All Four Axioms of Thought Representation