SOOHAK benchmark exposes AI math reasoning ceiling
Key insights
- No AI model surpasses 50% accuracy at identifying the 99 deliberately unsolvable problems embedded in SOOHAK.
- Scaling compute improves math problem-solving scores but produces no measurable gain in recognizing unsolvable inputs.
- Gemini 2.5 Pro leads on research tasks at 30%, GPT-5 follows at 26%, and open-weight models score below 15%.
Why this matters
Any deployment where an AI model must decide whether a problem is well-formed before solving it — legal reasoning, scientific hypothesis validation, engineering constraint checking — is exposed by this structural gap. The finding that compute scaling decouples from epistemic calibration means the standard industry lever for improving reliability does not fix this class of failure. Teams building agentic systems that hand off math or logic subtasks to models should treat confident outputs on malformed inputs as a live production risk, not a theoretical one.
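One way to act on that advice is to put a well-formedness check in front of the solver, so that a flagged input is rejected instead of answered. The sketch below is illustrative only: `GateResult`, `solve_with_gate`, `toy_find_defect`, and `toy_solve` are hypothetical names, not part of SOOHAK or any vendor API, and the toy checker stands in for whatever independent validator a team actually trusts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    answered: bool          # True only if the solver was actually invoked
    answer: Optional[str]   # solver output, or None when the input was rejected
    reason: Optional[str]   # why the input was rejected, if it was


def solve_with_gate(
    problem: str,
    find_defect: Callable[[str], Optional[str]],
    solve: Callable[[str], str],
) -> GateResult:
    """Check the problem first; only call the solver if no defect is found."""
    defect = find_defect(problem)
    if defect is not None:
        return GateResult(answered=False, answer=None, reason=defect)
    return GateResult(answered=True, answer=solve(problem), reason=None)


# Toy stand-ins (assumptions for illustration). A real checker might be a
# second model, a formal constraint check, or a human review step.
def toy_find_defect(problem: str) -> Optional[str]:
    if "x > 5 and x < 3" in problem:
        return "contradictory premises"
    return None


def toy_solve(problem: str) -> str:
    return "42"


rejected = solve_with_gate("Find x with x > 5 and x < 3", toy_find_defect, toy_solve)
accepted = solve_with_gate("Find x with x > 5", toy_find_defect, toy_solve)
```

The design choice here is that the checker is separate from the solver. Given the benchmark's finding, asking the same model to vet its own input would inherit the weakness the gate is meant to cover.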
Summary
Sixty-four mathematicians built a benchmark designed to break AI math confidence, and it worked. SOOHAK pairs 340 research-level problems with 99 deliberately unsolvable ones, then measures whether models can tell the difference. The results are bruising: Gemini 2.5 Pro leads on legitimate research tasks at 30%, GPT-5 follows at 26%, and open-weight models fall below 15%. On the harder meta-task of flagging flawed problems, no model clears 50% accuracy. Qwen3 collapses to under 3%. GLM-5 performs best at just under 50%.
The structural finding is the one that stings. Scaling compute improves raw problem-solving scores but produces zero measurable improvement in a model's ability to recognize when a question has no valid answer. The capability that scales and the capability that matters for real-world deployment are diverging.
In short, Google, OpenAI, Alibaba (Qwen), and Zhipu (GLM) all ship models that confidently engage with broken inputs.
- Gemini 2.5 Pro leads research tasks at 30%; GPT-5 at 26%; open-weight models below 15%.
- No model exceeds 50% on unsolvable-problem detection; Qwen3 scores below 3%.
- More compute improves solving but shows no gain in knowing when to stop.
The gap between benchmark performance and epistemic reliability is now documented and measurable, and current training objectives offer no clear path to closing it.
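The two numbers reported above measure different things, and it can help to see them computed separately. The following is a minimal sketch, assuming each task is labeled solvable or unsolvable and each model response either gives an answer or flags the problem as flawed. The `Item`, `Response`, and `score` names are hypothetical, and the exact-match grading is a simplification, not SOOHAK's published scoring protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    solvable: bool
    reference: Optional[str]  # expected answer; None for unsolvable items


@dataclass
class Response:
    flagged_unsolvable: bool  # model said the problem has no valid answer
    answer: Optional[str]     # model's answer, if it gave one


def score(items: list[Item], responses: list[Response]) -> dict[str, float]:
    """Report solving accuracy and unsolvable-detection accuracy separately."""
    if len(items) != len(responses):
        raise ValueError("items and responses must be the same length")
    solved = detected = n_solvable = n_unsolvable = 0
    for item, resp in zip(items, responses):
        if item.solvable:
            n_solvable += 1
            # Credit only an unflagged answer that matches the reference.
            if not resp.flagged_unsolvable and resp.answer == item.reference:
                solved += 1
        else:
            n_unsolvable += 1
            # Credit only an explicit flag; any attempted answer is a miss.
            if resp.flagged_unsolvable:
                detected += 1
    return {
        "solve_accuracy": solved / n_solvable if n_solvable else 0.0,
        "detection_accuracy": detected / n_unsolvable if n_unsolvable else 0.0,
    }


# Toy data: two solvable items and two unsolvable ones.
items = [Item(True, "2"), Item(True, "3"), Item(False, None), Item(False, None)]
responses = [
    Response(False, "2"),   # correct solve
    Response(False, "5"),   # wrong solve
    Response(True, None),   # correctly flagged
    Response(False, "7"),   # confident answer to an unsolvable problem
]
result = score(items, responses)
```

Keeping the two scores apart matters because a single blended accuracy over all 439 tasks would let gains on the 340 solvable problems mask a flat result on the 99 unsolvable ones.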
Potential risks and opportunities
Risks
- AI-assisted math tutoring platforms (Photomath, Wolfram Alpha, Khan Academy's Khanmigo) face reputational exposure if students receive confident, detailed solutions to problems with no valid answer.
- Agentic coding and reasoning pipelines that use frontier models for constraint validation — including those at Google DeepMind and OpenAI — may be silently propagating downstream errors from malformed problem inputs.
- Open-weight model deployers (Alibaba Qwen, Zhipu GLM) face credibility pressure from Qwen3's sub-3% score on unsolvable-problem detection, potentially accelerating customer migration to closed-weight alternatives in high-stakes verticals.
Opportunities
- Benchmark-specialized fine-tuning vendors and RLHF shops (Scale AI, Imbue, Cohere) can productize unsolvable-problem rejection as a targeted alignment capability and charge for it as a reliability audit.
- Math-heavy enterprise verticals (quantitative finance, pharma trial design, chip EDA) gain a concrete procurement criterion: SOOHAK unsolvable-detection score as a vendor shortlist filter.
- Academic and independent benchmark builders gain leverage to establish SOOHAK or derivatives as a required evaluation tier, similar to how MMLU became a baseline, pressuring labs to report scores publicly.
What we don't know yet
- Whether any lab has shared internal SOOHAK scores or privately benchmarked against it since the paper's release in May 2026.
- Whether fine-tuning or RLHF specifically targeting unsolvable-problem rejection produces measurable gains, which the paper does not address.
- Which categories among the 99 unsolvable problems (ambiguous constraints, missing conditions, contradictory premises) cause the steepest accuracy drops across model families.
Originally reported by the-decoder.com
Original headline: SOOHAK Benchmark: 64 Mathematicians Build 439-Task Test Including 99 Unsolvable Problems — No AI Model Exceeds 50% at Detecting Flawed Questions