academic.oup.com via Reddit

GPT-4o, Claude 3.5 Fail Stroop Attention Test at Scale

Key insights

  • GPT-4o's incongruent color-naming accuracy collapses from 91% at 5 words to 15% at 40 words on the Stroop task.
  • Neither model showed statistically significant trial-to-trial adaptive control, the executive mechanism humans use to resolve sustained conflict.
  • Word-reading accuracy held at 99-100% across all list lengths, showing models retain fluency while losing interference suppression.

Why this matters

Transformer models are routinely deployed in long-context workflows where sustained interference management is exactly what is required, yet the study finds GPT-4o's incongruent accuracy hits 1% in mixed 40-word conditions. Current capability benchmarks test models on short, isolated prompts and are structurally blind to this degradation curve, meaning production deployments have no standard eval to surface the failure before it occurs. Authors Patel, Wang, and Fan characterize the gap as architectural rather than a data or scale issue, which frames the problem as unlikely to resolve through the next generation of scaling alone.

Summary

GPT-4o and Claude 3.5 Sonnet break down on extended Stroop lists, per a PNAS Nexus study, while word-reading accuracy stays intact throughout. GPT-4o drops from 91% to 15% on incongruent items between 5 and 40 words. Claude 3.5 Sonnet holds at 76% through 20 words before dropping to 24% at 40. In mixed conditions at 40 words, GPT-4o reaches 1% incongruent accuracy. In short, both models retain linguistic fluency but lose conflict suppression at scale.

  • Neither model showed the trial-to-trial adaptive control that humans use after encountering interference.
  • Word-reading held at 99-100% across all lengths, isolating the failure specifically to conflict suppression.
  • Authors Patel, Wang, and Fan frame the deficit as architectural, not a scaling problem.

Benchmarks testing models on short prompts miss degradation that already exists under extended cognitive load.
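For readers unfamiliar with trial-to-trial adaptive control: in human Stroop research it is commonly measured as the congruency sequence effect, where performance on an incongruent item improves when the previous item was also incongruent. The sketch below shows one way to compute that split from a mixed list. Whether the study used this exact measure is an assumption; the function name, the (word, ink) data layout, and the exact-match scoring are all hypothetical choices made for illustration.

```python
def sequence_effect(items, answers):
    """Accuracy on incongruent items, split by the type of the previous item.

    items   -- list of (word, ink) pairs; an item is incongruent when word != ink
    answers -- list of named colors, one per item (assumed layout)
    """
    buckets = {"after_congruent": [], "after_incongruent": []}
    for i in range(1, len(items)):
        word, ink = items[i]
        if word == ink:
            continue  # only incongruent items are scored
        prev_word, prev_ink = items[i - 1]
        key = "after_congruent" if prev_word == prev_ink else "after_incongruent"
        # A missing answer counts as an error.
        answer = answers[i] if i < len(answers) else ""
        buckets[key].append(answer.strip().lower() == ink)
    # None marks a bucket with no scored items.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

A system with human-like adaptive control would be expected to score higher in the after_incongruent bucket than in the after_congruent bucket. Near-equal values across many lists would be consistent with the absence of adaptation the study reports.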

Potential risks and opportunities

Risks

  • Enterprises using GPT-4o or Claude 3.5 Sonnet for extended legal review, compliance auditing, or multi-step reasoning may face undetected accuracy failures as task lists grow beyond 20 items.
  • Benchmark providers such as MLCommons and the Hugging Face Open LLM Leaderboard risk certifying models as capable in domains where this failure mode already exists, eroding trust in AI evals as a quality signal.
  • AI labs scaling transformer architectures may face compounding reliability issues in agentic, multi-turn workflows if the executive-control deficit is architectural rather than fixable through additional data or compute.

Opportunities

  • Cognitive architecture researchers at labs working on neuro-symbolic or executive-control approaches gain a concrete, reproducible benchmark to validate alternative designs against transformer baselines.
  • Eval vendors such as Scale AI could differentiate offerings by adding extended-list Stroop-style interference tests to model assessment suites that currently miss this failure mode.
  • Enterprise AI buyers gain a practical pre-deployment screen for long-context reliability using a Stroop-style 40-item eval that current provider benchmarks do not include.
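
A pre-deployment screen of the kind described above could be sketched roughly as follows. This is a minimal illustration, not the study's protocol: the color set, the text-only prompt wording, and the function names are assumptions, and the call to the model under test is deliberately left out.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]  # assumed color set

def make_stroop_list(n_items, incongruent_ratio=1.0, seed=0):
    """Build a list of (word, ink) pairs; incongruent means word != ink."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < incongruent_ratio:
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            word = ink
        items.append((word, ink))
    return items

def render_prompt(items):
    """Text-only rendering of the list (hypothetical wording)."""
    lines = [
        f'{i + 1}. the word "{word.upper()}" printed in {ink} ink'
        for i, (word, ink) in enumerate(items)
    ]
    return "Name the ink color of each item, one per line:\n" + "\n".join(lines)

def incongruent_accuracy(items, answers):
    """Fraction of incongruent items whose ink color was named correctly."""
    # Pad so that missing answers count as errors instead of being dropped.
    padded = list(answers) + [""] * (len(items) - len(answers))
    scored = [(ink, ans) for (word, ink), ans in zip(items, padded) if word != ink]
    if not scored:
        return float("nan")
    return sum(ans.strip().lower() == ink for ink, ans in scored) / len(scored)
```

In use, one would generate lists at several lengths (for example 5, 10, 20, and 40 items), send render_prompt(items) to the model, split the reply into one answer per line, and compare incongruent_accuracy across lengths. A steep drop with length would mirror the degradation curve the study reports.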

What we don't know yet

  • Full numeric results for the supplementary models GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro are not reported in the primary findings.
  • Whether the same degradation pattern appears in real-world deployed long-context tasks beyond controlled laboratory Stroop conditions.
  • Whether any targeted architectural intervention, such as explicit conflict-monitoring layers, can recover adaptive control without full model retraining.