reddit.com via Reddit

DataCurve Corrects GPT-5.5 SWEBench Pro Score to 86.7%

openai research benchmarks coding-ai model-evaluation

Key insights

  • DataCurve found 28.9% of SWEBench Pro test cases are broken or incorrect, skewing results across evaluated models.
  • GPT-5.5's corrected SWEBench Pro score rises to 86.7% after removing failures attributable to flawed benchmark tests.
  • The reliability problem extends to MMLU and ARC-AGI, suggesting systemic benchmark infrastructure issues beyond SWEBench.

Why this matters

Benchmark scores are the primary signal AI practitioners and investors use to compare model capabilities, so a 28.9% broken-test rate in SWEBench Pro undermines the model selection decisions, procurement analyses, and agent infrastructure investments built on those numbers. Teams that relied on SWEBench Pro results to justify tooling or vendor choices now have reason to revisit those decisions using adjusted scores, which may shift competitive rankings. If MMLU and ARC-AGI carry similar flaws, the leaderboard-driven consensus about which models lead the field may be systematically wrong, with direct consequences for anyone allocating engineering resources or budget based on published rankings.

Summary

DataCurve's post-hoc audit of GPT-5.5's SWEBench Pro results reveals a systematic problem: 28.9% of the benchmark's test cases are broken or incorrect, and those flawed tests account for 68.5% of the model's recorded failures. Strip out the bad tests and GPT-5.5's corrected score rises to 86.7%, materially above the published leaderboard number. The finding adds to previously documented reliability failures in MMLU and ARC-AGI, suggesting the problem is infrastructure-wide rather than isolated to one benchmark. In short, OpenAI and DataCurve are at the center of a benchmark credibility crisis that implicates the entire AI eval ecosystem.

  • 28.9% of SWEBench Pro test cases flagged as broken or incorrect by DataCurve's audit
  • GPT-5.5's corrected score of 86.7% materially exceeds its published figure
  • Reliability failures span MMLU and ARC-AGI, not only SWEBench Pro

Published AI rankings may reflect test infrastructure quality as much as actual model capability.
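
The relationship between the Summary's figures can be sketched in a few lines of Python. This is an illustration under a stated assumption, not DataCurve's published method: it assumes "corrected" means that failures on flawed tests are no longer counted as failures while the total number of tasks stays the same. The function names and constants are hypothetical. Under this reading the 28.9% figure does not enter the calculation; an alternative reading that drops flawed tests from the denominator would need pass/fail counts the source does not give.

```python
# Sketch only: one possible reading of how the corrected score is derived.
# ASSUMPTION: failures on flawed tests are reclassified as non-failures and
# the task count is unchanged. DataCurve's actual method may differ.

CORRECTED_SCORE = 0.867        # corrected score quoted in the summary
FLAWED_FAILURE_SHARE = 0.685   # share of recorded failures tied to flawed tests


def _check_fraction(value: float, name: str) -> None:
    """Reject inputs that are not fractions in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")


def corrected_score(published: float, flawed_failure_share: float) -> float:
    """Published score -> corrected score under the assumption above."""
    _check_fraction(published, "published")
    _check_fraction(flawed_failure_share, "flawed_failure_share")
    failure_rate = 1.0 - published
    # Only failures on valid tests still count against the model.
    remaining_failures = failure_rate * (1.0 - flawed_failure_share)
    return 1.0 - remaining_failures


def implied_published_score(corrected: float, flawed_failure_share: float) -> float:
    """Invert corrected_score: which published score would yield `corrected`?"""
    _check_fraction(corrected, "corrected")
    _check_fraction(flawed_failure_share, "flawed_failure_share")
    if flawed_failure_share >= 1.0:
        raise ValueError("cannot invert when every failure is tied to flawed tests")
    failure_rate = (1.0 - corrected) / (1.0 - flawed_failure_share)
    if failure_rate > 1.0:
        raise ValueError("inputs imply a failure rate above 100%")
    return 1.0 - failure_rate


if __name__ == "__main__":
    implied = implied_published_score(CORRECTED_SCORE, FLAWED_FAILURE_SHARE)
    print(f"implied published score: {implied:.1%}")
    print(f"round trip to corrected: {corrected_score(implied, FLAWED_FAILURE_SHARE):.1%}")
```

With the two quoted figures, the implied published score comes out near 58%. That number is a consequence of the assumed formula, not a figure reported by the source, so it is useful mainly as a sanity check on whether the quoted percentages are mutually consistent.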

Potential risks and opportunities

Risks

  • SWEBench Pro maintainers face pressure to validate or reject DataCurve's 28.9% broken-test finding within 30-60 days, with the benchmark's credibility and adoption at stake if they stay silent
  • Enterprise AI buyers who selected GPT-5.5 or competing models for code-generation workloads based on SWEBench Pro rankings face internal challenges if corrected scores shift the competitive order materially
  • Investors and analysts using MMLU, ARC-AGI, and SWEBench Pro as capability proxies for model valuation may be working from corrupted baselines, with no corrected dataset currently available for MMLU or ARC-AGI

Opportunities

  • Benchmark auditing and eval integrity services (Scale AI, Confident AI, and emerging evals-focused startups) can position corrected-scoring offerings as a billable service for enterprise AI procurement teams
  • Models with strong performance on code tasks that have avoided SWEBench Pro top-line comparisons gain a credible opening to re-enter sales cycles using adjusted benchmark narratives
  • Open-source benchmark maintainers (EleutherAI, HuggingFace) have a clear opening to launch audited, integrity-verified coding benchmarks as alternatives to contested leaderboards, accelerating adoption among practitioners who need trustworthy evals

What we don't know yet

  • Whether OpenAI or SWEBench Pro maintainers have acknowledged the flawed test cases and committed to a corrected benchmark release
  • Whether DataCurve's methodology for classifying test cases as 'broken' holds up under independent review; it has not been publicly peer-reviewed, so the 28.9% figure remains unverified
  • The extent to which other top-ranked SWEBench Pro models (Claude, Gemini) would see comparably adjusted scores under the same audit framework