reddit.com via Reddit May 27th 2026

DataCurve Corrects GPT-5.5 SWEBench Pro Score to 86.7%

openai research benchmarks coding-ai model-evaluation

Key insights

DataCurve found 28.9% of SWEBench Pro test cases are broken or incorrect, skewing results across evaluated models.
GPT-5.5's corrected SWEBench Pro score rises to 86.7% after removing failures attributable to flawed benchmark tests.
The reliability problem extends to MMLU and ARC-AGI, suggesting systemic benchmark infrastructure issues beyond SWEBench.

Why this matters

Benchmark scores are the primary signal AI practitioners and investors use to compare model capabilities, so a 28.9% broken-test rate in SWEBench Pro undermines the model selection decisions, procurement analyses, and agent infrastructure investments built on those numbers. Teams that used SWEBench Pro results to justify tooling or vendor choices may need to revisit those decisions using adjusted scores, which could shift competitive rankings. If MMLU and ARC-AGI carry similar flaws, the leaderboard-driven consensus about which models lead the field may be systematically wrong, with direct consequences for anyone allocating engineering resources or budget based on published rankings.

Summary

DataCurve's post-hoc audit of GPT-5.5's SWEBench Pro results reveals a systematic problem: 28.9% of the benchmark's test cases are broken or incorrect, and those flawed tests account for 68.5% of the model's recorded failures. Strip out the bad tests and GPT-5.5's corrected score jumps to 86.7%, materially above the published leaderboard number. The finding adds to previously documented reliability failures in MMLU and ARC-AGI, suggesting the problem is infrastructure-wide rather than isolated to one benchmark. In short, OpenAI and DataCurve are at the center of a benchmark credibility crisis that implicates the broader AI eval ecosystem.

28.9% of SWEBench Pro test cases flagged as broken or incorrect by DataCurve's audit
GPT-5.5's corrected score of 86.7% materially exceeds its published figure
Reliability failures span MMLU and ARC-AGI, not only SWEBench Pro

Published AI rankings may reflect test infrastructure quality as much as actual model capability.
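The score adjustment described in this summary comes down to simple arithmetic, sketched below. The source does not say which adjustment rule DataCurve applied, and it does not give GPT-5.5's published score, so both function names and all counts here are hypothetical. The example numbers are chosen only so that 68.5% of failures (274 of 400) trace to broken tests, matching the reported share. Two plausible conventions are shown: dropping the broken-test failures from the denominator, or crediting them as passes.

```python
def _check(passed: int, failed: int, failed_on_broken: int) -> None:
    """Validate the raw counts before computing any adjusted score."""
    if min(passed, failed, failed_on_broken) < 0:
        raise ValueError("counts must be non-negative")
    if failed_on_broken > failed:
        raise ValueError("failed_on_broken cannot exceed failed")
    if passed + failed == 0:
        raise ValueError("no tasks were evaluated")


def score_excluding_broken(passed: int, failed: int, failed_on_broken: int) -> float:
    """Convention A (assumed): drop failures caused by broken tests
    from the denominator and rescore on the remaining tasks."""
    _check(passed, failed, failed_on_broken)
    remaining = passed + failed - failed_on_broken
    if remaining == 0:
        raise ValueError("no tasks remain after exclusion")
    return passed / remaining


def score_crediting_broken(passed: int, failed: int, failed_on_broken: int) -> float:
    """Convention B (assumed): count failures caused by broken tests
    as passes and keep the full task count as the denominator."""
    _check(passed, failed, failed_on_broken)
    return (passed + failed_on_broken) / (passed + failed)


if __name__ == "__main__":
    # Hypothetical run: 1000 tasks, 600 passed, 400 failed,
    # 274 of the failures (68.5%) attributed to broken tests.
    passed, failed, failed_on_broken = 600, 400, 274
    print(f"raw score:        {passed / (passed + failed):.3f}")
    print(f"excluding broken: {score_excluding_broken(passed, failed, failed_on_broken):.3f}")
    print(f"crediting broken: {score_crediting_broken(passed, failed, failed_on_broken):.3f}")
```

The two conventions give different corrected scores from the same inputs, so the rule used matters when comparing an adjusted figure such as 86.7% against another model's number.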

Potential risks and opportunities

Risks

SWEBench Pro maintainers face pressure to validate or reject DataCurve's 28.9% broken-test finding within 30-60 days, with the benchmark's credibility and adoption at stake if they stay silent
Enterprise AI buyers who selected GPT-5.5 or competing models for code-generation workloads based on SWEBench Pro rankings face internal challenges if corrected scores shift the competitive order materially
Investors and analysts using MMLU, ARC-AGI, and SWEBench Pro as capability proxies for model valuation may be working from corrupted baselines, with no corrected dataset currently available for MMLU or ARC-AGI

Opportunities

Benchmark auditing and eval integrity services (Scale AI, Confident AI, and emerging evals-focused startups) can position corrected-scoring offerings as a billable service for enterprise AI procurement teams
Models with strong performance on code tasks that have avoided SWEBench Pro top-line comparisons gain a credible opening to re-enter sales cycles using adjusted benchmark narratives
Open-source benchmark maintainers (EleutherAI, HuggingFace) have a clear opening to launch audited, integrity-verified coding benchmarks as alternatives to contested leaderboards, accelerating adoption among practitioners who need trustworthy evals

What we don't know yet

Whether OpenAI or SWEBench Pro maintainers have acknowledged the flawed test cases and committed to a corrected benchmark release
Whether DataCurve's methodology for classifying test cases as 'broken' holds up under scrutiny: it has not been publicly peer-reviewed, leaving the 28.9% figure unverified by an independent party
The extent to which other top-ranked SWEBench Pro models (Claude, Gemini) would see comparably adjusted scores under the same audit framework

Originally reported by reddit.com


Original headline: r/ArtificialInteligence: DataCurve Analysis Finds 68.5% of GPT-5.5's SWEBench Pro Failures Were Caused by Broken Test Cases — Corrected Score Should Be 86.7%