swe-rebench.com via Reddit

SWE-rebench puts GPT-5.5 first on clean coding leaderboard

openai anthropic cursor coding tools coding-benchmarks ai-agents model-evaluation

Key insights

  • GPT-5.5 leads SWE-rebench's verified track at 88.7%, while Claude Opus 4.7 tops SWE-bench Pro, the hardest task tier, at 64.3%.
  • SWE-rebench uses private GitHub repos to prevent training data contamination, making scores more reliable than public benchmark results.
  • The leaderboard covers March-May 2026 cumulative results across hundreds of real-world software engineering tasks with verified ground truth.

Why this matters

Benchmark contamination has been the central methodological flaw in coding agent evaluation: models trained on public repos post inflated scores when tested on those same repos. SWE-rebench's private-repo methodology produces the first trustworthy cross-model comparison of GPT-5.5, Opus 4.7, Cursor Composer 2.5, and Kimi K2.6, giving engineering teams actionable signal for agent selection decisions. The results diverge, with GPT-5.5 leading on breadth at 88.7% and Opus 4.7 winning on the hardest tasks at 64.3%. That divergence indicates that model strengths now segment by task complexity, so the right agent to deploy depends on a codebase's difficulty profile.

Summary

SWE-rebench released its March-May 2026 coding agent leaderboard, testing GPT-5.5, Claude Opus 4.7, Cursor Composer 2.5, and Kimi K2.6 on private GitHub repos to close the contamination loophole that inflates scores on public benchmarks. GPT-5.5 leads the verified track at 88.7%. Opus 4.7 tops SWE-bench Pro, the hardest task tier, at 64.3%. That split separates broad task coverage from ceiling performance on complex engineering work: the OpenAI and Anthropic models trade off differently against task difficulty.

  • GPT-5.5 at 88.7% on the verified track, best overall
  • Opus 4.7 at 64.3% on the hardest task tier
  • Hundreds of real engineering tasks with independently verified ground truth

Developers choosing a coding agent now have the cleanest cross-model comparison available from a methodologically credible source.

Potential risks and opportunities

Risks

  • OpenAI and Anthropic face credibility risk if GPT-5.5's 88.7% verified-track lead is later traced to leaked or near-duplicate private repos in training data
  • Developers who adopt agents based on current SWE-rebench rankings could find performance profiles shift significantly at the next quarterly refresh before August 2026
  • Kimi (Moonshot AI) and Cursor risk losing enterprise sales cycles if their full SWE-rebench scores land materially below GPT-5.5 and Opus 4.7 when complete numbers are published

Opportunities

  • Anthropic can use Opus 4.7's SWE-bench Pro lead to target enterprise customers with complex, high-stakes codebases where hardest-task ceiling performance matters most
  • SWE-rebench could attract benchmark licensing revenue from AI labs and enterprises seeking private, contamination-resistant evaluation runs for internal model selection
  • Cursor and Moonshot AI can use any competitive showing to negotiate partner integrations with IDE vendors (JetBrains, VS Code) ahead of the next leaderboard cycle in Q3 2026

What we don't know yet

  • The exact scores for Cursor Composer 2.5 and Kimi K2.6 on the verified and SWE-bench Pro tracks, which are not disclosed in available public reporting
  • Whether the private repo dataset has been independently audited for selection bias that could still favor certain models' training distributions
  • Which engineering task categories (bug fixes, feature additions, refactoring) drive the GPT-5.5 vs. Opus 4.7 performance divergence, since no breakdown has been published