SWE-Perf Reference Patches Stable on 11 of 140 Tasks, Paper Finds
TL;DR
- SWE-Perf's official reference patches satisfied validity criteria on only 11 of 140 tasks when replayed across different Google Cloud machines.
- GSO and SWE-fficiency official rankings disagreed on 9 of 28 pairwise submission comparisons where the two leaderboards overlap.
- At least one public submission already matches or beats the reference patch on 85.3% of replay-valid GSO and SWE-fficiency tasks.
A new arXiv paper takes a close look at three benchmarks the field has been using to score how well AI coding agents optimize real code, and the results are unflattering enough that they should change how people read those leaderboards. The paper, posted to arXiv by Zhi Chen, Zhensu Sun, Yuling Shi, David Lo and Lingxiao Jiang, replays the official reference patches for GSO, SWE-Perf and SWE-fficiency on different Google Cloud machines and checks how many of those tasks produce a stable, valid result across runs.
The numbers are the story. SWE-Perf's own reference patches held up on only 11 of 140 tasks across machines, or about 8%. GSO managed 39 of 102 (about 38%), and SWE-fficiency held up on 411 of 498 (about 83%). The authors say SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes, so ordinary machine noise can flip the pass/fail answer. On the submissions that appear on both leaderboards, the official GSO and SWE-fficiency rankings disagree on 9 of 28 pairwise comparisons, roughly a third.
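To see why a near-zero speedup is fragile, here is a minimal sketch in Python. It is not the paper's method or any benchmark's actual validity check. The function names, the worst-case noise model (each timing off by up to a fixed fraction) and the pass rule (improvement above a threshold) are all assumptions made for illustration.

```python
def improvement_bounds(base_s, patched_s, noise_frac):
    """Return (worst, best) measured relative improvement.

    Assumption: each timing can be off by up to +/- noise_frac of its
    true value. This is an illustrative noise model, not the paper's.
    """
    if base_s <= 0 or patched_s <= 0:
        raise ValueError("runtimes must be positive")
    if not 0 <= noise_frac < 1:
        raise ValueError("noise_frac must be in [0, 1)")
    # Worst case: patched run measured slow, base run measured fast.
    worst = 1 - (patched_s * (1 + noise_frac)) / (base_s * (1 - noise_frac))
    # Best case: patched run measured fast, base run measured slow.
    best = 1 - (patched_s * (1 - noise_frac)) / (base_s * (1 + noise_frac))
    return worst, best


def verdict_is_stable(base_s, patched_s, noise_frac, min_improvement=0.0):
    """True if pass/fail cannot change under the assumed noise.

    Hypothetical pass rule: measured improvement > min_improvement.
    """
    worst, best = improvement_bounds(base_s, patched_s, noise_frac)
    always_pass = worst > min_improvement
    always_fail = best <= min_improvement
    return always_pass or always_fail


# A 1% true gain with 3% timing noise: the verdict can go either way.
print(verdict_is_stable(10.0, 9.9, 0.03))
# A 50% true gain with the same noise: the verdict holds.
print(verdict_is_stable(10.0, 5.0, 0.03))
```

Under these assumptions, a patch whose true gain is smaller than the noise band can pass on one machine and fail on another, which is the kind of instability the article describes for SWE-Perf.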
Two other findings are worth flagging. SWE-fficiency's leaderboard scoring rule assigns between 58.5% and 82.8% of the score weight to its worst ten tasks, so a handful of edge cases can dominate the ordering. At the same time, at least one public submission already matches or beats the reference patch on 85.3% of the replay-valid GSO and SWE-fficiency tasks, and 99.8% of submissions beat unoptimized base code. The headroom the benchmarks are measuring is smaller than the leaderboards make it look.
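A small Python sketch shows how a few bad tasks can dominate an aggregate score. The article does not state SWE-fficiency's scoring rule, so the harmonic mean of per-task speedups used here is purely an assumption, and the function name and sample data are hypothetical.

```python
def worst_k_weight_share(speedups, k=10):
    """Share of a harmonic-mean score's denominator owned by the k worst tasks.

    Assumption: the aggregate is n / sum(1 / s_i) over per-task speedups s_i,
    so each task's "weight" is its reciprocal 1 / s_i. This is an illustrative
    choice, not a confirmed description of any leaderboard's rule.
    """
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    # The smallest speedups have the largest reciprocals.
    reciprocals = sorted((1.0 / s for s in speedups), reverse=True)
    return sum(reciprocals[:k]) / sum(reciprocals)


# Hypothetical data: 10 tasks run 10x slower, 90 tasks run 2x faster.
made_up = [0.1] * 10 + [2.0] * 90
print(f"{worst_k_weight_share(made_up):.1%}")
```

With this made-up data, ten tasks out of a hundred own well over half of the denominator. That is the same shape of concentration the paper reports, though the reported 58.5% to 82.8% range comes from the benchmark's real rule and data, not from this sketch.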
The honest caveat is that this is a single preprint replaying results on one cloud provider, and the benchmark authors have not responded on the record yet. The paper also does not show what a stability-corrected ranking of the top public agents would look like once the noisy reference patches are pruned. If the finding holds, the useful move for anyone reading these leaderboards is to treat them as a rough filter rather than a scoreboard, and to weight independent, machine-diverse replays more heavily than the posted numbers.
Originally reported by paper
Original headline: SWE-Perf Reference Patches Stable on Only 8% of Tasks Across Machines