Meta's SWE-Together ranks Opus 4.8 first across 109 tasks
TL;DR
- Meta researchers curated 109 reproducible coding tasks from 11,260 real user-agent sessions, a 0.97% conversion rate, sourced from DataClaw, Pi-staging, Hyperswitch and SWE-chat.
- Claude Opus 4.8 led seven frontier models with 63% pass@1, 59% stable solve rate, 0.801 mean judge score and the lowest mean User Correction at 1.38 per trial.
- User Correction is inversely correlated with pass@1 at Pearson -0.92, supporting the claim that stronger coding agents need less corrective steering to reach the same outcome.
The most interesting line in this Meta paper is not the leaderboard. It is a correlation. Across seven frontier models, the count of corrective messages a simulated user has to send before the coding agent finishes correlates with pass@1 at Pearson -0.92. Stronger agent, less pushback. The team behind SWE-Together frames that as a hypothesis their data supports, but the implication is the part worth sitting with: the headline benchmark for IDE-style coding assistants might end up being how little babysitting they need, not just how often they ship a passing patch.
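For readers who want to see what a Pearson figure like -0.92 is measuring, here is a minimal sketch of the calculation. The article only quotes the resulting coefficient and two of the per-model correction counts, so the input values below are hypothetical toy numbers chosen for illustration, not the paper's data, and the function name `pearson` is our own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL toy values, NOT the per-model numbers from the paper.
toy_corrections = [1.0, 1.5, 2.0, 2.5]     # mean corrective messages per trial
toy_pass_at_1 = [0.70, 0.60, 0.55, 0.40]   # pass@1 as a fraction

r = pearson(toy_corrections, toy_pass_at_1)
print(f"toy Pearson r = {r:.2f}")
```

A value near -1 means the two series fall close to a straight line with negative slope. With only seven models behind the reported figure, one outlier can move the coefficient noticeably.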
The setup. Meta researchers reconstructed 109 reproducible repository-level tasks out of 11,260 real user-agent coding sessions drawn from DataClaw, Pi-staging, Hyperswitch and SWE-chat. That is a conversion rate of 0.97%; most raw sessions don't pin to a recoverable commit or expose a verifiable outcome. Each surviving task is paired with an LLM user simulator anchored to the original user's intent and intervention order. The simulator speaks only when the live trajectory warrants feedback and otherwise stays silent.
The numbers come from seven frontier models run through a common opencode harness with k=2 replicates per task. Claude Opus 4.8 leads on pass@1 (63%), stable solve rate (59%), pass² (52%) and mean judge score (0.801), and elicits the fewest corrections at 1.38 per trial. GPT-5.5 is second on mean judge score (0.763) at 58% pass@1, and is the most efficient model at 29.9k output tokens and 10.7 minutes per task, while Opus 4.8 sits at 74.0k tokens and 23.3 minutes. Claude Opus 4.6 follows in third. GLM-5.2 and GLM-5.1 form the next tier. MiniMax-2.7 ranks last on every correctness metric and needs the most corrections, 2.17 per trial.
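The article does not spell out how these metrics are computed, so the following is only a sketch under common conventions, not the paper's definitions. It assumes pass@1 is the mean per-trial success rate and pass² is the fraction of tasks where both of the k=2 replicates pass. Stable solve rate is left out because its definition cannot be inferred from the reporting. The function names and the outcome data are hypothetical.

```python
def pass_at_1(results):
    """Mean per-trial success rate across all tasks and replicates (assumed definition)."""
    trials = [outcome for task in results for outcome in task]
    return sum(trials) / len(trials)

def pass_pow_k(results):
    """Fraction of tasks where every replicate passed (assumed definition of pass^k)."""
    return sum(all(task) for task in results) / len(results)

# HYPOTHETICAL outcomes for four tasks with k=2 replicates each, not paper data.
toy_results = [(True, True), (True, False), (False, False), (True, True)]

print(f"pass@1 = {pass_at_1(toy_results):.3f}")
print(f"pass^2 = {pass_pow_k(toy_results):.3f}")
```

Under these assumptions pass² can never exceed pass@1, and the distance between the two is a rough signal of how often a model solves a task on one run but not the other.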
The honest caveat is that the reference patch, the recorded human solution, only scores about a 78% pass rate against the same rubric, with roughly 35% of unsatisfied goals being process requirements, like diagnosing the root cause before editing, that a final code diff cannot express. So Opus 4.8's roughly 15-percentage-point gap to the reference is not purely a matter of unresolvable tasks; some of it is the rubric penalising everyone equally. What the reporting doesn't give you is whether SWE-Together and the simulator prompts will be released, how the -0.92 correlation behaves at k greater than 2, or the per-model harness configuration inside opencode beyond 'common'.
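The headline figures can be cross-checked with a few lines of arithmetic. This sketch uses only numbers quoted in the article; the reference pass rate is given as "about 78%", so the gap is approximate, and the variable names are our own.

```python
# Curation yield: reproducible tasks out of raw sessions.
tasks, sessions = 109, 11_260
conversion_pct = 100 * tasks / sessions

# Gap between the recorded human reference patch (about 78%) and Opus 4.8 pass@1 (63%).
reference_pass_pct, opus_pass_pct = 78, 63
gap_points = reference_pass_pct - opus_pass_pct

# Cost of Opus 4.8 relative to GPT-5.5, per task.
token_ratio = 74.0 / 29.9   # output tokens, in thousands
time_ratio = 23.3 / 10.7    # minutes

print(f"conversion rate: {conversion_pct:.2f}%")
print(f"gap to reference: {gap_points} percentage points")
print(f"Opus 4.8 vs GPT-5.5: {token_ratio:.1f}x tokens, {time_ratio:.1f}x time")
```

The ratios put the trade-off in plain terms: Opus 4.8's five-point pass@1 lead over GPT-5.5 comes at more than twice the output tokens and wall-clock time per task.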
If the result generalises with more replicates and a wider model panel, the practical move for vendors and procurement teams is a metric that puts user-effort cost on equal footing with final pass rate. That is closer to what an engineer actually feels in an IDE session than any pass@1 leaderboard has been.
Originally reported by huggingface.co
Original headline: Meta Open-Sources SWE-Together: 109-Task Multi-Turn Coding Agent Benchmark Replays Real User Sessions, Finds Claude Opus 4.8 Needs Least Corrective Steering at 63% pass@1