OSWorld 2.0: Top Agents Solve Just 20.6% of Real Tasks
TL;DR
- OSWorld 2.0 is a 108-task benchmark of long-horizon computer-use workflows spanning everyday and professional work.
- The best frontier agent tested completed only 20.6% of tasks at 500 steps, with a 54.8% partial-credit score.
- A single task averages 318 tool calls with Claude Opus 4.7 at maximum thinking, versus about 30 in OSWorld 1.0.
A new computer-use benchmark casts the past two years of agent demos in a very different light, and the headline number is the place to start. On OSWorld 2.0, a 108-task set of long-horizon workflows posted to arXiv on June 28, the best frontier agent tested completes only 20.6% of tasks at a 500-step budget. The partial-credit score on the same evaluation is 54.8%, which is arguably the more honest read on how close current systems are getting.
Two specifics make the gap concrete. Each task takes human users a median of about 1.6 hours to finish, end to end. And a single task averages 318 tool calls when run by Claude Opus 4.7 with maximum thinking, compared with about 30 in OSWorld 1.0. That is where current agents come apart: not in button-pressing, but in holding a many-step real workflow together.
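A rough way to see why task length matters so much is a compounding-error model. The sketch below is an illustration only, not the paper's methodology. It assumes every tool call succeeds independently with the same probability and that a task is completed only if every call succeeds. Neither assumption comes from the paper, and the function names are hypothetical. It also treats the 318-call average as if it were the length of every task.

```python
# Illustrative compounding-error model (an assumption, not from the paper):
# each tool call succeeds independently with probability p, and a task is
# completed only if all of its calls succeed.

def per_step_reliability(task_success: float, steps: int) -> float:
    """Return p such that p ** steps == task_success."""
    return task_success ** (1.0 / steps)


def task_success_rate(per_step: float, steps: int) -> float:
    """Return the chance that `steps` independent calls all succeed."""
    return per_step ** steps


# Figures quoted in the article.
OSWORLD_2_SUCCESS = 0.206   # best agent's completion rate at 500 steps
OSWORLD_2_CALLS = 318       # average tool calls per task (Claude Opus 4.7)
OSWORLD_1_CALLS = 30        # approximate tool calls per task in OSWorld 1.0

# Per-call reliability implied by 20.6% success over 318 calls.
p = per_step_reliability(OSWORLD_2_SUCCESS, OSWORLD_2_CALLS)

# What the same per-call reliability would give on a 30-call task.
short_task = task_success_rate(p, OSWORLD_1_CALLS)

# How much longer an OSWorld 2.0 task is, measured in tool calls.
length_ratio = OSWORLD_2_CALLS / OSWORLD_1_CALLS

print(f"implied per-call reliability: {p:.4f}")
print(f"same reliability on a 30-call task: {short_task:.1%}")
print(f"task length ratio: {length_ratio:.1f}x")
```

Under these simplifying assumptions, a per-call reliability of roughly 99.5% is enough to produce a 20.6% completion rate over 318 calls, while the same reliability would complete about 86% of 30-call tasks. Real agents can recover from errors and real tasks vary in length, so the model is only a way to build intuition about long horizons, not an estimate of any agent's actual reliability.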
The paper's diagnosis of why is worth quoting. Agents 'lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification', and they struggle most 'when a task hinges on hidden state they must recover'. That points at memory and grounding over long horizons, not pixel-level UI control, as the binding constraint.
The honest caveat is that the retrieved abstract does not break down which task categories the 20.6% number comes from, does not tell you whether 318 tool calls reflect agent inefficiency or irreducible task structure, and does not define what counts inside the 54.8% partial score. Those distinctions matter if you are reading this as a procurement signal rather than a research one.
The forward-looking part is the upside for everyone who is not a top-three lab. A 20.6% ceiling leaves a lot of headroom, and the failure modes the OSWorld team names (constraint tracking, asking instead of guessing, mid-task information capture, and verification) are concrete enough to build evals and product features around. Expect the next round of frontier releases to be judged on whether they move the 20.6% number or only the partial one.
Originally reported by paper
Original headline: OSWorld 2.0: Best AI Agent Completes Only 20.6% of Long-Horizon Real Tasks