CLI Agents Surpass GUI Once Skill Coverage Gaps Are Filled

TL;DR

  • A matched 440-task benchmark finds the best GUI agent reaches 59.1% success versus 48.2% for CLI with original skills.
  • Augmenting CLI skill coverage raises its success rate from 48.2% to 69.3%, about 10 percentage points above the best GUI agent.
  • The benchmark covers 18 applications and 12 workflow categories using identical goals, states, and verifiers for both modalities.

For years, comparisons between GUI-driven and CLI-driven computer-use agents have been confounded by a basic design flaw: the tasks, starting states, and success verifiers were rarely matched, so performance differences could reflect the benchmark as much as the agents. A new paper on arXiv from Xiao Zhou, Siyue Zhang, and colleagues attacks that problem directly, constructing a benchmark of 440 desktop tasks across 18 applications and 12 workflow categories where both modalities face identical conditions.
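The matched design described above can be pictured as a single task record that both agents consume. What follows is a minimal sketch, not the paper's harness: `MatchedTask`, `pass_rate`, and the two toy agents are hypothetical names invented here for illustration, and application state is reduced to a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]
# An agent takes a goal and a starting state and returns the final state.
Agent = Callable[[str, State], State]


@dataclass(frozen=True)
class MatchedTask:
    """One task shared by both modalities (hypothetical structure)."""
    goal: str
    initial_state: State
    verifier: Callable[[State], bool]


def pass_rate(agent: Agent, tasks: List[MatchedTask]) -> float:
    """Run every task from a fresh copy of its state and score with its verifier."""
    passed = 0
    for task in tasks:
        final_state = agent(task.goal, dict(task.initial_state))
        if task.verifier(final_state):
            passed += 1
    return passed / len(tasks)


# Toy tasks, invented for this sketch only.
tasks = [
    MatchedTask("rename file", {"file": "draft.txt"},
                lambda s: s.get("file") == "report.txt"),
    MatchedTask("set dark theme", {"theme": "light"},
                lambda s: s.get("theme") == "dark"),
]


def toy_gui_agent(goal: str, state: State) -> State:
    # Stand-in for a GUI agent that can reach both settings.
    if goal == "rename file":
        state["file"] = "report.txt"
    elif goal == "set dark theme":
        state["theme"] = "dark"
    return state


def toy_cli_agent(goal: str, state: State) -> State:
    # Stand-in for a CLI agent whose skill set only covers renaming.
    if goal == "rename file":
        state["file"] = "report.txt"
    return state


gui_rate = pass_rate(toy_gui_agent, tasks)
cli_rate = pass_rate(toy_cli_agent, tasks)
```

Because goal, starting state, and verifier live in one record, any difference between `gui_rate` and `cli_rate` comes from the agents rather than from mismatched task definitions.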

The headline numbers are striking. The best GUI agent reached a 59.1% full pass rate, while the CLI agent using its original skill set landed at 48.2%, a gap of 10.9 percentage points. But when the researchers augmented the CLI agent's skill coverage, its success rate climbed to 69.3%, about 10 points above the best GUI result. That result reframes the debate. The CLI deficit was not primarily a question of which modality is more capable; it was a question of whether the CLI had the right tools available.
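The arithmetic behind those figures is easy to check. The sketch below uses only the three reported rates and the 440-task total; the implied pass counts are an inference that assumes each rate was computed over all 440 tasks, which the summary does not state explicitly.

```python
TOTAL_TASKS = 440

# Reported rates, in percent.
rates = {"gui_best": 59.1, "cli_original": 48.2, "cli_augmented": 69.3}


def gap_points(a: float, b: float) -> float:
    """Difference between two rates in percentage points, to one decimal."""
    return round(a - b, 1)


def implied_passes(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of passed tasks (assumes the rate covers all tasks)."""
    return round(rate_percent / 100 * total)


gui_lead_before = gap_points(rates["gui_best"], rates["cli_original"])
cli_lead_after = gap_points(rates["cli_augmented"], rates["gui_best"])
cli_gain = gap_points(rates["cli_augmented"], rates["cli_original"])
passes = {name: implied_passes(rate) for name, rate in rates.items()}

print(gui_lead_before, cli_lead_after, cli_gain)  # 10.9 10.2 21.1
print(passes)  # {'gui_best': 260, 'cli_original': 212, 'cli_augmented': 305}
```

Under that assumption, skill augmentation is worth roughly 93 additional passed tasks for the CLI agent, which leaves it about 45 tasks ahead of the best GUI agent.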

The paper draws a useful distinction between the two kinds of ceilings involved. GUI agents hit their limits around reliable grounded interaction over long-horizon workflows, the challenge of tracking state across many steps through a visual interface. CLI agents, by contrast, are bottlenecked by the coverage and scalability of their skill interfaces. These are different engineering problems that call for different investments.

The honest caveat is that a controlled benchmark of 440 tasks, however carefully matched, is not a deployment. Real environments are messier, the application surface is larger, and skill augmentation at scale carries its own maintenance overhead. What the study does not give you is a clear picture of which specific task types or applications showed the largest skill gaps, or how either modality holds up on the genuinely long-horizon workflows where GUI agents reportedly struggle most.

For teams building or evaluating computer-use agents, the actionable read is that skill inventory is a first-order problem worth auditing before investing in deeper model changes. For benchmark designers, the controlled matched framework is a template worth borrowing.
