TUA-Bench: top terminal agent solves just 65.8% of 120 tasks
TL;DR
- TUA-Bench evaluates terminal-use agents on 120 real-world tasks spread across five task families.
- The strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, scores 65.8%.
- Tasks cover document editing, email, live-web information seeking, and scientific and engineering workflows.
A new paper pushes back on a quiet assumption in the terminal-agent space: that we already know how good the best agents are. According to the paper introducing TUA-Bench, the strongest frontier agent tested, Claude Code running Claude Opus 4.8 at max reasoning effort, scores 65.8% on the benchmark's 120 real-world tasks. That leaves roughly 34% of the work, more than a third, unsolved by the best system the authors tested.
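For a sense of scale, the percentage can be turned back into task counts. This is a minimal sketch that assumes every task is scored pass/fail and weighted equally, which the summary above does not confirm; the names `TOTAL_TASKS`, `REPORTED_RATE`, and `implied_solved` are illustrative only.

```python
# Back-of-the-envelope conversion of the headline score into task counts.
# Assumption (not stated in the source): binary pass/fail, equal task weights.
TOTAL_TASKS = 120
REPORTED_RATE = 0.658  # the 65.8% headline figure


def implied_solved(total: int, rate: float) -> int:
    """Return the whole number of tasks closest to total * rate."""
    return round(total * rate)


solved = implied_solved(TOTAL_TASKS, REPORTED_RATE)
unsolved = TOTAL_TASKS - solved

print(f"solved: {solved}, unsolved: {unsolved}")
print(f"implied rate: {100 * solved / TOTAL_TASKS:.1f}%")
print(f"unsolved share: {100 * unsolved / TOTAL_TASKS:.1f}%")
```

Under that assumption the headline corresponds to 79 solved tasks and 41 unsolved ones, and 41 of 120 is about 34.2%, which is where "more than a third" comes from.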
What makes this interesting is what the 120 tasks actually look like. Prior terminal benchmarks have leaned heavily on coding and system administration, where progress has been fast and headline numbers have crept toward saturation. TUA-Bench instead spans five task families covering document editing, email management, live-web information seeking, and scientific and engineering workflows. In other words, it is the kind of mixed digital busywork a person actually does at a computer, routed through a terminal so that it can be scored deterministically with execution-based checks.
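To make "execution-based checks" concrete, here is a minimal sketch of the general idea: let the agent act in a sandboxed working directory, then pass or fail the task by inspecting the resulting state rather than by judging the transcript. This is not TUA-Bench's harness. The task, the file name `notes.txt`, and the functions `run_agent_command`, `check_task`, and `score_once` are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

EXPECTED = "meeting moved to 3pm\n"  # hypothetical task: leave this note on disk


def run_agent_command(workdir: Path, argv: list[str]) -> int:
    """Stand-in for an agent's terminal action: run one command in workdir."""
    return subprocess.run(argv, cwd=workdir, check=False).returncode


def check_task(workdir: Path) -> bool:
    """Execution-based check: pass only if the final file state matches exactly."""
    target = workdir / "notes.txt"
    return target.is_file() and target.read_text() == EXPECTED


def score_once() -> bool:
    """Run the hypothetical task in a fresh temp directory and score it."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # A real agent would choose its own commands; this one is hard-coded.
        script = "open('notes.txt', 'w').write('meeting moved to 3pm\\n')"
        run_agent_command(workdir, [sys.executable, "-c", script])
        return check_task(workdir)


if __name__ == "__main__":
    print("pass" if score_once() else "fail")
```

Because the verdict depends only on the final state of the working directory, the same agent behavior always receives the same score, which is the property "deterministic" refers to here.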
Why this matters if you are shipping agent products: the prevailing narrative on model leaderboards is that the frontier is increasingly competent at end-to-end computer use. A top score of 65.8% on routine, largely non-coding work is the more sobering counterpoint. If you are building on top of a terminal agent for a workflow that touches email and document handling, the headline benchmark score from a vendor is probably not the number that predicts how your users will feel about the product.
The honest caveat is that the abstract does not break out per-family scores, so we do not yet know whether the agent is steady on documents and weak on the web, or whether it fails roughly evenly across categories. The reporting also does not say how smaller or open-source agents fare on the same 120 tasks, which is the comparison that would tell us how much of the gap is a scale problem versus a training-data problem. Take the 65.8% as the headline, not as a per-task guarantee.
What is worth watching is whether TUA-Bench gets picked up by the labs building agent harnesses. A public yardstick for general computer use that is not just coding-shaped gives the frontier labs something concrete to compete on outside their usual benchmarks, and that is the kind of pressure that has historically moved real-world capability faster than another point on a coding leaderboard.
Original headline: TUA-Bench: Best Terminal Agent Hits Only 65.8% on 120 General-Purpose Digital Tasks