GauntletBench: Frontier Agents Score 19.1% on Vision Tasks

TL;DR

  • State-of-the-art AI agents achieved only 19.1% task success on GauntletBench versus over 80% for non-expert humans.
  • The benchmark covers 100 vision-intensive tasks across five professional applications with 20 tasks each.
  • Temporal perception, graphical understanding, and 3D reasoning are identified as the largest capability gaps.

Benchmarks built on familiar software give agents a structural advantage, and the resulting scores can look impressive without revealing much about readiness for real professional work. GauntletBench, introduced by researchers at the University of Oxford and collaborators, is designed to test the opposite case: how agents perform in professional applications they are unlikely to have seen before. The paper is available on Hugging Face and is part of a broader push to evaluate agents in genuinely unfamiliar territory.

The benchmark covers 100 vision-intensive tasks across five professional web-based applications: a video editor, a workflow builder, a 3D modeler, a flight analyzer, and a circuit designer, with 20 tasks each. These applications were chosen because they require capabilities the researchers identify as underexplored: temporal perception, graphical understanding, and 3D reasoning. State-of-the-art agentic systems reached a task success rate of 19.1%. Non-expert human annotators completed more than 80% of the same tasks. That is a gap of more than 60 percentage points.
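To make the arithmetic concrete, here is a minimal sketch of how a success rate over this five-by-twenty task layout could be aggregated and compared with the human baseline. The function name, the application keys, and the per-application pass counts are hypothetical placeholders, not results from the paper; only the layout (five applications, 20 tasks each) and the two headline figures (19.1% and "more than 80%") come from the text.

```python
from typing import Dict, List

TASKS_PER_APP = 20  # from the article: five applications, 20 tasks each


def success_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Return per-application and overall task success rates, in percent."""
    rates = {
        app: 100.0 * sum(outcomes) / len(outcomes)
        for app, outcomes in results.items()
    }
    total = sum(len(outcomes) for outcomes in results.values())
    passed = sum(sum(outcomes) for outcomes in results.values())
    rates["overall"] = 100.0 * passed / total
    return rates


# Made-up pass counts for illustration only; the article gives no
# per-application breakdown.
made_up_passes = {
    "video_editor": 3,
    "workflow_builder": 6,
    "3d_modeler": 2,
    "flight_analyzer": 5,
    "circuit_designer": 4,
}
results = {
    app: [True] * n + [False] * (TASKS_PER_APP - n)
    for app, n in made_up_passes.items()
}
rates = success_rates(results)

# Headline figures from the article. The human figure is a lower bound,
# so the computed gap is a lower bound as well.
REPORTED_AGENT_RATE = 19.1
HUMAN_LOWER_BOUND = 80.0
gap = HUMAN_LOWER_BOUND - REPORTED_AGENT_RATE
print(f"gap: at least {gap:.1f} percentage points")
```

A per-application breakdown like `rates` is also the kind of output that would show which application category is hardest for a given agent.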

The benchmark is built as a modular pipeline compatible with both open- and closed-source agent frameworks, with an automated evaluation engine using diverse metrics. The evaluation harness is published on GitHub, so other teams can run their own systems through the same controlled environment and track progress over time.
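One way to picture a modular, framework-agnostic design is an adapter interface that any agent implements, combined with pluggable metric functions. The sketch below is illustrative only: `Task`, `Agent`, `run_benchmark`, and the `exact_match` metric are hypothetical names chosen for this example, not the API of the published GauntletBench harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


@dataclass
class Task:
    task_id: str
    app: str
    instruction: str
    expected: str  # hypothetical reference outcome used for scoring


class Agent(Protocol):
    """Adapter any agent framework (open or closed) would implement."""

    def run(self, task: Task) -> str: ...


# A metric maps (task, agent output) to a score in [0, 1].
Metric = Callable[[Task, str], float]


def exact_match(task: Task, output: str) -> float:
    return 1.0 if output == task.expected else 0.0


def run_benchmark(
    agent: Agent, tasks: List[Task], metrics: Dict[str, Metric]
) -> Dict[str, float]:
    """Run every task through the agent and average each metric."""
    totals = {name: 0.0 for name in metrics}
    for task in tasks:
        output = agent.run(task)
        for name, metric in metrics.items():
            totals[name] += metric(task, output)
    return {name: total / len(tasks) for name, total in totals.items()}


class EchoAgent:
    """Toy agent that returns the instruction unchanged."""

    def run(self, task: Task) -> str:
        return task.instruction


tasks = [
    Task("t1", "circuit_designer", "done", "done"),
    Task("t2", "video_editor", "trim clip", "clip trimmed"),
]
scores = run_benchmark(EchoAgent(), tasks, {"exact_match": exact_match})
print(scores)
```

Keeping the agent behind a small interface means the scoring loop does not need to know which framework produced the output, and new metrics can be added without touching the agents.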

One caveat is that the paper does not report which specific frontier systems were tested, which makes it hard to judge how much performance varies between agents or which application category proved hardest. A set of five web-based professional applications is also a narrow slice, and these controlled versions may not fully represent how specialists use the real tools in practice.

For teams evaluating agents in visual or domain-specific workflows, the implication is that current systems may perform far below their headline scores on familiar benchmarks once they are placed in unfamiliar settings. The three identified gaps give researchers concrete targets, and the open evaluation harness gives them a common place to measure progress.