AGORA: top model hits 59.4% on workplace document reasoning

TL;DR

  • AGORA pairs 362 questions with eight domain-specific document collections totaling 9,664 files and 372 million tokens.
  • The strongest of eight evaluated models reached only 59.4% accuracy, with significant variation across domains.
  • The corpus exceeds any single model's context window, forcing deliberate exploration over exhaustive scanning.

A new arXiv preprint introduces a benchmark aimed at something most long-context evaluations sidestep: what happens when an AI agent has to find an answer inside a real workplace archive rather than a tidy single document. The benchmark, AGORA, pairs 362 questions with eight domain-specific document collections: 9,664 files in total, adding up to 372 million tokens. That last number matters because it is far larger than any single model's context window, so the agent has to actually explore rather than stuff everything in and hope.
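To make the scale concrete, here is a small back-of-the-envelope sketch. The corpus figures are the ones quoted above; the context-window sizes are illustrative assumptions, not numbers from the paper, and since "372 million" is itself rounded, the results are approximate.

```python
# Figures quoted from the AGORA abstract (372M is a rounded total).
CORPUS_TOKENS = 372_000_000
CORPUS_FILES = 9_664

# ASSUMPTION: hypothetical context-window sizes, for illustration only.
ASSUMED_WINDOWS = {"128k": 128_000, "1M": 1_000_000, "2M": 2_000_000}


def avg_tokens_per_file(tokens: int, files: int) -> float:
    """Mean file length in tokens."""
    return tokens / files


def windows_needed(tokens: int, window: int) -> int:
    """How many full context windows it would take to read everything once."""
    return -(-tokens // window)  # ceiling division


if __name__ == "__main__":
    avg = avg_tokens_per_file(CORPUS_TOKENS, CORPUS_FILES)
    print(f"average file: about {avg:,.0f} tokens")
    for name, size in ASSUMED_WINDOWS.items():
        n = windows_needed(CORPUS_TOKENS, size)
        print(f"{name} window: {n:,} passes to scan the whole corpus")
```

Even under the generous 2M-token assumption, a single exhaustive read would take 186 full windows, which is why a brute-force scan per question is not a realistic strategy here.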

The headline result is the kind of honest number that benchmarks rarely produce on their first run. Across eight evaluated models, the strongest reached just 59.4% accuracy, with significant variation across the different domains. In other words, the best of the eight systems tested answers roughly three in five questions correctly when forced to reason across messy collections with inconsistent terminology, units, and time formats. The authors frame this as substantial room for improvement, which reads as an understatement.

Why this matters if you are not building benchmarks yourself: this is the shape of the work agents actually get asked to do inside companies. Find the right contract clause across thousands of files. Reconcile a number reported three different ways. Trace a decision through email threads and PDFs that nobody bothered to normalize. Most existing long-context tests reward models that can hold a haystack in their head; AGORA is built so that approach does not work, because the haystack will not fit.

Now the caveats. This write-up relies on a single source, the paper itself, and the abstract does not name which eight models were evaluated, which domain collections were hardest, or what the obfuscation step looks like in practice. The leakage-preventing obfuscation, cross-document task synthesis, and difficulty filtering are described as pipeline components, not independently audited procedures, so treat the 59.4% as a point estimate from one team's setup rather than a settled industry number.

The direction worth watching is whether agentic frameworks built for exactly this shape of problem (retrieval, then planned exploration, then reconciliation) can close the gap on a benchmark that was deliberately designed to punish brute force. If they can, the case for deploying document-reasoning agents on real archives gets a lot more credible. If they cannot, the gap between demo and deployment stays wide.
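For readers who want a feel for what "planned exploration" means in code, the following is a deliberately naive sketch: rank files by a cheap relevance score and read only what fits a token budget. Everything in it (the `FileInfo` record, the keyword-overlap `score`, the `plan_reads` helper) is hypothetical and invented for illustration; it is not AGORA's pipeline or the method of any evaluated agent.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    summary: str  # ASSUMPTION: cheap metadata such as a title or first lines
    tokens: int   # estimated cost of reading the full file


def score(question: str, info: FileInfo) -> int:
    """Count question words that also appear in the file summary."""
    q_terms = set(question.lower().split())
    return len(q_terms & set(info.summary.lower().split()))


def plan_reads(question: str, files: list[FileInfo], token_budget: int) -> list[str]:
    """Pick the most relevant files that fit the budget, best first."""
    ranked = sorted(files, key=lambda f: score(question, f), reverse=True)
    plan, spent = [], 0
    for f in ranked:
        if score(question, f) == 0:
            break  # everything after this point is irrelevant
        if spent + f.tokens > token_budget:
            continue  # too large for the remaining budget; try smaller files
        plan.append(f.path)
        spent += f.tokens
    return plan


# Toy archive, made up for this example.
ARCHIVE = [
    FileInfo("a.pdf", "vendor contract termination clause", 40_000),
    FileInfo("b.eml", "lunch menu friday", 500),
    FileInfo("c.docx", "contract renewal terms", 70_000),
]
QUESTION = "Which contract has a termination clause"

if __name__ == "__main__":
    print(plan_reads(QUESTION, ARCHIVE, token_budget=100_000))
```

A real agent would replace the keyword overlap with proper retrieval and would revise its plan after each read. The sketch only shows the budgeting step, and leaves out reconciling conflicting values across the files it does read.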