huggingface.co web signal

AGORA Benchmark Sets 59.4% Ceiling for AI Document-Reasoning Agents

TL;DR

  • AGORA pairs 362 questions with 9,664 authentic workplace documents spanning 372 million tokens across 8 domains.
  • The best of 8 evaluated models, Gemini-3.1-Pro, answers only 59.4% of queries correctly, leaving the task far from solved.
  • Smaller models effectively fail: Qwen3.5-9B scores at or below 3% in five of eight domains.

Most benchmarks for language model agents hand them a handful of documents and ask a question. AGORA, introduced by researchers from Fudan University, Zhejiang University, and Shanghai Qiji Zhifeng, does something deliberately harder: it pairs 362 questions with 9,664 authentic workplace documents, spread across eight collections and totaling 372 million tokens, a corpus far too large for any model to read at once. By exceeding every model's context window, AGORA forces planned, deliberate exploration rather than brute-force scanning, which is the condition faced by financial analysts, legal researchers, and policy teams working over real internal archives.

The results are a reality check on where enterprise document agents actually stand. Eight models were evaluated inside a minimal harness that exposes only a bash tool, keeping the comparison focused on model reasoning rather than scaffolding engineering. The strongest model, Gemini-3.1-Pro, answered only 59.4% of queries correctly. The eight models split into two sharply separated tiers: a frontier group clustered in the 40-60% band, and a lower group that falls well below it. The gap between the tiers reaches 28.73 points, larger than any gap within either group. Smaller models fare worse still: Qwen3.5-9B scores at or below 3% in five of eight domains, and Gemini-3.1-Flash-Lite scores at or below 7% in six, a level the paper describes as effectively non-functional on AGORA.
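To make the "bash tool only" setup concrete, here is a minimal sketch of what such a harness loop could look like. It is not the paper's harness: the function names (`run_bash`, `run_episode`), the `policy` callable standing in for the model, the action format, and the turn and timeout limits are all assumptions chosen for illustration.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Hypothetical formats, not taken from the paper:
# a transcript entry is (role, text); an action is (kind, payload).
Entry = Tuple[str, str]
Action = Tuple[str, str]


def run_bash(command: str, cwd: str = ".", timeout_s: float = 30.0) -> str:
    """Run one shell command and return its stdout and stderr as text."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "[timeout]"
    return (proc.stdout + proc.stderr).strip()


def run_episode(
    policy: Callable[[List[Entry]], Action],
    question: str,
    corpus_dir: str,
    max_turns: int = 20,  # assumed budget, not the paper's value
) -> Optional[str]:
    """Let the policy explore the corpus through bash until it answers."""
    transcript: List[Entry] = [("question", question)]
    for _ in range(max_turns):
        kind, payload = policy(transcript)
        if kind == "answer":
            return payload
        # Any other action is treated as a bash command to execute.
        transcript.append(("bash", payload))
        transcript.append(("observation", run_bash(payload, cwd=corpus_dir)))
    return None  # turn budget exhausted without an answer
```

In this sketch a real model would replace `policy`. Returning `None` when the turn budget runs out corresponds to the kind of resource-exhaustion outcome the article mentions, although how the actual harness records it is not stated here.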

The failure mode breakdown is where things get actionable. Most errors across frontier models trace to three evidence-grounded categories: the agent skips a required document, extracts the wrong value from the right file, or ignores a stated requirement of the query. Hallucination stays below 12% for frontier models but climbs to around 40% for smaller ones, suggesting the tier gap reflects evidence discipline more than reasoning depth. Resource exhaustion (running out of turns, time, or context budget) is the wildcard: it is GPT-5.5's most common failure mode at 24.59% of wrong traces, is near zero for the DeepSeek-V4 family (at or below 1.10%), and becomes catastrophic for Gemini-3.1-Flash-Lite at 69.61%.
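Percentages such as "24.59% of wrong traces" are shares of a model's incorrect runs, not of all runs. The sketch below shows one way such shares could be computed from per-trace labels. The function name, the label strings, and the toy data are hypothetical and are not the paper's taxonomy or results.

```python
from collections import Counter
from typing import Dict, Iterable


def failure_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return each category's share, in percent, of the wrong traces.

    `labels` holds one failure label per incorrect trace; correct traces
    are excluded before calling this function.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100.0 * n / total, 2) for cat, n in counts.items()}


# Toy labels, invented purely for illustration.
toy_wrong_traces = [
    "missed_document",
    "missed_document",
    "wrong_extraction",
    "resource_exhaustion",
]
print(failure_shares(toy_wrong_traces))
```

Because the denominator is the number of wrong traces, a model with few errors can show a large share for one category even when the absolute count is small, which is worth keeping in mind when comparing these figures across models.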

The honest caveats: all models ran through a single minimal harness, and the paper notes absolute accuracy may shift under heavier frameworks. The difficulty-filtering panel also shared three models with the evaluation set, which may have slightly skewed benchmark calibration against them. The results cannot tell you how much improvement richer scaffolding or specialized retrieval strategies might buy; that comparison is left to future work.

For teams building or buying enterprise document agents, the roughly 40.6% failure rate of the best evaluated model (100% minus its 59.4% accuracy) is a concrete constraint on what can be deployed autonomously today. The construction pipeline is designed to be reusable, so the benchmark can be refreshed as models improve, meaning AGORA is likely to stay relevant for however long it takes the frontier to actually crack deliberate, archive-grounded reasoning.