
DiscoBench: LLM Agents That Keep Searching Lose to Guessing

TL;DR

  • DiscoBench contains 211 samples and 463 ambiguity instances across 11 real-world domains, targeting how agents handle vague queries.
  • The paper finds that repeatedly searching instead of asking for clarification often performs worse than direct guessing.
  • The authors argue ambiguity detection and effective clarification are distinct capabilities current search agents largely lack.

A new arXiv paper argues that the direction most deep-search agents are being pushed (more retrieval, more tool calls, longer research runs) is nearly the opposite of what would help on the queries real users actually type. The paper introduces DiscoBench, a benchmark of 211 samples and 463 ambiguity instances spread across 11 real-world domains. The headline result is that agents that respond to a vague question by running more searches often do worse than agents that simply guess.

The specific framing in the abstract is that "repeatedly searching instead of asking for clarification often performs worse than direct guessing." The authors treat ambiguity detection and effective clarification as two separate capabilities, and report that current LLM search agents largely lack both. When a query is underspecified, the ambiguity propagates along the multi-step reasoning chain, and the agent ends up on the wrong search trajectory even when each individual retrieval step looks fine in isolation.
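To make the ask-versus-search distinction concrete, here is a minimal Python sketch of a decision step that checks for ambiguity before spending any searches. It is not the paper's method. The names `AgentState`, `candidate_readings`, `max_searches`, and `next_action` are hypothetical, and the rule "more than one live reading means ask" is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    ASK = "ask"        # request clarification from the user
    SEARCH = "search"  # run another retrieval step
    ANSWER = "answer"  # commit to a final answer


@dataclass
class AgentState:
    query: str
    # Hypothetical field: the distinct interpretations of the query that
    # the agent currently considers plausible.
    candidate_readings: List[str]
    searches_done: int = 0
    max_searches: int = 3  # assumed search budget, not taken from the paper


def next_action(state: AgentState) -> Action:
    """Pick the next step, checking for ambiguity before searching."""
    # Assumption: if more than one reading is still live, searching would
    # commit to one of them blindly, so ask first.
    if len(state.candidate_readings) > 1:
        return Action.ASK
    # The query is unambiguous: search while budget remains.
    if state.searches_done < state.max_searches:
        return Action.SEARCH
    # Budget exhausted: answer with what has been retrieved.
    return Action.ANSWER


if __name__ == "__main__":
    vague = AgentState("jaguar top speed", ["the animal", "the car"])
    print(next_action(vague).value)
```

The point of the ordering is that the ambiguity check runs before the search budget is consulted, so a vague query cannot start a long retrieval run on the wrong reading. How an agent would actually populate `candidate_readings` is the hard part, and it is left open here.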

Why this matters if you are shipping a deep-research product: the industry has recently been optimizing for longer runs, bigger contexts, and more tools. This work suggests that a whole class of failures cannot be fixed that way. If the query is vague and the agent will not stop to ask, more searching is not neutral; it can actively make the final answer worse. The evaluation is organized around four perspectives (task utility, ambiguity detection, interaction strategy, and cost efficiency), a useful signal that answer quality alone is no longer the right single scoreboard.
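One way to see why a single scoreboard falls short is to keep the four perspectives as separate axes and compare agents by Pareto dominance instead of one number. The Python sketch below assumes each perspective can be reduced to a score where higher is better, which the paper's abstract does not confirm. The `Scorecard` class and the `dominates` function are hypothetical names, and the numbers in the usage example are placeholders, not results from the paper.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Scorecard:
    # The four perspectives named in the paper; the 0-to-1 scale and the
    # higher-is-better convention are assumptions made for this sketch.
    task_utility: float
    ambiguity_detection: float
    interaction_strategy: float
    cost_efficiency: float


def dominates(a: Scorecard, b: Scorecard) -> bool:
    """True if a is at least as good as b on every axis and better on one."""
    pairs = list(zip(astuple(a), astuple(b)))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    heavy_searcher = Scorecard(0.6, 0.2, 0.2, 0.3)
    early_asker = Scorecard(0.5, 0.8, 0.7, 0.6)
    print(dominates(heavy_searcher, early_asker))
    print(dominates(early_asker, heavy_searcher))
```

When neither agent dominates the other, ranking by task utility alone hides a real trade-off, which is the situation a multi-perspective evaluation is meant to surface.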

The honest caveat is that this is a single, recent benchmark from one team, with 211 items in 11 domains. The arXiv abstract does not say which specific commercial deep-search products the authors tested, how closely the user simulator matches real human behavior, or how large the gap is between the strongest and weakest models. Take the specifics as reported, not settled.

The forward-looking piece is what teams do with this. If knowing when to ask is a distinct skill from retrieval, the interesting work over the next few quarters is training data, prompts, and evals built around that skill. Buyers also get a concrete new thing to test vendors on: not just whether the agent can find the answer, but whether it notices when it should not go looking yet.
