Paper Finds Bigger AI Agents Can Be Worse at Abstaining
TL;DR
- A new paper evaluates 13 LLM-as-agent systems and 2 agent scaffolds on more than 28,000 tasks across web shopping, terminal, and QA environments.
- The authors find that larger or more capable models sometimes perform worse at timely abstention, not better.
- Their CONVOLVE context engineering method raised Llama-3.3-70B's timely recall on WebShop from 26.7 to 57.4 without updating model parameters.
A new paper from Han Luo, Bingbing Wen and Lucy Lu Wang, posted on arXiv as "Agentic Abstention: Do Agents Know When to Stop Instead of Act?", takes a careful look at a failure mode that does not show up on the usual leaderboards: an LLM agent that keeps calling tools on a task that was never going to succeed. The authors define agentic abstention as a sequential decision problem where, at every turn, the agent can answer, abstain, or gather more information, and argue that the interesting question is not only whether agents can abstain but when they do it.
The scope of the study is what makes it worth paying attention to. They evaluate 13 LLM-as-agent systems and 2 agent scaffolds on more than 28,000 tasks spanning web shopping, terminal environments and question answering. Their headline finding is that some agents "never abstain when they should, while others do so only after many unnecessary interactions," and that the gap is especially large on tasks where the instruction "appears feasible until the environment reveals otherwise (e.g., no valid result matches the instruction)." The bit that cuts against the usual scale-is-all-you-need framing is their report that "larger or more capable models sometimes perform worse at timely abstention." Note the "sometimes": the paper does not claim a clean monotonic relationship.
They also introduce a method called CONVOLVE, described as a context engineering approach that distills full interaction trajectories into reusable stopping rules. On WebShop, the authors report that CONVOLVE raised Llama-3.3-70B's timely recall rate from 26.7 to 57.4 without updating model parameters. That is the kind of number that, if it holds up, points at a fairly cheap intervention for teams that cannot fine-tune frontier models.
The honest caveat is that this is one paper with one set of benchmarks, and the abstract does not break out which specific frontier models got worse with scale, by how much, or whether the CONVOLVE gains carry over from WebShop to the terminal and QA settings. Take the specifics as reported, not settled.
What I would actually watch from here is whether agent platform teams start scoring timely abstention as a first-class metric alongside task success. If they do, the practitioners who benefit first are the ones running open-weight agents on real workflows, where a context-engineering trick that doubles timely recall is worth a lot more than another point of leaderboard accuracy.
Originally reported by paper
Read the original article →Original headline: Bigger AI Agents Are Worse at Knowing When to Stop