Diaz proposes trajectory-preference eval to unstick agent benchmarks
TL;DR
- Fernando Diaz argues success-only metrics tie agent comparisons on roughly 75% of instances, gutting statistical power.
- His preference-based trajectory evaluation compares progress and time-to-return profiles, cutting ties to roughly 35%.
- The paper suggests apparent benchmark saturation may reflect the evaluation measure, not exhausted data or problems.
Agent benchmarks have been throwing off a lot of ties lately, and a new arXiv preprint from Fernando Diaz argues the ties themselves are the story. In Offline Preference-Based Trajectory Evaluation, posted June 16, 2026, Diaz reports that standard success-based metrics tie comparisons on roughly 75% of instances across the agentic and interactive benchmarks he tested. When most head-to-head matchups collapse to a tie, the effective sample size drops and you lose the ability to tell two systems apart, even if one is genuinely better on the way to the answer.
Diaz's proposal is to stop scoring only the terminal outcome and instead compare trajectories directly, using what he calls temporal preferences over progress and time-to-return profiles. In his experiments the tie rate falls to roughly 35%, which he frames as an improvement in discriminative power, ranking stability, and data efficiency. The framing that stuck with me comes at the end of the abstract: the widely noted "benchmark saturation," usually attributed to exhausted data or problems that are simply too easy, may also be an artifact of the evaluation measure.
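To make the idea concrete, here is a minimal sketch of how a trajectory-aware comparison can break ties that a success-only comparison leaves standing. It is not the paper's method: the `Trajectory` record, the per-step progress scores, and the tie-breaking rule (fewer steps when both succeed, higher peak progress when both fail) are all assumptions made for illustration, and the toy data is invented, so the resulting tie rates say nothing about Diaz's reported figures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trajectory:
    # Hypothetical record: progress score in [0, 1] after each step.
    # Reaching 1.0 is treated as task success (an assumption of this sketch).
    progress: List[float]

    @property
    def success(self) -> bool:
        return bool(self.progress) and max(self.progress) >= 1.0

    @property
    def steps_to_success(self) -> Optional[int]:
        for step, value in enumerate(self.progress, start=1):
            if value >= 1.0:
                return step
        return None


def success_only(a: Trajectory, b: Trajectory) -> int:
    """Return +1 if a wins, -1 if b wins, 0 on a tie, using terminal success only."""
    return int(a.success) - int(b.success)


def trajectory_aware(a: Trajectory, b: Trajectory) -> int:
    """Illustrative tie-breaker (an assumption, not the paper's preference model)."""
    outcome = success_only(a, b)
    if outcome != 0:
        return outcome
    if a.success:
        # Both succeeded: prefer the trajectory that got there in fewer steps.
        sa, sb = a.steps_to_success, b.steps_to_success
        return int(sb > sa) - int(sa > sb)
    # Both failed: prefer the trajectory that reached higher peak progress.
    pa = max(a.progress, default=0.0)
    pb = max(b.progress, default=0.0)
    return int(pa > pb) - int(pb > pa)


def tie_rate(pairs: List[Tuple[Trajectory, Trajectory]],
             compare: Callable[[Trajectory, Trajectory], int]) -> float:
    """Fraction of paired comparisons that end in a tie."""
    return sum(compare(a, b) == 0 for a, b in pairs) / len(pairs)


# Invented toy data: five instances, each run by two systems.
toy_pairs = [
    (Trajectory([0.5, 1.0]), Trajectory([0.2, 0.6, 1.0])),  # both succeed, a is faster
    (Trajectory([0.1, 0.3]), Trajectory([0.1, 0.7])),       # both fail, b gets further
    (Trajectory([1.0]), Trajectory([0.4])),                 # only a succeeds
    (Trajectory([0.2, 0.2]), Trajectory([0.2])),            # both fail identically
    (Trajectory([0.3]), Trajectory([1.0])),                 # only b succeeds
]

print(tie_rate(toy_pairs, success_only))      # 0.6
print(tie_rate(toy_pairs, trajectory_aware))  # 0.2
```

The sketch also shows where the gaming risk comes from: any progress signal used as a tie-breaker becomes something an agent can optimize without finishing the task.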
Why this matters if you are not building benchmarks yourself: the cost of a meaningful eval run scales with how many comparisons you need before the noise clears. If a trajectory-aware score can cut ties by more than half, teams tuning agentic systems on modest budgets get more signal per dollar, and buyers get a sharper way to challenge vendor claims of parity. Public leaderboards, if they adopt something like this, could suddenly rank models that today read as indistinguishable.
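A rough back-of-envelope version of that budget argument, assuming tied comparisons carry no information (as in a sign test, which discards ties) and that the reported 75% and 35% tie rates would transfer to your own benchmark. The function names and the target of 200 informative comparisons are arbitrary choices for illustration; real statistical power also depends on the effect size and on how the non-tied comparisons split.

```python
def informative_comparisons(n_instances: int, tie_pct: int) -> int:
    """Comparisons left after discarding ties; tie_pct is a whole-number percentage."""
    return n_instances * (100 - tie_pct) // 100


def instances_needed(target_informative: int, tie_pct: int) -> int:
    """Instances required to expect at least target_informative non-tied comparisons."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_informative * 100 // (100 - tie_pct))


for tie_pct in (75, 35):
    print(tie_pct,
          informative_comparisons(1000, tie_pct),
          instances_needed(200, tie_pct))
```

Under these assumptions, 1,000 instances yield 250 informative comparisons at a 75% tie rate and 650 at 35%, a factor of 2.6; reaching 200 informative comparisons takes 800 instances in the first case and 308 in the second.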
The honest caveats are the usual ones for a single-author preprint. The paper as retrieved does not name the specific benchmarks in its abstract, does not describe how the temporal preferences are elicited or labeled, and has not been independently reproduced. Take the 75% and 35% figures as reported by Diaz, not as settled numbers. The arXiv page also does not say whether trajectory-preference rankings actually correlate with the downstream utility a practitioner cares about, or whether they can be gamed by agents that look busy without finishing the task.
The direction is still the part worth watching. If "we ran out of benchmark" turns out to often mean "we ran out of metric," the cheapest agent-evaluation upgrade of the next year may not be a new dataset at all.
Shared on Bluesky by 2 AI experts
Please try to look deeper than success rate for agent evaluation. Thank you. arxiv.org/abs/2606.17541
Originally reported by arxiv.org
Original headline: Offline Preference-Based Trajectory Evaluation