OpenBioRQ: AI Agents Cite Wrong Papers 15.9% of the Time
TL;DR
- Over 99% of the URLs that AI agents cite resolve correctly, but approximately 15.9% of those citations point to a paper that does not support the claim being made.
- OpenBioRQ benchmarks 12,553 unsolved biomedical questions across 12 domains as a faithfulness-and-abstention probe for agents.
- Frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) score in a 29-60% range on the hardest subset; open-weight models solve only about 17%.
The citation problem in AI agents turns out not to be hallucination in the usual sense. A new benchmark paper, OpenBioRQ, covers 12,553 unsolved biomedical research questions across 12 domains and finds that agents rarely fabricate citations: over 99% of cited URLs resolve correctly. The failure is subtler, with approximately 15.9% of those citations linking to papers that do not actually support the claim being made.
That distinction matters enormously for how you build and evaluate agents. If your benchmark only checks whether URLs resolve, you will score a system as nearly perfect on citation fidelity while missing a failure that affects roughly one in six citations in biomedical contexts. The benchmark deliberately uses open, unsolved questions as a faithfulness-and-abstention probe, because questions without known answers prevent models from simply reproducing expected sources.
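To make the gap between the two checks concrete, here is a minimal Python sketch that scores the same set of citations both ways. It is not the paper's evaluation code. The `Citation` record, the `supports_claim` label (which would have to come from a human reviewer or a judge model), and the toy data are all assumptions made for illustration, as is the choice to measure the wrong-paper rate only among citations whose URLs resolve.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical record for one citation emitted by an agent."""
    url_resolves: bool    # the cited URL loads a real page
    supports_claim: bool  # assumed label: the linked paper backs the claim


def citation_metrics(citations):
    """Return the URL resolution rate and the wrong-paper rate.

    The wrong-paper rate is computed only over citations whose URL
    resolves, which is an assumption about how such a rate is defined.
    """
    total = len(citations)
    if total == 0:
        raise ValueError("need at least one citation")
    resolved = [c for c in citations if c.url_resolves]
    wrong = sum(1 for c in resolved if not c.supports_claim)
    return {
        "resolution_rate": len(resolved) / total,
        "wrong_paper_rate": wrong / len(resolved) if resolved else 0.0,
    }


# Toy data, not taken from the paper: every URL resolves,
# but 3 of the 20 linked papers do not support the claim.
toy = [Citation(True, True)] * 17 + [Citation(True, False)] * 3
metrics = citation_metrics(toy)
print(metrics["resolution_rate"], metrics["wrong_paper_rate"])
```

On this toy set a resolution-only check reports a perfect score, while the support check shows that 3 of 20 citations point to the wrong paper.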
The performance picture across current frontier systems is also sobering. Gemini-3-Pro, Opus-4.7, and GPT-5.5 scored between 29% and 60% on the hardest question subset, while open-weight models solved only about 17% of those questions. The paper also observes that on difficult questions, agents tend to stop using their retrieval tools entirely, a behavioral collapse that compounds the citation accuracy problem.
One methodological finding stands on its own: a frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82. That is a large jump, and it suggests that a significant portion of what looks like model disagreement in agent evaluation studies may actually be evaluation disagreement.
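For readers who have not worked with the metric, the sketch below shows how a Spearman correlation between two judges can be computed in plain Python. It is an illustration only. The function names and the judge scores are invented, and the paper's own agreement computation may differ in details such as tie handling.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(scores_a, scores_b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one judge gives constant scores")
    return cov / (var_a * var_b) ** 0.5


# Invented scores from two hypothetical judges on the same five answers.
judge_a = [3, 1, 4, 2, 5]
judge_b = [2, 1, 5, 3, 4]
print(round(spearman(judge_a, judge_b), 2))  # prints 0.8
```

A value near 1 means the two judges order the answers almost identically, and a value near 0 means their rankings are close to unrelated. That is why a move from 0.35 to 0.82 indicates the judges went from weakly to strongly aligned.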
The honest caveat is that this is a single paper covering biomedical questions specifically, and the 15.9% rate may not transfer directly to other domains. What the paper does not address is whether wrong citations stem from a retrieval failure or a generation failure. For teams building or auditing AI research assistants, OpenBioRQ is described as non-saturating, meaning it should remain a useful benchmark as models continue to improve.
Originally reported by paper
Original headline: AI Agents Cite the Right Link but the Wrong Paper 15.9% of the Time