Anthropic's VirBench exposes biology agents' retrieval problem
TL;DR
- On VirBench, a 120-query, 40-pathogen viral-sequence benchmark, tested AI agents posted mean accuracies from 16.9% to 91.3% without deterministic tooling.
- Adding gget virus, a deterministic retrieval layer built with NCBI researchers, lifted every tested model above 92% accuracy, peaking at 99.7%.
- Without the tool, Claude Sonnet 4 returned 106, 15, and 5 sequences across three identical runs of the same query, when 266 was expected.
A new piece of research from Anthropic is interesting less for what it claims about agents than for where it locates the bottleneck. In a post on its research blog, the team behind a new benchmark called VirBench reports that current AI agents asked to pull viral sequence data from public databases land somewhere between unreliable and unusable, with mean accuracies ranging from 16.9% to 91.3% across the tested systems.
VirBench contains 120 realistic viral-sequence queries spanning 40 pathogens, each with a manually verified ground-truth count. The variability is the part that should make people pause: Claude Sonnet 4, given the same query three times in a row, returned 106, 15, and 5 sequences when the expected count was 266. In another analysis, that kind of drift produced phylogenetic trees placing an outbreak's origin anywhere from 1922 to April 2014, against an accurate date of January 2014.
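To make the run-to-run drift concrete, here is a minimal sketch that scores the three reported counts against the expected 266. The post as summarized here does not spell out how VirBench computes accuracy, so the returned-to-expected ratio and the helper names `count_ratio` and `summarize_runs` are illustrative assumptions, not the benchmark's actual metric.

```python
from statistics import mean


def count_ratio(returned: int, expected: int) -> float:
    """Ratio of returned to expected sequence count (assumed metric).

    Values below 1.0 mean sequences were missed; values above 1.0
    would mean the run returned more than the ground-truth count.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return returned / expected


def summarize_runs(returned_counts: list[int], expected: int) -> dict:
    """Summarize repeated runs of one query against its ground truth."""
    ratios = [count_ratio(c, expected) for c in returned_counts]
    return {
        "ratios": ratios,
        "mean_ratio": mean(ratios),
        # Gap between the best and worst run, in sequences.
        "spread": max(returned_counts) - min(returned_counts),
        # A deterministic retrieval layer should make this True.
        "all_identical": len(set(returned_counts)) == 1,
    }


# Counts reported for Claude Sonnet 4 on three identical runs.
summary = summarize_runs([106, 15, 5], expected=266)
for run, ratio in enumerate(summary["ratios"], start=1):
    print(f"run {run}: {ratio:.1%} of expected count")
print(f"spread: {summary['spread']} sequences")
print(f"identical across runs: {summary['all_identical']}")
```

Under this assumed metric, even the best of the three runs recovers well under half of the expected sequences, and no two runs agree with each other.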
The fix the authors propose is less glamorous than a smarter model. In collaboration with researchers at NCBI, they built a deterministic retrieval layer called gget virus that wraps a complex, browser-based workflow into a reproducible interface. Plugging that into the same agents pushed every tested model above 92% accuracy, peaking at 99.7%, and largely eliminated run-to-run variability. The framing the authors use is that "the bottleneck for biological agents is not only reasoning but the absence of widespread deterministic execution layers for querying biological data."
The case study that sharpens the point is the ongoing outbreak of Ebola disease caused by Bundibugyo virus in the Democratic Republic of the Congo. The post notes that on May 14, 2026, INRB Kinshasa analyzed 13 blood samples and confirmed Bundibugyo virus disease in eight of them the next day, and that by May 29 the WHO had reported more than 1,000 confirmed and suspected cases, including more than 200 deaths. In that setting, getting the right sequences out of public databases reliably is a workflow question with public-health weight.
The honest caveat is that this is one benchmark on one slice of biology, and the post itself anticipates a near future where models become "good enough to navigate messy portals" on their own, making the wrapper less necessary. What the post doesn't try to answer is how broadly the deterministic-layer pattern generalizes beyond viral sequences, or who bears the cost of maintaining one of these layers for every database. The near-term upside is concrete, though: labs working with public sequence data get reproducible answers from off-the-shelf agents now, rather than waiting on the next frontier model to learn to click through a portal.
Originally reported by anthropic.com
Original headline: Paving the way for agents in biology