HealthAgentBench: Top Agent Clears Only 42% of 54 Clinical Tasks
TL;DR
- HealthAgentBench spans 54 agentic healthcare tasks across 7 categories, each replicating an end-to-end clinical workflow over raw healthcare data.
- The strongest and most cost-effective agent tested, Codex GPT-5.5, achieved only approximately 42% success, the highest score in the evaluation.
- The paper calls out medical imaging as especially challenging: Claude Code models are flagged as weak on visual analysis, while Codex GPT-5.5 shows emerging capability.
Healthcare has always been the domain where the gap between 'the demo worked' and 'it's safe to deploy' is widest, and a new benchmark posted to arXiv puts a number on how wide that gap still is for agentic AI. HealthAgentBench is a suite of 54 tasks organised into 7 categories, and each task is designed to replicate an end-to-end clinical workflow, where an agent has to explore raw healthcare data, operate within a complex environment, and execute multi-step solutions rather than just answer a prompt.
The headline result: the strongest and most cost-effective agent tested, Codex GPT-5.5, cleared only approximately 42% of the tasks. Every other frontier system landed lower. The authors' own framing is that agentic systems still need meaningful progress before real clinical deployment, and that is hard to dispute when the best score on tasks designed to resemble the actual work is about 42%.
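To put the percentage in task terms, here is a minimal back-of-the-envelope sketch. It assumes each of the 54 tasks is scored pass/fail in a single run, which the reporting does not confirm. The function name `plausible_pass_counts` and the one-percentage-point tolerance are illustrative choices, not anything taken from the paper.

```python
TOTAL_TASKS = 54        # task count reported for HealthAgentBench
APPROX_SUCCESS = 0.42   # "approximately 42%" as reported


def plausible_pass_counts(total, rate, tolerance=0.01):
    """Return the whole-task pass counts whose success rate falls
    within `tolerance` of the reported `rate`.

    Assumption: every task is a single pass/fail outcome.
    """
    return [k for k in range(total + 1) if abs(k / total - rate) <= tolerance]


counts = plausible_pass_counts(TOTAL_TASKS, APPROX_SUCCESS)
for k in counts:
    # Show each candidate count and the exact rate it implies.
    print(f"{k}/{TOTAL_TASKS} = {k / TOTAL_TASKS:.1%}")
# prints: 23/54 = 42.6%
```

Under those assumptions, the top agent cleared roughly 23 of 54 tasks and failed about 31. If the benchmark averages over repeated runs or awards partial credit, the percentage would not map onto whole tasks this cleanly.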
The paper also singles out which parts break. Medical imaging is called out as especially challenging, with Claude Code models flagged as weak on visual analysis while Codex GPT-5.5 shows emerging capability there. Tasks that combine large search spaces with compositional reasoning stay difficult across the board. On the more encouraging side, frontier agents did show promise in automatically developing research modeling pipelines over EHR data, which sits closer to an analyst-assistant use case than to autonomous clinical decision-making.
The honest caveat is that a benchmark is a proxy, not a deployment test. HealthAgentBench doesn't tell you how these agents behave on real PHI, on the messiness of live EHR data, or under the audit and liability constraints a health system actually operates under. The reporting also doesn't break out which task categories drag the aggregate score down the hardest, which is the first thing a buyer would want to see before treating a vendor claim as credible.
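For a buyer who does get per-task results from a vendor or from a local run, the per-category view is simple to compute. The sketch below is hypothetical throughout: the `(category, passed)` record shape, the function name `success_by_category`, and the placeholder rows are all assumptions for illustration, and none of the values are benchmark results.

```python
from collections import defaultdict


def success_by_category(results):
    """Aggregate (category, passed) records into per-category tallies.

    Returns {category: (passed_count, total_count, success_rate)}.
    The record shape is an assumption, not the benchmark's output format.
    """
    tally = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        tally[category][1] += 1
        if passed:
            tally[category][0] += 1
    return {c: (p, n, p / n) for c, (p, n) in tally.items()}


# Placeholder rows only; these are NOT HealthAgentBench results.
sample = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", False),
    ("category_b", False),
    ("category_a", True),
]

breakdown = success_by_category(sample)
# List the weakest categories first, since those drag the aggregate down.
for category, (p, n, rate) in sorted(breakdown.items(), key=lambda kv: kv[1][2]):
    print(f"{category}: {p}/{n} passed ({rate:.0%})")
```

Sorting by success rate puts the weakest categories at the top, which is the view the aggregate number hides.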
For anyone budgeting clinical agent pilots, this is a useful reality check. The benchmark is public, model labs now have a concrete gap to close, and health system buyers have something to point at when a vendor's demo feels a little too smooth.
Originally reported by paper
Original headline: HealthAgentBench: Best Frontier Agent Clears Only 42% of 54 Realistic Clinical Agentic Tasks