arxiv.org web signal June 29th 2026

AI Agents Succeed at Neuroscience Stages, Fail Full Discovery Run

TL;DR

Coding agents solved individual stages of a fly optogenetics pipeline but could not complete the full end-to-end discovery run.
Agents struggle most when there is no predefined criterion to iterate on and they must rely on their own scientific judgment, a key open challenge.
The study flags challenges absent from standard benchmarks: computational resource management and generalization to large held-out datasets.

The gap between what AI coding agents score on standard benchmarks and what they deliver inside a real scientific pipeline is widely suspected but rarely measured. New work on arXiv by Kai A. Horstmann, Ethan Lin, Alice A. Robie, Jennifer J. Sun, and Kristin Branson measures it, stress-testing general-purpose coding agents on a fly optogenetics neuroscience data-to-discovery pipeline. The tasks are substantially larger than those in existing benchmarks, and the datasets are orders of magnitude bigger.

The result is more precise than a simple pass or fail. Agents can solve several individual pipeline stages, which the authors say suggests stage-level automation is tractable. The problem is composing those stage wins into a complete discovery run. That remains beyond agents' current abilities.

What explains the gap? The paper points to something structural. Agents struggle most when there is no predefined criterion to iterate on, forcing them to apply their own scientific judgment to assess their current solution. Mirroring how scientists actually work, agents sometimes attempt visual inspection of intermediate outputs for self-evaluation, but the paper reports they largely fail to interpret what they see or act on it appropriately. Beyond that failure mode, the authors identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections.

The honest caveat is that the abstract does not enumerate which specific stages agents solved or which models were evaluated, so it is hard to judge how quickly these gaps might close. What the paper does give you is a set of principles the authors distill for constructing scientific tasks and rigorous evaluation criteria for open-ended problems, a methodological contribution that may outlast any specific pass rate.

For research teams thinking about where to deploy coding agents now, the implication is reasonably clear: pipeline stages with explicit success criteria are candidates for automation today, while full end-to-end discovery requiring ongoing scientific judgment is not yet within reach.

Shared on Bluesky by 2 AI experts

Eugene Vinitsky @eugenevinitsky.bsky.social amplified

Kristin Branson @kristinmbranson.bsky.social

Agentic coding is genuinely useful now, and there are some impressive reports of AI agents doing science. But how well and how reliably can they handle tasks scientists actually want to hand off, ones that bottleneck pro…

Originally reported by arxiv.org

Original headline: A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline