
Claude Opus 4.7 Surpasses Nature-Family SOTA on Only 17.8% of Tasks

TL;DR

  • Claude Opus 4.7, the strongest of 10 tested agents, surpassed published Nature-family SOTA on just 17.8% of 90 scientific tasks.
  • Wrong method selection caused 45.1% of failures; agents defaulted to supervised ML pipelines in 41.4% of all runs.
  • When an agent's chosen method matched the source paper's broad method family, the match-SOTA rate rose from 29.6% to 37.7%.

A new benchmark from researchers at Frontis.AI and Tsinghua University puts a number on something the AI-for-science field has been dancing around: not whether agents can reproduce published results, but whether they can actually match or beat them. NatureBench tested ten frontier coding agents against 90 task packages drawn from peer-reviewed Nature-family journals published between 2022 and 2025. The tasks span six scientific domains: cellular omics, protein biology, biomedical modeling, physical modeling, molecular design, and relational reasoning. The benchmark uses each paper's own reported state-of-the-art score as a normalized target, so there is no room to claim success on a softer proxy.

The headline result is that Claude Opus 4.7, the best-performing agent, surpassed the published SOTA on only 17.8% of tasks and matched it on 47.8%. GPT-5.5 surpassed SOTA on 14.4% of tasks and Gemini 3.5 Flash on 15.6%. When agents do succeed, the mechanism tells most of the story: supervised proxy prediction accounted for 45.5% of successful runs, meaning agents predominantly solved problems by converting them into standard machine learning pipelines rather than by reasoning about the underlying science. The authors call this pattern "methodological translation" rather than scientific invention, and agents concentrated 41.4% of all their runs in supervised predictive modeling regardless of what the source paper had used.
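To make those percentages concrete, the sketch below converts them into implied task counts. It assumes each percentage is a share of the 90 tasks with every task counted once per agent; the article does not state the raw counts, so the numbers it prints are back-calculated and the helper name `implied_count` is purely illustrative.

```python
# Back-of-the-envelope conversion of reported percentages into task counts.
# Assumption: each rate is a fraction of the 90 benchmark tasks.
N_TASKS = 90

reported = {
    "Claude Opus 4.7 (surpass)": 17.8,
    "Claude Opus 4.7 (match)": 47.8,
    "GPT-5.5 (surpass)": 14.4,
    "Gemini 3.5 Flash (surpass)": 15.6,
}


def implied_count(pct, total=N_TASKS):
    """Nearest whole number of tasks corresponding to a percentage (hypothetical helper)."""
    return round(pct / 100 * total)


counts = {name: implied_count(pct) for name, pct in reported.items()}

for name, pct in reported.items():
    k = counts[name]
    # Re-derive the percentage from the whole-number count as a consistency check.
    print(f"{name}: {pct}% ~ {k}/{N_TASKS} tasks ({100 * k / N_TASKS:.1f}%)")
```

Under that assumption, the best agent's 17.8% surpass rate corresponds to 16 of 90 tasks, and each reported figure rounds back to itself from a whole-number count.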

The failure analysis is the more instructive half of the paper. Across 900 task-agent runs (10 agents on 90 tasks each), wrong method selection caused 45.1% of failures, while insufficient compute budget or wall-clock time caused another 24.4%. Task misunderstanding was comparatively rare. Critically, when an agent's chosen method fell into the same broad family as the source paper's approach, match-SOTA rates rose from 29.6% to 37.7%, a clean signal that method selection is the primary lever.
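The size of that method-family effect is easier to judge as both an absolute and a relative change. The short calculation below uses only the figures quoted above; the variable names are illustrative. It deliberately does not convert the 45.1% and 24.4% failure shares into run counts, because those are shares of failed runs and the total number of failures is not given here.

```python
# Worked arithmetic on the figures quoted in the text.
N_AGENTS, N_TASKS = 10, 90
total_runs = N_AGENTS * N_TASKS  # 900 task-agent runs

# Match-SOTA rate when the agent's method family differs from / matches the paper's.
rate_other_family = 0.296
rate_same_family = 0.377

# Absolute lift in percentage points, and relative lift as a fraction.
abs_lift_pp = (rate_same_family - rate_other_family) * 100
rel_lift = rate_same_family / rate_other_family - 1

print(f"total runs: {total_runs}")
print(f"absolute lift: {abs_lift_pp:.1f} percentage points")
print(f"relative lift: {rel_lift * 100:.1f}%")
```

Choosing a method in the same family as the paper is therefore associated with a gain of about 8 percentage points, or roughly a quarter more matches in relative terms. This is a descriptive comparison, not evidence of causation on its own.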

The honest caveat is that the benchmark is constrained by design. Tasks were filtered to problems that are ML-formulated, automatically evaluable, and based on publicly available data. That excludes large categories of scientific work requiring experimental design or hypothesis generation. The hardest domains in the evaluation, biomedical modeling at a 17.9% match rate and molecular design at 18.2%, are already near the floor of current agent capability within this constrained scope, so performance on harder, less structured scientific tasks is likely worse still. Web search was also disabled for all agents, and each task ran under a four-hour wall-clock budget; how much either constraint shapes the results is unclear.

For teams building or evaluating scientific AI, the takeaway is structural rather than a leaderboard ranking: method selection, not task comprehension, is what separates success from failure. The researchers are releasing NatureBench, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction, with a stated long-term aim of converting the same substrate into training data for future scientific-discovery agents.