marktechpost.com web signal

OpenAI's LifeSciBench Tests AI on 750 Life-Science Research Tasks

TL;DR

  • GPT-Rosalind led five models with a 0.576 normalized score and 36.1% task pass rate on LifeSciBench.
  • Each of the 750 tasks carries roughly 25 grading criteria, totaling 19,020 expert-written rubric points.
  • GPT-Rosalind's pass rate dropped from 45.1% on text-only tasks to 28.1% when artifacts were included.

The most candid thing about OpenAI's LifeSciBench, as reported by MarkTechPost, is what it reveals about the ceiling: the best model tested, GPT-Rosalind, passed only 36.1% of tasks when graded against expert-written rubrics. That is the leading score, not a laggard's.

The benchmark covers seven biological domains and seven research workflows, with tasks authored by 173 Ph.D.-level scientists and validated by 453 reviewers, 97% of whom hold doctorates. What makes it structurally different from most AI evals is the grading approach: each of the 750 tasks carries roughly 25 individual criteria, producing 19,020 total grading points that reward specific facts, reasoning steps, and numeric answers rather than just a final answer. About 79% of tasks require multiple reasoning steps, averaging four steps each. Two scores are reported: a normalized rubric score that gives partial credit, and a stricter task pass rate that requires clearing a 70% threshold.
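To make the two reported scores concrete, here is a minimal sketch of how a rubric-graded task could be scored. It is not OpenAI's grading code. The `Criterion` class, the per-criterion `points` weights, and the choice to treat exactly 70% as a pass are all assumptions made for illustration; the reporting only says that partial credit is given and that the pass rate uses a 70% threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple

# 70% comes from the article; treating exactly 70% as a pass is an assumption.
PASS_THRESHOLD = 0.70


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not the benchmark's schema)."""
    description: str
    points: float  # assumed weight; the real rubric weighting is not reported
    met: bool      # whether the grader judged the criterion satisfied


def normalized_score(criteria: List[Criterion]) -> float:
    """Partial-credit score: points earned divided by points available."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("a task needs at least one criterion with positive points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def task_passes(criteria: List[Criterion], threshold: float = PASS_THRESHOLD) -> bool:
    """Stricter view: the task counts only if the score clears the threshold."""
    return normalized_score(criteria) >= threshold


def summarize(tasks: List[List[Criterion]]) -> Tuple[float, float]:
    """Return (mean normalized score, task pass rate) over a set of tasks."""
    if not tasks:
        raise ValueError("no tasks to summarize")
    scores = [normalized_score(t) for t in tasks]
    mean_score = sum(scores) / len(scores)
    pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
    return mean_score, pass_rate


if __name__ == "__main__":
    # Two toy tasks with four equally weighted criteria each.
    strong = [Criterion(f"c{i}", 1.0, i < 3) for i in range(4)]  # 3 of 4 met
    weak = [Criterion(f"c{i}", 1.0, i < 2) for i in range(4)]    # 2 of 4 met
    print(summarize([strong, weak]))
```

Under these assumptions a model can collect a respectable normalized score while passing few tasks outright, which is consistent with the gap between GPT-Rosalind's 0.576 score and its 36.1% pass rate.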

The five models were evaluated in single-turn settings with unrestricted internet access, and their normalized scores ranged from 0.576 for GPT-Rosalind down to 0.399 for Grok 4.3. In between, GPT-5.5 scored 0.519, Gemini 3.1 Pro 0.515, and GPT-5.4 0.479. Translation and scientific communication tasks were relative strengths across models, while design and optimization workflows proved hardest. The most striking finding may be the artifact penalty: GPT-Rosalind's task pass rate dropped from 45.1% on text-only tasks to 28.1% when tasks included sequences, figures, tables, chemical structures, or PDFs.

The numbers supply their own caveat: 171 tasks (22.8% of the set) were not passed by any of the five models, and on 261 tasks the best-model pass rate was below 20%. The reporting does not say whether the benchmark tasks and rubric criteria are openly published for independent replication, or how the authors controlled for contamination between training data and benchmark content. The benchmark was built with OpenAI involvement and GPT-Rosalind leads it, which is worth keeping in mind when weighing the comparative scores.
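The headline figures can be cross-checked with simple arithmetic. The short sketch below uses only numbers quoted in the article. The helper names `share_pct` and `drop` are hypothetical, and the 261-task share and the relative size of the artifact drop are derived here rather than reported.

```python
TOTAL_TASKS = 750
TOTAL_RUBRIC_POINTS = 19_020


def share_pct(part: int, whole: int = TOTAL_TASKS) -> float:
    """Percentage of the task set, rounded to one decimal place."""
    return round(part / whole * 100, 1)


def drop(before_pct: float, after_pct: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = round(before_pct - after_pct, 1)
    relative = round((before_pct - after_pct) / before_pct * 100, 1)
    return absolute, relative


# Average criteria per task, matching the "roughly 25" in the article.
criteria_per_task = round(TOTAL_RUBRIC_POINTS / TOTAL_TASKS, 2)

# Tasks no model passed: the article reports 22.8% of the set.
unsolved_share = share_pct(171)

# Tasks with best-model pass rate below 20% (share derived, not reported).
hard_share = share_pct(261)

# GPT-Rosalind text-only vs. artifact tasks (relative drop derived).
artifact_drop = drop(45.1, 28.1)

if __name__ == "__main__":
    print(criteria_per_task, unsolved_share, hard_share, artifact_drop)
```

On these figures, the tasks with best-model pass rates below 20% make up a little over a third of the benchmark, and the artifact penalty removes more than a third of GPT-Rosalind's text-only pass rate.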

For biopharma teams evaluating AI platforms and for smaller vendors targeting life sciences, LifeSciBench now offers something rare: a structured, expert-grounded rubric against which to measure progress. The 261 tasks where even the best model struggles represent the clearest public statement yet of where the actual research gap sits.