OpenAI's GeneBench-Pro stumps top models; GPT-5.6 Sol Pro leads at 31.5%
TL;DR
- OpenAI released GeneBench-Pro on June 30, 2026, a 129-problem benchmark across genomics, quantitative biology, and translational biomedicine.
- GPT-5.6 Sol Pro hit 31.5% at maximum reasoning, GPT-5.6 Sol scored 28.7%, Claude Opus 4.8 reached 16.0%, and Gemini 3.5 Flash managed 8.1%.
- Reviewers estimated a typical problem would take a human expert 20 to 40 hours. Ten questions are being open-sourced on Hugging Face, and a 50-question subset goes to Artificial Analysis for independent benchmarking.
A new benchmark from a model lab that puts that same lab's model on top is usually worth a raised eyebrow, but the numbers in OpenAI's GeneBench-Pro announcement are interesting less for who won and more for how far below the ceiling everyone still is. GPT-5.6 Sol Pro tops the board at 31.5% with maximum reasoning and Pro mode on, GPT-5.6 Sol without Pro reaches 28.7%, Claude Opus 4.8 lands at 16.0%, and Gemini 3.5 Flash trails at 8.1%. On a benchmark of 129 problems that reviewers estimated would take a human expert somewhere between 20 and 40 hours each, 31.5% is a long way from anything a working scientist would ship.
The design is the part I would actually pay attention to. GeneBench-Pro pairs each of its 129 problems with a realistic, deliberately noisy dataset and a target answer tied to a downstream decision in genomics, quantitative biology, or translational biomedicine. Correctness is graded deterministically, which sidesteps the rubric drift that has weakened other long-horizon science benchmarks. OpenAI sent 82 of the 129 questions to external domain experts, described as graduate students, postdoctoral researchers, industry scientists, and professors, to check whether each problem was realistic and whether the target answer was actually identifiable from the data.
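The announcement as summarized here does not describe how the grader works, so the following is only a minimal sketch of what deterministic grading can look like, not OpenAI's implementation. It assumes each problem has a single numeric or short-text target answer. The function names `grade` and `score` and the 1% relative tolerance are hypothetical choices made for illustration.

```python
import math
from numbers import Real


def grade(submitted, target, rel_tol=0.01):
    """Return True if `submitted` matches `target` under a fixed rule.

    Hypothetical rule: numbers must agree within `rel_tol`, text must
    match after trimming whitespace and ignoring case.
    """
    if submitted is None:
        return False  # an unanswered problem is always wrong
    if isinstance(target, Real) and not isinstance(target, bool):
        # Numeric target: reject non-numeric submissions outright.
        if not isinstance(submitted, Real) or isinstance(submitted, bool):
            return False
        return math.isclose(submitted, target, rel_tol=rel_tol)
    # Text target: normalized exact match, no partial credit.
    return str(submitted).strip().casefold() == str(target).strip().casefold()


def score(submissions, targets):
    """Fraction of problems answered correctly; missing answers count as wrong."""
    graded = [grade(submissions.get(pid), t) for pid, t in targets.items()]
    return sum(graded) / len(graded)


# Made-up problem IDs and answers, purely for illustration.
targets = {"p1": 0.5, "p2": "BRCA1", "p3": 12}
submissions = {"p1": 0.503, "p2": "TP53"}
print(score(submissions, targets))
```

Because the rule is a pure function of the submitted answer and the target, the same submission gets the same grade on every run and from every grader. That is the property that rubric-based scoring struggles to guarantee.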
What OpenAI is really trying to measure is what it calls "research taste," the chain of judgment calls that shape an analysis: which questions the data can support, when early warning signs should change your model, and when the initial plan should be thrown out. That framing is useful for anyone shipping AI agents into biotech and pharma workflows, because it names the thing these agents keep getting wrong.
The honest caveat is that OpenAI is grading its own homework here, and the strongest non-OpenAI number, 16.0% for Claude Opus 4.8, is reported by OpenAI too, not by Anthropic. What the announcement does not give you is a per-problem compute cost, a breakdown of how the 82 externally reviewed items were distributed across subdomains, or clarity on whether every competitor was tested at its own maximum reasoning setting. Those gaps matter before anyone treats 31.5% as the state of the art.
The forward-looking part is the release plan. Ten representative questions are being open-sourced on Hugging Face and a 50-question subset is going to Artificial Analysis for independent benchmarking, which means the next few weeks should give the field a neutral read on whether the leaderboard holds up outside OpenAI's own harness. That is the number I would wait for.
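One thing to keep in mind when that independent number arrives is how much sampling noise a 50-question subset carries. The sketch below computes a 95% Wilson score interval for a reported accuracy. It rests on assumptions the announcement does not confirm: that each question is scored pass/fail, that questions are independent, and that the reported percentage can be treated as a simple fraction of correct answers. The reported scores do not map to whole numbers of problems out of 129, so they may be averages over multiple runs.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    p_hat: observed fraction correct (0..1)
    n:     number of questions
    z:     normal quantile (1.96 for roughly 95% coverage)
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Scores as reported by OpenAI, evaluated at the full and subset sizes.
reported = {
    "GPT-5.6 Sol Pro": 0.315,
    "GPT-5.6 Sol": 0.287,
    "Claude Opus 4.8": 0.160,
    "Gemini 3.5 Flash": 0.081,
}
for n in (129, 50):
    print(f"n = {n}")
    for model, p in reported.items():
        lo, hi = wilson_interval(p, n)
        print(f"  {model}: {p:.1%} (95% interval {lo:.1%} to {hi:.1%})")
```

Under these assumptions, the intervals at n = 50 come out roughly 20 to 25 percentage points wide for the 31.5% and 16.0% scores. A single-digit gap between two models on the subset would therefore not settle much by itself. A paired comparison on the same questions would be tighter than comparing intervals, but it needs per-question results.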
Originally reported by openai.com
Original headline: OpenAI Introduces GeneBench-Pro — 129-Problem Computational Biology Benchmark Where GPT-5.6 Sol Pro Hits Just 31.5% and Claude Opus 4.8 Reaches 16%