Study: No AI Model Spontaneously Proposes Null Hypotheses in Science

TL;DR

  • No model class tested — reasoning or non-reasoning — spontaneously proposed null hypotheses, a move scientists make more freely.
  • Non-reasoning LLMs converge into a 'hivemind' of similar ideas; LLMs struggle most in pluralistic fields like the social sciences.
  • The study drew 25,139 rating sets from 6,749 scientists across biology, medicine, chemistry, and the social sciences.

When AI boosters project that large language models will accelerate scientific discovery, they tend to skip an awkward question: can these systems do the part of science that involves being wrong on purpose? A new preprint on arxiv.org, by Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, and James A. Evans, mounts what it calls the largest scientist-in-the-loop evaluation to date and maps where that acceleration narrative breaks down.

The scale is unusual. The researchers invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that LLMs generated from the context and puzzles of their own papers. In all, 6,749 scientists returned 25,139 rating sets, an average of about 3.7 per respondent, covering novelty, empirical feasibility, probability of being true, and favorability of adoption. The result is a rare attempt to measure AI scientific creativity against genuine expert judgment rather than automated proxies.

The central finding is specific: no model class spontaneously proposes null hypotheses, "a move humans make more freely." Non-reasoning LLMs compound this by collapsing into "a narrow 'hivemind' of similar ideas." Reasoning models explore broader spaces, but the null hypothesis gap persists across all model classes. LLMs also falter most in pluralistic fields like the social sciences "that demand context-aware interpretation," where senior social scientists proved the harshest critics.

The paper adds a warning about automated evaluation itself. LLM-as-a-judge, artificial metrics, and even state-of-the-art models "agree only weakly with expert judgment." A custom Qwen3-14B reward model trained specifically on the human ratings gathered in this study achieved meaningful improvement, which suggests that teams building AI research tools need expert-grounded calibration built in from the start.
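
For readers who want to see what "agreeing only weakly" can look like in numbers, here is a minimal sketch of one common way to compare an automated judge with experts: a Spearman rank correlation between the two sets of scores. This is an assumption chosen for illustration, not the statistic the preprint necessarily uses, and the function names and scores below are hypothetical, not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for five ideas (illustration only, not study data).
expert_scores = [4, 2, 5, 1, 3]  # e.g. an expert's novelty ratings
judge_scores = [3, 3, 4, 2, 5]   # e.g. an LLM judge's ratings of the same ideas

rho = spearman(expert_scores, judge_scores)
print(f"Spearman correlation between judge and expert: {rho:.2f}")
```

A value near 1 would mean the judge orders ideas much as the expert does, while a value near 0 would mean its ordering carries little information about the expert's. Rank correlation is used here because rating scales are ordinal and a judge can be useful even if its raw scores sit on a different scale.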

The honest caveat is that the study rates AI-generated ideas, not AI-completed research, so whether these gaps translate into downstream failures in actual research pipelines is an open question the paper does not answer. It also covers four disciplines; whether physics, mathematics, or engineering show the same patterns is untested. But the null hypothesis finding is not an edge case. It is the move that lets science interrogate its own premises, and an AI that skips it systematically is a brainstorming partner with a specific and consequential blind spot.
