
Michaelov et al.: Closed LLMs Are Ill-Suited for Science

TL;DR

  • Closed model performance can vary by up to 60% across undocumented version updates, making past benchmark results irreproducible.
  • The paper defines open-weight models as providing complete weights, tokenizer, full architecture, and sufficient code to run independently.
  • Authors recommend researchers justify model selection and systematically document threats to inference when using LLMs in research.

The question lurking behind many LLM-in-science papers, whether closed models can support scientific claims at all, finally has a direct treatment. A paper on arXiv by James A. Michaelov and colleagues argues that current closed language models are "generally ill-suited for scientific purposes" and lays out the structural reasons why. The argument is framed as a technical analysis of what threatens reliable inference, not as a complaint about access.

The paper names three specific problems. The first is versioning: closed models are updated without documentation, and the behavioral changes can be substantial. The paper cites a case where benchmark performance "varies by up to 60% between the March and June 2023 versions of GPT-3.5 and GPT-4." Older versions eventually disappear entirely, which makes past results irreproducible. The second is credit assignment: when you test a closed model, you are not measuring the language model itself but a compound system involving hidden system prompts, guardrails, and filtering, meaning "the behavior of any given Closed Model cannot reliably be attributed to the language model it contains." The third is limited access: text output alone does not give researchers the probability distributions or internal states that interpretability research and robust model comparison require.
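To make the third point concrete, here is a minimal sketch of the kind of quantity that becomes computable once a model exposes its raw output scores (logits) rather than only sampled text. It is an illustration, not the paper's method: the function names and the toy logits are hypothetical, and a real study would read logits from an actual open-weight model instead of a hand-written list.

```python
import math


def log_softmax(logits):
    """Turn raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_surprisal(step_logits, token_ids):
    """Total surprisal, in bits, of the observed tokens.

    step_logits: one list of vocabulary logits per position.
    token_ids:   the token actually observed at each position.
    """
    total_bits = 0.0
    for logits, tok in zip(step_logits, token_ids):
        log_p = log_softmax(logits)[tok]   # natural-log probability
        total_bits += -log_p / math.log(2)  # convert nats to bits
    return total_bits


# Hypothetical logits over a 3-token vocabulary at two positions.
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
observed = [0, 2]
print(f"surprisal: {sequence_surprisal(toy_logits, observed):.3f} bits")
```

With a text-only interface, none of these per-token probabilities are observable, so comparisons of this kind have to fall back on sampled outputs.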

The paper's recommended alternative is open-weight models, defined as providing complete weights, the tokenizer, a full account of the architecture, and sufficient code to run the model independently. Models like Pythia and OLMo, which also release training data, training code, and checkpoints, are held up as examples of the broader transparency that scientific research benefits from.

The authors do not argue closed models have no place in science. They flag appropriate uses including "one-off solution generation and existence proofs" and research studying the societal effects of deployed systems. But claims about a closed model's superiority on a benchmark require, according to the paper, "convincing evidence that threats to robustness were mitigated." The honest caveat is that "open models do not guarantee reliable inferences" either. Open access is a necessary condition for certain kinds of scrutiny, not a guarantee of rigorous results.

For researchers using LLMs as methodology today, the paper's practical recommendation is to identify potential threats to inference, document mitigation steps, and provide specific justifications for model selection. That is a disclosure norm the research community could adopt before journals decide to impose one.
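One way such a disclosure could be kept alongside a project is as a small structured record. The sketch below is only an assumption about what that might look like: the class names, fields, and example values are hypothetical and are not prescribed by the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Threat:
    """One potential threat to inference and how it was handled."""
    description: str
    mitigation: str = ""  # empty string means not yet mitigated


@dataclass
class ModelUseDisclosure:
    """Hypothetical record of why and how a model was used in a study."""
    model_id: str
    version: str
    access_date: str
    justification: str
    threats: list = field(default_factory=list)

    def unmitigated(self):
        """Descriptions of threats that have no documented mitigation."""
        return [t.description for t in self.threats if not t.mitigation.strip()]

    def to_json(self):
        """Serialize the record, for example for supplementary materials."""
        return json.dumps(asdict(self), indent=2)


# Illustrative values only; none of these come from the paper.
disclosure = ModelUseDisclosure(
    model_id="example-open-model",
    version="checkpoint-001",
    access_date="2025-01-01",
    justification="Weights and tokenizer are available, so runs can be repeated.",
    threats=[
        Threat("Prompt wording may drive the effect",
               "Repeated with three paraphrased prompts"),
        Threat("Benchmark items may appear in training data"),
    ],
)
print(disclosure.unmitigated())
print(disclosure.to_json())
```

Keeping the unmitigated threats queryable makes it harder for an open item to go unreported when the results are written up.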
