nature.com web signal

GUIDE-LLM Launches 14-Item Checklist for Behavioural AI Studies

TL;DR

  • GUIDE-LLM is a 14-item consensus checklist developed with over 80 experts to standardize LLM reporting in behavioural science.
  • Ambiguous model references, unreported prompts, and missing configuration details currently make LLM-based behavioural studies hard to replicate.
  • All 14 checklist items achieved strong consensus from over two-thirds of a global expert panel spanning psychology, economics, sociology, AI, and ethics.

Large language models are adding a new dimension to the reproducibility problem in behavioural science. Research papers have referred to "ChatGPT" without specifying which version, left prompts unreported, and omitted configuration details that can meaningfully change results. A team led by Stefan Feuerriegel has published a consensus answer in Nature Human Behaviour: a 14-item checklist called GUIDE-LLM, grounded in the finding that "even small changes in prompts, settings, or model versions can lead to substantially different results."

The checklist was developed through a structured, multi-stage Delphi process involving over 80 experts from the behavioural sciences (including psychology, economics, and sociology), computer science, and ethics. All 14 core items achieved strong consensus, with over two-thirds of the expert panel supporting inclusion. The items cover why and how LLMs were used, which model versions and configurations were chosen, how prompts were documented, how outputs were validated, and how code and workflows can be shared for replication.
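
To make the kind of disclosure described above concrete, here is a minimal Python sketch of a per-call record that a study could save alongside its outputs. It is an illustration under stated assumptions, not part of GUIDE-LLM: the class name, the field names, and the model identifier are all hypothetical, and the actual checklist items should be taken from the published checklist itself.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMRunRecord:
    """Hypothetical record of one LLM call; fields are illustrative only."""

    model_id: str            # exact, versioned identifier, not just a product name
    temperature: float       # sampling setting used for the call
    max_tokens: int          # output length limit used for the call
    seed: Optional[int]      # None if the provider exposes no seed
    system_prompt: str       # full prompt text, stored verbatim
    user_prompt: str
    purpose: str             # why the LLM was used in the study

    def prompt_sha256(self) -> str:
        """Fingerprint of the exact prompt text, so silent edits are detectable."""
        text = self.system_prompt + "\n---\n" + self.user_prompt
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        """Serialise the record with stable key order for archiving."""
        payload = asdict(self)
        payload["prompt_sha256"] = self.prompt_sha256()
        return json.dumps(payload, sort_keys=True, indent=2)


# Placeholder values; "example-model-2024-01-01" is not a real model name.
record = LLMRunRecord(
    model_id="example-model-2024-01-01",
    temperature=0.0,
    max_tokens=256,
    seed=42,
    system_prompt="You are a survey respondent.",
    user_prompt="On a scale of 1 to 7, how much do you trust strangers?",
    purpose="Simulate participant responses for a pilot study",
)
print(record.to_json())
```

Storing the prompt verbatim and adding a hash is one possible design choice: the text supports replication, while the hash makes it easy to check that two runs used exactly the same prompt.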

The framing is deliberately modest. The authors state plainly that "GUIDE-LLM does not tell researchers how to use AI. Instead, it establishes a minimum standard for transparency." That modesty has limits: a checklist can be completed without meaningful disclosure, and the paper does not address whether journals will require compliance at submission. Journal enforcement is the lever most likely to determine whether the standard actually changes practice.

The checklist is freely available at llm-checklist.com in DOCX and LaTeX formats. Peer reviewers are the immediate beneficiaries: they currently have almost no basis for evaluating the methodological quality of LLM-based studies. Longer term, a minimum reporting floor matters most to anyone hoping to synthesise findings across a behavioural literature that is accumulating LLM-based evidence faster than it can establish shared norms.