arxiv.org web signal

ATLAS automates hypothesis generation and experiment design for mechanistic science

TL;DR

  • ATLAS combines sparse neural networks with active learning to generate and test mechanistic hypotheses automatically.
  • The system achieved 5-10x better sample efficiency than random experimentation across all evaluation metrics.
  • Validation compared ATLAS-designed experiments against expert-designed ones from published cognitive science literature.

Most machine learning applied to science is predictive: given data, forecast an outcome. ATLAS, a framework from Noémi Éltető, Nathaniel D. Daw, Kimberly L. Stachenfeld, and Kevin J. Miller, aims at something harder and arguably more useful: mechanistic modeling, the task of recovering the actual process that generated observed behavior, not just predicting what comes next.

The system works by alternating between two steps. First, an ensemble of sparse neural networks called Disentangled RNNs generates a set of candidate mechanistic hypotheses about an agent's behavior. Second, an active learning module designs the next experiment specifically to discriminate between those competing hypotheses rather than picking experiments at random. The authors tested this loop on the problem of recovering reinforcement learning agents from their behavior in bandit tasks, a standard cognitive science setting.
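The alternation described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the authors' implementation: a small grid of Q-learning learning rates stands in for the Disentangled RNN ensemble, a weighted-variance disagreement score (in the spirit of query-by-committee) stands in for the paper's active learning criterion, and every name here (`atlas_like_loop`, `disagreement`, the design list, `BETA`) is hypothetical.

```python
import math
import random

BETA = 3.0  # softmax inverse temperature; assumed known and shared by all hypotheses

def p_arm0(q):
    """Softmax probability of choosing arm 0 given two Q-values."""
    return 1.0 / (1.0 + math.exp(-BETA * (q[0] - q[1])))

def run_agent(alpha, design, n_trials, rng):
    """Simulate a Q-learner on a two-armed bandit; design = reward probability per arm."""
    q, history = [0.5, 0.5], []
    for _ in range(n_trials):
        choice = 0 if rng.random() < p_arm0(q) else 1
        reward = 1.0 if rng.random() < design[choice] else 0.0
        q[choice] += alpha * (reward - q[choice])
        history.append((choice, reward))
    return history

def replay(alpha, history):
    """Replay a fixed history through one hypothesis; return P(arm 0) before each trial."""
    q, probs = [0.5, 0.5], []
    for choice, reward in history:
        probs.append(p_arm0(q))
        q[choice] += alpha * (reward - q[choice])
    return probs

def log_likelihood(alpha, history):
    """Log-probability of the observed choices under one hypothesis."""
    total = 0.0
    for p0, (choice, _) in zip(replay(alpha, history), history):
        total += math.log(p0 if choice == 0 else 1.0 - p0)
    return total

def posterior(log_liks):
    """Normalized weights over hypotheses, assuming a uniform prior."""
    top = max(log_liks)
    w = [math.exp(l - top) for l in log_liks]
    z = sum(w)
    return [x / z for x in w]

def disagreement(design, hypotheses, weights, n_trials, rng):
    """Weighted variance of predicted P(arm 0), averaged over one simulated session."""
    best = hypotheses[weights.index(max(weights))]
    history = run_agent(best, design, n_trials, rng)  # session simulated from the current best guess
    preds = [replay(a, history) for a in hypotheses]
    score = 0.0
    for t in range(n_trials):
        mean = sum(w * p[t] for w, p in zip(weights, preds))
        score += sum(w * (p[t] - mean) ** 2 for w, p in zip(weights, preds))
    return score / n_trials

def atlas_like_loop(true_alpha=0.3, n_rounds=5, n_trials=40, seed=0):
    """Alternate between re-weighting hypotheses and picking the most discriminative design."""
    rng = random.Random(seed)
    hypotheses = [0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]  # candidate learning rates
    designs = [(0.5, 0.5), (0.8, 0.2), (0.2, 0.8), (0.9, 0.6), (1.0, 0.0)]
    log_liks = [0.0] * len(hypotheses)
    chosen = []
    for _ in range(n_rounds):
        weights = posterior(log_liks)  # step 1: current set of weighted hypotheses
        design = max(designs, key=lambda d: disagreement(d, hypotheses, weights, n_trials, rng))
        data = run_agent(true_alpha, design, n_trials, rng)  # step 2: run the chosen experiment
        log_liks = [l + log_likelihood(a, data) for l, a in zip(log_liks, hypotheses)]
        chosen.append(design)
    return hypotheses, posterior(log_liks), chosen

if __name__ == "__main__":
    hyps, weights, chosen = atlas_like_loop()
    for alpha, w in zip(hyps, weights):
        print(f"alpha={alpha:.2f}  weight={w:.3f}")
    print("designs chosen:", chosen)
```

Swapping the `max(...)` selection for `rng.choice(designs)` gives the random-experimentation baseline that the sample-efficiency comparison refers to; in this toy setting the gap between the two need not resemble the reported figures.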

The headline result is a 5-10x improvement in sample efficiency across all evaluation metrics compared to random experimentation. The system also matched the quality of expert-designed experiments drawn from published cognitive science literature, and generated novel experiments with temporal structure tailored to the agents under study.

The main caveat is that bandit tasks are a relatively constrained domain. It remains to be seen whether the sparse neural network hypothesis space is expressive enough for the more open-ended mechanistic questions that arise in neuroscience, pharmacology, or other fields where the true generative process is not a known RL algorithm. The reporting also gives no clear picture of the computational cost in practice, or of how the system behaves when the correct model falls entirely outside its hypothesis space.

For cognitive scientists and behavioral researchers, though, the direction is worth watching. The core bottleneck in mechanistic science is often not a shortage of data but a shortage of good experiments. A system that designs discriminative experiments automatically, and does so with materially better sample efficiency than random selection, changes what is feasible for smaller labs that currently depend on senior scientists' intuition to plan their studies.

Shared on Bluesky by 3 AI experts