arxiv.org web signal

OLMo study: bigger models retain rare tasks via less interference

TL;DR

  • The paper argues power-law scaling already implies a larger model will learn parts of the data distribution a smaller model cannot, even with infinite training data.
  • Pretraining experiments on OLMo models from 4M to 4B parameters found only the larger models learned the infrequent and complex tasks.
  • The proposed mechanism is reduced gradient interference: weaker common-task updates leave rare-task features intact in larger models.

There is a question that has been hovering around scaling debates for a couple of years. When a bigger model can do something a smaller one cannot, is the smaller model missing the expressive capacity, or is something else going wrong? A new paper on arXiv argues that, in many practical cases, the answer is closer to the second.

The authors propose that power-law scaling already implies a larger model will learn parts of the data distribution that a smaller model fails to learn, even with infinite training data. Their account is data-centric. Smaller models, they argue, allocate their neurons to high-frequency or low-complexity tasks, so rare and complex tasks end up underlearned even when a network of that size could in principle express them.
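One way to see why a power law makes the infinite-data claim plausible is to write the loss in a commonly used parametric form. This is a sketch under assumptions: the symbols N (parameters), D (training tokens), E, A, B, alpha and beta are illustrative, and the paper may use a different formulation.

```latex
% Assumed parametric scaling form (illustrative, not taken from the paper)
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad A, B, \alpha, \beta > 0

% With unlimited data the data term vanishes, but the model-size term does not
\lim_{D \to \infty} L(N, D) = E + \frac{A}{N^{\alpha}}

% So for N_{\text{small}} < N_{\text{large}} a gap remains at infinite data
\lim_{D \to \infty} \bigl[ L(N_{\text{small}}, D) - L(N_{\text{large}}, D) \bigr]
  = A \left( N_{\text{small}}^{-\alpha} - N_{\text{large}}^{-\alpha} \right) > 0
```

Under this form, the residual gap is loss the smaller model never recovers no matter how much data it sees, which is consistent with the reading that some part of the distribution stays unlearned at that size.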

To check this, they pretrain OLMo models from 4M to 4B parameters on novel tasks of varying frequency and complexity, and report that only the larger OLMo models successfully learn the infrequent and complex ones. The proposed mechanism is what they call reduced interference: in a larger model, the gradient updates for common tasks become weak enough that they stop overwriting rare-task features as those features slowly accumulate. The larger models also embed more task features in their representations.
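The interference idea can be made concrete with a toy calculation. The sketch below is not the paper's measurement. It assumes task gradients behave like random directions, and the function names `rare_loss_change` and `mean_abs_cosine` are hypothetical. It relies on two standard facts: to first order, an SGD step on a common task changes the rare-task loss by minus the learning rate times the dot product of the two gradients, and random directions in higher-dimensional spaces overlap less on average.

```python
import math
import random


def rare_loss_change(g_rare, g_common, lr):
    """First-order change in the rare-task loss after one SGD step
    on the common task: delta = -lr * <g_rare, g_common>."""
    return -lr * sum(a * b for a, b in zip(g_rare, g_common))


def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian
    vectors, used here as a crude stand-in for gradient overlap."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        total += abs(dot) / (norm_u * norm_v)
    return total / pairs


if __name__ == "__main__":
    # Opposing gradients: a common-task step raises the rare-task loss.
    print(rare_loss_change([1.0, 0.0], [-1.0, 0.0], lr=0.1))
    # Overlap between random directions shrinks as dimension grows.
    for dim in (4, 64, 1024):
        print(dim, round(mean_abs_cosine(dim), 3))
```

Real gradients are not random vectors, so this only illustrates why more parameters can leave more room for rare-task features to sit out of the way of common-task updates; it says nothing about the scale at which that happens in OLMo.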

For anyone making data-mixture or model-sizing calls, this reframes a familiar problem. If your model is missing a long tail of behaviour, the paper's claim is that piling on more rare-task data may not be sufficient. The bottleneck can be one of capacity and interference, which calls for a different fix than adding data.

The honest caveat is that the experiments stop at 4B parameters and lean on a synthetic setup alongside the OLMo runs. How cleanly the interference story extends to frontier-scale models, or to fine-tuning rather than pretraining, is not something the paper resolves. The reporting also gives no quantitative threshold for where the interference regime kicks in. But as a mechanistic argument for why the scaling curves look the way they do, it is the kind of result that ought to sharpen how teams think about both monolithic scaling and routed alternatives like mixture-of-experts.
