Onset of plasticity loss in LLMs grows sublinearly with model size, paper finds
TL;DR
- Authors tested GPT-style Transformer models from 5M to 314M non-embedding parameters and observed plasticity loss across every size on a held-out Vietnamese probing task.
- The paper reports the onset of plasticity loss grows sublinearly with model size, so larger models delay the problem but more parameters alone likely will not eliminate it.
- Plasticity loss also appears under stationary multilingual training, not only under abrupt continual-learning task changes, per the abstract.
A recurring assumption in the LLM era is that scale fixes most things, including the awkward question of what happens to a network's ability to keep learning after it has already learned a lot. A short paper on arXiv by J. Fernando Hernandez-Garcia, Tomás Figliolia and Beren Millidge takes that assumption seriously and reports that scale does help, but mostly by delaying the problem rather than removing it.
The setup is small but pointed. The authors train GPT-style Transformer models ranging from 5M to 314M non-embedding parameters on a multilingual continual learning problem, then measure how well each model can still adapt to new data by tracking deterioration on a held-out Vietnamese probing task. They report plasticity loss across every size they tested, and they argue the onset of that loss follows a predictable scaling law that grows sublinearly with model size. In their words, larger models may delay the measurable effects of plasticity loss, but increasing parameter count alone is likely to be insufficient to completely prevent it.
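To make "sublinear" concrete, here is a minimal sketch that assumes the onset follows a power law in parameter count, onset ∝ N^b with 0 < b < 1. The summary above does not state the functional form or the fitted exponent, so both the power-law shape and the value b = 0.5 below are illustrative assumptions, not figures from the paper. The helper name `onset_multiplier` is likewise hypothetical.

```python
def onset_multiplier(size_factor: float, exponent: float) -> float:
    """Factor by which the onset of plasticity loss is delayed when the
    parameter count grows by `size_factor`, under an ASSUMED power law
    onset ∝ N**exponent. Sublinear growth means 0 < exponent < 1."""
    if size_factor <= 0:
        raise ValueError("size_factor must be positive")
    return size_factor ** exponent


# Hypothetical exponent chosen only for illustration; it is NOT a value
# reported by the paper.
HYPOTHETICAL_EXPONENT = 0.5

for factor in (2, 10, 100):
    delay = onset_multiplier(factor, HYPOTHETICAL_EXPONENT)
    print(f"{factor}x parameters -> {delay:.2f}x later onset")
```

Under this assumed exponent, a 100x increase in parameters delays the onset by only 10x. Any exponent below 1 gives the same qualitative picture: each extra order of magnitude in size buys less than an order of magnitude of delay, which is why scale postpones the problem rather than removing it.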
The result that caught my eye is a second finding. The authors say they also see plasticity loss under stationary multilingual training, not just under abrupt task changes. If that holds up, it pushes against the convenient view that plasticity loss is mostly a continual-learning artefact you can paper over by smoothing out the data distribution.
The honest caveat is the range studied. 5M to 314M non-embedding parameters is small by current frontier standards, and a sublinear scaling law extrapolated past the regime it was fit on is a hint about direction, not a number to bank on. The paper also does not tell you which mitigation techniques actually work at scale, which is the practitioner's real question, and the abstract is silent on how plasticity loss interacts with later stages like instruction tuning.
What the abstract does give you is a useful prior. If you are planning long pretraining runs followed by significant post-training distribution shifts, do not assume the next size bump alone will buy back the adaptability you lose along the way.
Originally reported by arxiv.org
Original headline: Can Scale Save Us From Plasticity Loss in Large Language Models?