Paper warns SDPO can collapse in continual post-training

TL;DR

  • The paper reports SDPO can accelerate in-domain specialization when teacher signals are stable, but struggles to generalize out of distribution.
  • In continual post-training, SDPO exhibits stronger forgetting than GRPO and can even collapse, according to the authors' experiments.
  • Denser self-distillation induces larger drift in parameter and response space and can amplify formatting artifacts via a teacher-student loop.

A quiet result on arXiv this week is worth flagging for anyone doing continual post-training. Self-distillation policy optimization, or SDPO, has been picking up steam as an attractive way to keep on-policy data flowing while a model learns new skills without forgetting old ones. A new paper from Meng Wang and collaborators argues the story is more complicated than that.

The authors report that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, which is the part everyone likes about it. But once you push it into continual post-training, where the model has to keep acquiring new knowledge while preserving what it already knew, they find SDPO exhibits stronger forgetting and can even collapse. GRPO, the on-policy reinforcement learning method it is often benchmarked against, adapts more conservatively and better preserves prior capabilities, according to their experiments.

The mechanism they point at is worth pausing on. Denser self-distillation, they write, induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher-student loop. So the failure mode isn't just underperformance. It compounds: the teacher gradually rewards its own tics until the response distribution narrows.
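To make "drift in parameter space and response space" concrete, here is a minimal sketch of two proxies you could log between checkpoints. These are my own stand-ins, not the paper's definitions, which the abstract does not spell out: relative L2 distance over flattened weights for parameter drift, and mean per-position KL divergence between next-token distributions on a fixed probe set for response drift. The function names and the toy inputs are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def parameter_drift(before: Dict[str, Sequence[float]],
                    after: Dict[str, Sequence[float]]) -> float:
    """Relative L2 distance between two flattened parameter snapshots.

    Assumption: both snapshots map the same tensor names to flat lists
    of floats. This is an illustrative proxy, not the paper's metric.
    """
    sq_diff = 0.0
    sq_ref = 0.0
    for name, ref in before.items():
        cur = after[name]
        if len(cur) != len(ref):
            raise ValueError(f"shape mismatch for {name}")
        sq_diff += sum((c - r) ** 2 for c, r in zip(cur, ref))
        sq_ref += sum(r * r for r in ref)
    if sq_ref == 0.0:
        # No reference norm to normalize by; fall back to the absolute distance.
        return math.sqrt(sq_diff)
    return math.sqrt(sq_diff) / math.sqrt(sq_ref)


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def response_drift(ref_logits: Sequence[Sequence[float]],
                   cur_logits: Sequence[Sequence[float]]) -> float:
    """Mean per-position KL(ref || cur) over next-token distributions.

    Assumption: both arguments hold logits for the same probe positions,
    one row per position, with the same vocabulary order.
    """
    if len(ref_logits) != len(cur_logits) or not ref_logits:
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for ref_row, cur_row in zip(ref_logits, cur_logits):
        log_p = log_softmax(ref_row)
        log_q = log_softmax(cur_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(ref_logits)


# Toy usage with made-up numbers.
old = {"w": [3.0, 4.0]}
new = {"w": [3.0, 9.0]}
print(parameter_drift(old, new))
print(response_drift([[2.0, 0.0, 0.0]], [[0.0, 2.0, 0.0]]))
```

Logging both numbers per checkpoint on a frozen probe set would at least show whether a denser distillation schedule moves the model further than a GRPO-style baseline does over the same number of steps.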

The honest caveat is that this is a single paper, and the abstract is all that has landed publicly so far. The reporting doesn't yet give you the specific model sizes, the benchmarks, or any recommended mitigation beyond preferring GRPO-style updates. It also doesn't say whether the collapse pattern holds across different teacher-student model families or is tied to particular architectures.

The takeaway I'd carry forward: on-policy data alone is not the free lunch for continual learning that some recent write-ups have implied. If your pipeline treats dense self-distillation as a default stabilizer, this is a reason to at least benchmark it against a more conservative on-policy RL baseline and log the drift before the next checkpoint quietly forgets something you cared about.
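If you want a cheap tripwire for the "quietly forgets" case, one option is the forgetting measure common in continual-learning evaluations: for each task, the best score any earlier checkpoint reached minus the latest score. The sketch below assumes you already record one dict of per-task scores per checkpoint. The task names, scores, and the 0.02 tolerance are invented for illustration, and nothing here comes from the paper.

```python
from typing import Dict, List


def forgetting(history: List[Dict[str, float]]) -> Dict[str, float]:
    """Per-task forgetting: best earlier score minus the latest score.

    `history` holds one {task: score} dict per checkpoint, oldest first.
    Tasks with no earlier measurement are skipped; gains are floored at 0.
    """
    if len(history) < 2:
        return {}
    latest = history[-1]
    result: Dict[str, float] = {}
    for task, score in latest.items():
        earlier = [h[task] for h in history[:-1] if task in h]
        if earlier:
            result[task] = max(0.0, max(earlier) - score)
    return result


def flag_regressions(history: List[Dict[str, float]],
                     tolerance: float = 0.02) -> List[str]:
    """Names of tasks whose forgetting exceeds `tolerance` (hypothetical default)."""
    return sorted(t for t, f in forgetting(history).items() if f > tolerance)


# Made-up scores for three checkpoints of a hypothetical run.
runs = [
    {"math": 0.70, "code": 0.60},
    {"math": 0.72, "code": 0.50},
    {"math": 0.71, "code": 0.40, "law": 0.30},
]
print(flag_regressions(runs))
```

Running the same check on an SDPO-style run and a GRPO-style run over identical checkpoints gives you a like-for-like comparison rather than relying on final in-domain accuracy alone.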