
DOPD Names On-Policy Distillation's 'Privilege Illusion'

TL;DR

  • The DOPD paper identifies a failure mode called 'privilege illusion', in which on-policy distillation conflates a transferable capability gap with an information asymmetry gap that cannot be replicated.
  • The proposed fix routes token-level supervision between a privileged teacher and a privileged student based on their advantage gap, rather than treating tokens uniformly.
  • Reported experiments cover both LLM and VLM settings and claim gains over vanilla on-policy distillation, plus improvements on stability, robustness, continual learning, and out-of-distribution tasks.

A new preprint flags something small but important about the on-policy distillation recipe that most labs now lean on to shrink big teacher models into cheaper students. The authors call the failure mode the "privilege illusion", and their paper on arXiv argues it is what happens when you feed either the teacher or the student privileged information and then treat every token of the resulting supervision as equally trainable.

The claim is that this conflates two very different things: a transferable capability gap that students are meant to close, and an information asymmetry gap that can be mimicked but never replicated. Because only a small subset of tokens carries what the authors call pivotal capability-bearing signals, routing all of that supervision back into the student produces what they describe as rapid entropy collapse, reduced exploration, and ultimately poor distillation effectiveness. The student memorises hints instead of learning to reason.

Their fix, DOPD, is described as an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between a privileged teacher and a privileged student policy based on their advantage gap. Different tokens get different strengths, objectives, and strategies, on the theory that only some tokens are worth close teacher imitation and the rest are better learned via the student's own signal.
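To make the routing idea concrete, here is a minimal sketch of what per-token routing between two supervisors could look like. It is not the authors' implementation: the abstract does not specify the routing function or the per-token objectives, so the sigmoid gate, the use of KL divergence for both terms, and every name below (routing_weight, routed_token_losses, advantage_gaps) are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def routing_weight(advantage_gap, temperature=1.0):
    """Map a per-token advantage gap to a weight in (0, 1).

    Assumption: a sigmoid gate. A large positive gap sends the token
    towards teacher imitation; a large negative gap sends it towards
    the privileged student's signal.
    """
    return 1.0 / (1.0 + math.exp(-advantage_gap / temperature))


def routed_token_losses(student_probs, teacher_probs, priv_student_probs,
                        advantage_gaps, temperature=1.0):
    """Blend two per-token distillation losses according to the gate.

    Each *_probs argument is a list with one next-token distribution per
    position; advantage_gaps holds one hypothetical scalar per position.
    """
    losses = []
    for s, t, ps, gap in zip(student_probs, teacher_probs,
                             priv_student_probs, advantage_gaps):
        w = routing_weight(gap, temperature)
        teacher_term = kl_divergence(s, t)   # pull towards the privileged teacher
        student_term = kl_divergence(s, ps)  # pull towards the privileged student
        losses.append(w * teacher_term + (1.0 - w) * student_term)
    return losses


# Toy example: two token positions over a two-word vocabulary.
student = [[0.5, 0.5], [0.5, 0.5]]
teacher = [[0.9, 0.1], [0.9, 0.1]]
priv_student = [[0.6, 0.4], [0.6, 0.4]]
gaps = [4.0, -4.0]  # first token routed mostly to the teacher, second mostly away

for gap, loss in zip(gaps, routed_token_losses(student, teacher, priv_student, gaps)):
    print(f"gap={gap:+.1f}  loss={loss:.4f}")
```

The contrast with vanilla on-policy distillation is the gate: fixing the weight at 1 for every position recovers uniform teacher imitation, which is the setting the paper argues leads to the privilege illusion.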

The reported experiments span both large language model and vision-language model settings, with the authors claiming that DOPD consistently outperforms vanilla on-policy distillation and other counterparts, plus further improvements on stability, robustness, continual learning and out-of-distribution tasks.

The honest caveat is that this is a fresh arXiv writeup, not peer-reviewed, and the baselines are the authors' own choices. What the abstract does not give you is the compute overhead of running dual policies, how the method holds up against reinforcement-learning-from-teacher pipelines rather than pure distillation, or how it behaves on reasoning traces much longer than those in the covered benchmarks. But if the routing idea generalises, it lowers the cost of getting a small student close to a big teacher, which is exactly the seam every efficiency-conscious lab is working at right now.