paper web signal

Control Token Spikes, Not Skill Loss, Drive Agentic RL Collapse

TL;DR

  • RL failure in multi-step tool-use tasks stems from control token probability spikes corrupting format, not from loss of underlying tool-use capability.
  • Process Reflection Supervision raised average scores from 3.50 to 25.75 on BFCL-V3, outperforming vanilla RL and SFT-interleaved approaches on Qwen2.5-1.5B-Instruct.
  • Interleaving SFT with RL stabilizes training but causes sharp performance degradation on out-of-distribution format and content scenarios.

Anyone who has watched an agentic RL training run suddenly produce malformed tool calls has probably assumed the model lost its tool-use ability. A paper from the Institute of Automation, Chinese Academy of Sciences says that diagnosis is typically wrong, and that the confusion has real consequences for how teams choose to fix it.

The paper, "Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It," isolates the failure mechanism: RL training causes "unexpected probability spikes in specific control tokens" like `<tool_call>` and `<|im_end|>`, causing generation to degenerate into malformed sequences. The underlying capability remains, hidden behind broken formatting. The authors classify outputs into four structural states: healthy tool call, healthy response, text pollution, and full collapse, giving practitioners a cleaner vocabulary for diagnosing where a failing run actually sits.

On the BFCL-V3 benchmark using Qwen2.5-1.5B-Instruct and Qwen3-1.7B with 300 training instances, the authors tested several remedies. A vanilla RL baseline averaged 3.50. Interleaving supervised fine-tuning with RL brought that to 17.25. Their novel Process Reflection Supervision approach reached 25.75. The code is released at github.com/hypasd-art/Tool-RL-Box.

The tradeoff to watch: the SFT-hybrid fix that stabilizes training also produces sharp degradation when models encounter out-of-distribution format and content scenarios. Teams validating only on in-distribution benchmarks may be measuring the wrong thing. The study was also conducted on models in the 1.5B to 1.7B parameter range, so whether the mechanism and its remedies hold at the larger scales common in production agentic pipelines is not yet answered.

For practitioners, the diagnostic reframe is the most immediately actionable finding: before assuming capability regression, check whether control token distributions are spiking. Treating output structural health as a first-class training signal could enable earlier intervention than waiting for benchmark scores to fall.