Progress Advantage Extracts Free Step Rewards From RL Training
TL;DR
- The log-probability ratio between an RL-trained policy and its reference policy mathematically recovers the optimal advantage function, with no annotation required.
- Progress advantage surpassed dedicated trained reward models across five benchmarks and four model families without any task-specific training.
- In best-of-8 test-time scaling, progress advantage reached 38.8% success on Gemma4-4B versus 29.0% and 27.4% for two confidence-based baselines.
Building process reward models for agents has been one of the more stubborn cost problems in RL post-training. Long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale, according to a new paper by Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, Tanwi Mallick, and Sharon Li.
Their central argument is that the signal was already there. The paper derives what the authors call "progress advantage" -- the log-probability ratio between an RL-trained policy and its reference policy -- and shows this ratio exactly recovers the optimal advantage function. The result is described as "annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline." The formulation holds for algorithms with explicit KL regularization, including GRPO and PPO, as well as for clipping-based surrogates like DAPO.
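For readers who want the algebra behind that claim, the following is a sketch of the textbook KL-regularized RL identity, not the paper's own derivation, which the article does not reproduce. The temperature β, the soft value functions Q* and V*, and the notation are assumptions introduced here for illustration.

```latex
% Assumed objective: expected return minus a per-step KL penalty toward the reference policy.
\[
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_t r(s_t,a_t)\Big]
  \;-\; \beta\,\mathbb{E}_{\pi}\Big[\sum_t \mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\Big]
\]
% Its optimal policy reweights the reference policy by the exponentiated (soft) advantage.
\[
\pi^{*}(a\mid s) \;=\; \pi_{\mathrm{ref}}(a\mid s)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^{*}(s,a)-V^{*}(s)\big)\Big),
\qquad
V^{*}(s) \;=\; \beta\log\sum_{a'}\pi_{\mathrm{ref}}(a'\mid s)\,\exp\!\big(Q^{*}(s,a')/\beta\big)
\]
% Taking logs and rearranging gives the advantage as a scaled log-probability ratio.
\[
\beta\,\log\frac{\pi^{*}(a\mid s)}{\pi_{\mathrm{ref}}(a\mid s)}
  \;=\; Q^{*}(s,a)-V^{*}(s) \;=\; A^{*}(s,a)
\]
```

The identity is exact only for the optimal policy π*. A policy obtained from a finite training run approximates it, so the log-ratio of a real checkpoint is an estimate of the advantage rather than the advantage itself.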
The validation spans five benchmarks and four model families: Gemma4-4B, Qwen3.5-9B, Qwen3-14B, and Olmo3-7B. In test-time scaling using best-of-8 sampling, progress advantage reached 38.8% average success on Gemma4-4B, against 29.0% for Self-Certainty and 27.4% for DeepConf. On Qwen3.5-9B it reached 62.1%, versus baselines in the 51-55% range. On uncertainty quantification, the method showed substantially higher AUROC than all baselines for trajectory success prediction on τ²-bench. On failure attribution, it was competitive with task-specific trained methods on the Who & When benchmark. Across all settings, it reportedly surpassed dedicated trained reward models without any task-specific training.
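Best-of-N selection with a step-level score can be sketched as below. This is an illustrative sketch, not the paper's procedure: the article does not say how step scores are aggregated into a trajectory score, so summation is an assumption, and `select_best_of_n` and its `aggregate` parameter are hypothetical names.

```python
from typing import Callable, Sequence


def select_best_of_n(
    candidates: Sequence[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = sum,
) -> int:
    """Return the index of the candidate trajectory with the highest aggregated score.

    Each candidate is a sequence of per-step scores (for example, per-step
    log-probability ratios). Summation is an assumed default, not a rule
    taken from the paper.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    totals = [aggregate(steps) for steps in candidates]
    # On ties, max() keeps the earliest candidate.
    return max(range(len(totals)), key=totals.__getitem__)


# Three sampled trajectories, each with two scored steps.
sampled = [[0.2, -0.1], [0.5, 0.4], [-0.3, 0.9]]
best_index = select_best_of_n(sampled)
```

Passing a different `aggregate` (such as `min`, to penalize a trajectory for its single worst step) changes the selection rule without changing the scoring.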
The honest caveat is that the derivation is tied to stochastic MDPs with KL regularization. RL variants that do not satisfy those conditions may not inherit the theoretical guarantees, and the paper does not address how the advantage signal behaves as the trained policy drifts far from its reference over many training iterations -- a real concern in long production runs.
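For practitioners already running GRPO, PPO, or DAPO: if you kept your reference checkpoint, you already have the step-level scoring signal. Better test-time selection, trajectory evaluation, and failure diagnosis then require no annotation or extra training, only the forward passes needed to score each step's log-probabilities under both the trained policy and the reference.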
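A minimal sketch of that scoring step, assuming you have already extracted per-token log-probabilities for each agent step from both checkpoints. The function names, the `beta` scale, and the choice to sum token-level log-ratios within a step are assumptions for illustration; the article does not specify the paper's exact step-level definition.

```python
from typing import List, Sequence, Tuple


def step_progress_advantage(
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Score one step as beta * sum over tokens of (log pi - log pi_ref).

    Both sequences must hold log-probabilities of the SAME generated tokens,
    one scored by the RL-trained policy and one by the reference policy.
    `beta` is an assumed scale factor (for example, the KL coefficient).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    return beta * sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def score_trajectory(
    steps: Sequence[Tuple[Sequence[float], Sequence[float]]],
    beta: float = 1.0,
) -> List[float]:
    """Return one score per step for a trajectory of (policy, reference) pairs."""
    return [step_progress_advantage(p, r, beta) for p, r in steps]


# Two steps: the trained policy favors the first action more than the
# reference does (positive score) and the second one less (negative score).
trajectory = [([-0.1, -0.2], [-0.5, -0.9]), ([-1.2], [-0.3])]
scores = score_trajectory(trajectory)
```

Under this reading, a strongly negative step score flags an action the trained policy considers worse than the reference would, which is the kind of signal that could be used for failure diagnosis.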
Originally reported by paper
Original headline: RL Training Already Contains a Free Step-Level Agent Reward—No Annotation Needed