huggingface.co web signal

OPID Boosts Agentic RL With Self-Distilled Trajectory Skills

TL;DR

  • OPID improves over GRPO by +9.3 points on ALFWorld and +10.9 points on WebShop with a 3B model.
  • Skills are extracted from the agent's own training rollouts and are not needed at inference time.
  • Using 80% of training data, OPID outperforms GRPO trained on 100% of data on ALFWorld (78.9% vs 75.0%).

The core tension in training language agents with reinforcement learning is that rewards arrive at the end of long action sequences, leaving the model to guess which of its many decisions actually mattered. Researchers from Tsinghua University, Zhejiang University, the Chinese University of Hong Kong, Nanyang Technological University, and Tongji University propose OPID, detailed in a paper on Hugging Face, as a way to turn every completed training rollout into a denser source of supervision.

The key idea is hierarchical hindsight. After each episode finishes, an analyzer model reads the completed trajectory and extracts two types of skill: an episode-level skill capturing global workflow patterns or failure-avoidance rules, and step-level skills capturing local decision knowledge at critical timesteps. A routing mechanism then decides which granularity to inject at each step during training, converting those skills into token-level log-probability shifts that act as dense training signals alongside the standard outcome-based reward.
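The routing and log-probability-shift steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `HindsightSkills` container, the "step-level skill if one exists, otherwise episode-level" routing rule, the `beta` weight, and the additive way the shift is combined with the outcome advantage are all hypothetical choices made for the example. The article does not give the actual formulas.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HindsightSkills:
    """Skills an analyzer extracted from one finished trajectory (hypothetical container)."""
    episode_skill: str                                         # global workflow or failure-avoidance rule
    step_skills: Dict[int, str] = field(default_factory=dict)  # critical timestep -> local skill


def route_skill(skills: HindsightSkills, step: int) -> str:
    """Assumed routing rule: use the step-level skill at a critical timestep,
    otherwise fall back to the episode-level skill."""
    return skills.step_skills.get(step, skills.episode_skill)


def logprob_shifts(with_skill: List[float], without_skill: List[float]) -> List[float]:
    """Per-token shift: log p(token | context + skill) - log p(token | context)."""
    if len(with_skill) != len(without_skill):
        raise ValueError("both passes must score the same action tokens")
    return [w - wo for w, wo in zip(with_skill, without_skill)]


def dense_token_signal(outcome_advantage: float, shifts: List[float], beta: float = 0.5) -> List[float]:
    """Assumed combination: broadcast the sequence-level outcome advantage
    to every token and add the scaled shift."""
    return [outcome_advantage + beta * s for s in shifts]


skills = HindsightSkills(
    episode_skill="Locate the target object before heading to the destination.",
    step_skills={3: "Open the drawer before trying to take the key."},
)
chosen = route_skill(skills, step=3)  # step 3 is marked critical, so the local skill wins

# Toy log-probabilities for three action tokens, scored with and without the skill in context.
shifts = logprob_shifts([-0.5, -1.0, -0.25], [-1.0, -1.0, -0.75])
signal = dense_token_signal(outcome_advantage=1.0, shifts=shifts)
print(chosen)
print(shifts, signal)
```

Tokens that become more likely once the skill is in context receive a larger per-token signal than the episode-level reward alone would give them, which is the sense in which the supervision becomes denser.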

The "on-policy" qualifier is the important one. Prior skill-based methods pull from external libraries or retrieve skills from unrelated trajectories, which can create a distribution mismatch between what the agent actually experiences and what the retrieved skills describe. OPID avoids that by extracting skills only from the agent's own current rollouts. The skills are also used only during training: the deployed agent carries no skill-retrieval machinery and requires no privileged context at test time.
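To make the training-versus-inference split concrete, here is a hedged sketch of how prompts might differ between the two phases. Everything in it is an assumption for illustration: the prompt layout, the `build_prompt` and `training_prompt_pairs` helpers, and the `toy_analyzer` stand-in (a real analyzer would be a language model call). The only properties taken from the article are that skills come from the agent's own rollout and that the deployed prompt contains none.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[Tuple[str, str]]  # (observation, action) pairs from one rollout


def build_prompt(observation: str, skill: Optional[str] = None) -> str:
    """Prompt for one step; the skill line is privileged, training-only context."""
    hint = f"Skill: {skill}\n" if skill is not None else ""
    return f"{hint}Observation: {observation}\nAction:"


def training_prompt_pairs(
    rollout: Trajectory, analyzer: Callable[[Trajectory], str]
) -> List[Tuple[str, str]]:
    """Return (plain, skill-conditioned) prompt pairs for one on-policy rollout.
    The skill is extracted from this same rollout, never from an external library."""
    skill = analyzer(rollout)
    return [(build_prompt(obs), build_prompt(obs, skill)) for obs, _ in rollout]


def toy_analyzer(rollout: Trajectory) -> str:
    """Stand-in for the analyzer model (hypothetical)."""
    return f"Episode had {len(rollout)} steps; check containers before searching open surfaces."


rollout = [("You are in a kitchen.", "open fridge"), ("The fridge is open.", "take apple")]
pairs = training_prompt_pairs(rollout, toy_analyzer)

# At deployment the agent sees only the plain prompt: no analyzer, no skill store.
deployed_prompt = build_prompt("You are in a bedroom.")
print(pairs[0][1])
print(deployed_prompt)
```

Scoring the same action tokens under both prompts of a pair is one plausible way to obtain the log-probability shifts used as a training signal; the deployed path never calls the analyzer.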

On ALFWorld, a household-task benchmark, OPID improves over plain GRPO by +9.3 percentage points with a 3B-parameter model and +8.8 points with a 7B model. On WebShop, an e-commerce environment, the 3B model gains +10.9 points. On unseen task splits of ALFWorld, OPID averages +7.7 points over GRPO, suggesting the internalized skills generalize rather than overfit. The sample-efficiency result is striking: the OPID-trained 3B model using 80% of the training data scores 78.9% on ALFWorld, outperforming GRPO trained on 100% of the data at 75.0%.
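The sample-efficiency comparison reduces to simple arithmetic on the two quoted ALFWorld scores. The snippet below only restates figures from the article; the variable names are illustrative, and the margin is in percentage points, not a relative improvement.

```python
# Figures quoted in the article for the 3B model on ALFWorld.
opid_score_80pct_data = 78.9   # OPID, trained on 80% of the data
grpo_score_full_data = 75.0    # GRPO, trained on 100% of the data

# Margin in percentage points, rounded to the precision the article reports.
margin_points = round(opid_score_80pct_data - grpo_score_full_data, 1)

# Share of the training data OPID did without.
data_saved_fraction = round(1.0 - 0.80, 2)

print(f"OPID leads by {margin_points} points while using {data_saved_fraction:.0%} less data")
# prints: OPID leads by 3.9 points while using 20% less data
```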

One caveat: skill extraction relies on a separate analyzer model during training (the paper uses GLM-5.2), a dependency the paper does not extensively stress-test. How sensitive the results are to analyzer quality, and how the method performs on less structured benchmarks beyond ALFWorld and WebShop, remain open questions. Code is available at github.com/jinyangwu/OPID. For teams fine-tuning small models for agentic tasks, the sample-efficiency result is the most actionable finding: extracting more supervision from each rollout is a direct lever on training cost.