LinkedIn's TRIAGE Adds Role Labels to GRPO, Lifts ALFWorld 7.9pt

TL;DR

  • TRIAGE classifies each trajectory segment as Decisive, Exploration, No-progress, or Regression, then adds a fixed role-conditioned bonus on top of the GRPO outcome advantage.
  • On Qwen2.5-7B-Instruct, TRIAGE reports 87.5% success on ALFWorld versus 79.6% for GRPO, a 7.9-point gap, with smaller lifts of 7.1 points on WebShop and 4.8 points on Search-QA.
  • Beyond accuracy, completed-rollout length shrinks 10.4% on ALFWorld and 14.8% on WebShop, driven mainly by suppressing regressive actions in otherwise successful trajectories.

A recurring frustration with GRPO on agent tasks is that the final verifier signal gets smeared uniformly across every token in a rollout. Useful exploration inside a failed trajectory gets punished, and lazy or regressive actions inside a successful trajectory get rewarded. A new paper on Hugging Face from a LinkedIn-led team, called TRIAGE, takes a fairly simple swing at that: add a semantic role axis on top of the outcome advantage.

The mechanism is not fancy. Each trajectory segment is classified by an LLM judge into one of four roles, Decisive, Exploration, No-progress, or Regression, and each role gets a fixed process reward of 1.0, 0.5, -0.1, or -0.5, respectively. That role bonus is scaled by a mixing coefficient λ (0.2 on ALFWorld and WebShop, 0.4 on Search-QA) and added to the standard group-normalized GRPO advantage before the usual clipped update. The role constants themselves are never tuned across tasks, which is the part that makes it look like a real drop-in rather than a bespoke recipe; only λ varies by benchmark.
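To make the arithmetic concrete, here is a minimal Python sketch of the shaping step described above. It is not the authors' code. The function names are hypothetical, and three details are assumptions rather than anything the write-up specifies: the outcome advantage is a population z-score within the rollout group, the role bonus is added unnormalized, and every token in a segment would share that segment's value.

```python
from statistics import mean, pstdev

# Fixed role rewards as quoted in the article; the same constants for every task.
ROLE_REWARD = {
    "Decisive": 1.0,
    "Exploration": 0.5,
    "No-progress": -0.1,
    "Regression": -0.5,
}


def group_normalized_advantages(outcome_rewards, eps=1e-8):
    """GRPO-style outcome advantage: z-score each rollout's reward within its group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def role_shaped_advantages(outcome_rewards, segment_roles, lam=0.2):
    """Return one advantage per segment for each rollout in the group.

    outcome_rewards: one verifier reward per rollout.
    segment_roles:   one list of role labels per rollout (from the judge).
    lam:             mixing coefficient (0.2 or 0.4 in the article).
    """
    outcome_adv = group_normalized_advantages(outcome_rewards)
    return [
        [adv + lam * ROLE_REWARD[role] for role in roles]
        for adv, roles in zip(outcome_adv, segment_roles)
    ]


if __name__ == "__main__":
    # Toy group of two rollouts: one success, one failure.
    rewards = [1.0, 0.0]
    roles = [
        ["Exploration", "Regression", "Decisive"],
        ["Exploration", "No-progress"],
    ]
    for rollout in role_shaped_advantages(rewards, roles, lam=0.2):
        print([round(a, 3) for a in rollout])
```

In this toy group the outcome advantages are roughly +1 and -1. With λ = 0.2, the Regression segment in the successful rollout drops to about 0.9 while its Decisive segment rises to about 1.2, and the Exploration segment in the failed rollout is penalized at about -0.9 rather than -1.0. The segments are no longer credited uniformly, but the outcome signal still dominates at this λ.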

The headline numbers, as reported, are 87.5% success on ALFWorld with Qwen2.5-7B-Instruct versus 79.6% for GRPO, a gap the authors quote as 7.9 ± 2.8 points across ten runs. WebShop moves from 70.1% to 77.2%, and Search-QA from 43.3% to 48.1%. Completed-rollout length also drops 10.4% on ALFWorld and 14.8% on WebShop, which the ablation attributes mostly to the regression penalty rather than the exploration bonus. Against a shared-backbone scalar value baseline, the biggest gap shows up on WebShop, where redundant attribute clicks leave observations nearly identical and an outcome-trained critic cannot tell productive repeats from wasted ones.
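As a quick sanity check on those figures, the sketch below recomputes the absolute gaps and also expresses them relative to the GRPO baseline. The success rates are the ones quoted above. The `lifts` helper and the relative-gain framing are illustrative additions, not something the paper is reported to compute.

```python
# Success rates (%) as quoted in the article: (GRPO, TRIAGE).
REPORTED = {
    "ALFWorld": (79.6, 87.5),
    "WebShop": (70.1, 77.2),
    "Search-QA": (43.3, 48.1),
}


def lifts(grpo, triage):
    """Return (absolute gain in points, relative gain in % of the GRPO baseline)."""
    absolute = round(triage - grpo, 1)
    relative = round(100.0 * (triage - grpo) / grpo, 1)
    return absolute, relative


if __name__ == "__main__":
    for task, (grpo, triage) in REPORTED.items():
        absolute, relative = lifts(grpo, triage)
        print(f"{task}: +{absolute} points ({relative}% relative to GRPO)")
```

The absolute gaps come out to 7.9, 7.1, and 4.8 points. In relative terms the order flips, at roughly 9.9%, 10.1%, and 11.1%, so the smallest absolute lift (Search-QA) is the largest proportional one, although that benchmark rests on a single run.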

The honest caveats are worth stating. The role labels are semantic estimates from a Qwen3-8B judge with thinking mode on, not ground truth, and the paper concedes the judge can overvalue plausible exploration or miss subtle regressions. Search-QA is reported as a single run because retrieval-augmented rollouts are expensive, so there is no run-to-run variance estimate and the evidence there is much thinner than on the other two benchmarks. And the reporting does not give you a cost or wall-clock comparison against just training GRPO longer with a scalar process reward, which is the comparison a budget-constrained team actually needs.

Still, if the framing holds up under replication, the interesting shift is that a lot of the value in agentic RL may live in structured, auditable per-segment labels rather than in another learned critic. That is a friendlier direction for teams that want to inspect what their trainer thinks a rollout did, not just what score it landed on.