SkillCoach: self-evolving rubrics grade LLM agent skill use

TL;DR

  • SkillCoach, from HKUST(GZ) and JD.COM, evolves per-task rubrics scoring four trajectory dimensions: skill selection, following, composition, and reflection.
  • Rubric-filtered SFT lifts Qwen3.5-4B from 8.0 to 24.0 and Qwen3.5-9B from 14.0 to 32.0 final accuracy under gold-plus-distractor libraries.
  • In a 50k SKILL.md stress test, Gemini 3.1 Pro starts to degrade at around 45-46 distractors, while Opus 4.7 holds until around 194-195.

Most agent evaluations still stop at whether the task passed a verifier. A new paper on Hugging Face from researchers at HKUST(GZ) and JD.COM argues that pass/fail is far too coarse: an agent can hit a green tick while grabbing the wrong skill, skipping a required step, or reaching the answer through trial and error. Their framework, SkillCoach, treats agentic skill-use as a trajectory-level ability with four dimensions (skill selection, skill following, skill composition, and skill-grounded reflection) and evolves a per-task rubric that grades each one, keeping the external verifier as a separate outcome signal.
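To make the separation between process grading and outcome checking concrete, here is a minimal Python sketch. The four dimension names come from the paper as described above; the `TrajectoryGrade` class, the 0-to-1 score scale, and the example values are illustrative assumptions, not the authors' data model.

```python
from dataclasses import dataclass

# Dimension names follow the article; everything else here is hypothetical.
DIMENSIONS = ("selection", "following", "composition", "reflection")


@dataclass(frozen=True)
class TrajectoryGrade:
    # Assumed: one rubric score in [0, 1] per dimension.
    rubric_scores: dict[str, float]
    # External verifier result, stored separately from the rubric scores.
    verifier_passed: bool

    def weakest_dimension(self) -> str:
        """Return the dimension with the lowest rubric score."""
        return min(DIMENSIONS, key=lambda d: self.rubric_scores[d])


# A trajectory that passes the verifier but picked a poor skill.
grade = TrajectoryGrade(
    rubric_scores={
        "selection": 0.2,
        "following": 0.9,
        "composition": 0.8,
        "reflection": 0.7,
    },
    verifier_passed=True,
)
print(grade.verifier_passed, grade.weakest_dimension())
```

Keeping `verifier_passed` out of the score dictionary mirrors the article's point that a green tick and a well-executed trajectory are different signals.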

The rubric-quality numbers are the most concrete result. On 50 paired test instances from 10 held-out task families, the evolved rubric R^best raises gold-keypoint coverage from 71.56 to 83.70, usability from 81.53 to 94.33, and trajectory-filtering consistency from 82.00 to 96.00, while the hallucination rate drops from 2.00 to 0.00. Judging was done by Gemini 3.1 Pro at temperature 0 with the verifier signal held out, so the audit reads as improvement in the rubric artifact rather than a proxy for final success.

Downstream, the evolved rubric is used to filter supervised fine-tuning data. Under a gold-plus-distractor library, rubric-filtered SFT lifts Qwen3.5-4B final accuracy from 8.0 to 24.0 and Qwen3.5-9B from 14.0 to 32.0. Outcome-only filtering does worse: it drops the 4B result from 8.0 to 6.0 and lifts the 9B result only from 14.0 to 18.0, which the authors read as evidence that verifier-passing trajectories are not automatically reusable demonstrations. Removing the key-step following criterion causes the largest ablation drop, pulling 4B back to 10.0 and 9B to 16.0.
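The difference between the two curation strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's pipeline: the `Trajectory` fields, the single aggregate `rubric_score`, the 0.8 threshold, and the choice to also require a verifier pass in the rubric filter are all hypothetical.

```python
from typing import NamedTuple


class Trajectory(NamedTuple):
    task_id: str
    verifier_passed: bool
    rubric_score: float  # assumed aggregate rubric score in [0, 1]


def outcome_only_filter(trajs: list[Trajectory]) -> list[Trajectory]:
    """Keep every trajectory the external verifier accepted."""
    return [t for t in trajs if t.verifier_passed]


def rubric_filter(
    trajs: list[Trajectory],
    threshold: float = 0.8,      # hypothetical cut-off
    require_pass: bool = True,   # assumption: also demand a verifier pass
) -> list[Trajectory]:
    """Keep trajectories whose process quality clears the threshold."""
    return [
        t
        for t in trajs
        if t.rubric_score >= threshold and (t.verifier_passed or not require_pass)
    ]


trajs = [
    Trajectory("a", verifier_passed=True, rubric_score=0.9),   # clean success
    Trajectory("b", verifier_passed=True, rubric_score=0.3),   # passed by trial and error
    Trajectory("c", verifier_passed=False, rubric_score=0.85), # good process, failed check
]
print([t.task_id for t in outcome_only_filter(trajs)])  # ['a', 'b']
print([t.task_id for t in rubric_filter(trajs)])        # ['a']
```

Trajectory `b` is the case the authors flag: it passes the verifier, so outcome-only filtering keeps it as a demonstration, while a process-aware filter discards it.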

The paper is also honest about where selection breaks. In a distractor-boundary stress test built from 35,554 real SKILL.md documents in the Skill-Usage pool (extended to a 50k library with SkillHub-derived skills), even strong closed models degrade early: Gemini 3.1 Pro around 45-46 distractors, GPT-5.5 around 55-56, and Opus 4.7 around 194-195. Semantically similar distractors are the most damaging: GPT-5.5's selection F1 drops from 0.84 under random unrelated distractors to 0.59 under high-similarity ones, and Opus 4.7 falls from 0.87 to 0.71.
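The article does not say how selection F1 is computed. A minimal sketch, assuming the standard set-based definition (the harmonic mean of precision and recall over chosen versus gold skills), might look like the following; the skill names are hypothetical.

```python
def selection_f1(selected: set[str], gold: set[str]) -> float:
    """Set-based F1 between the skills an agent chose and the gold skills."""
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: two gold skills, and the agent swaps one of them
# for a semantically similar distractor.
gold = {"pdf-extract", "csv-merge"}
selected = {"pdf-extract", "pdf-ocr"}
print(selection_f1(selected, gold))  # 0.5
```

Under this definition a similar-looking distractor hurts twice: it lowers precision by being chosen and lowers recall by displacing the gold skill.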

What the paper does not give you is on-policy reinforcement learning or long-term deployment feedback; the limitations section explicitly leaves the RL reward version to future work, and the task inventory is smaller than a real enterprise repository. For teams sitting on growing internal skill libraries, though, the immediate lesson is narrower but still useful: outcome-only training-data curation appears to leave real correctness on the table, and a rubric that grades the process seems to pick better demonstrations than the verifier alone.