paper web signal

Tencent's SkillHone beats deep-research agent by 15.8 GAIA points

TL;DR

  • SkillHone reports a 15.8-point GAIA gain over a commercially-backed deep-research agent and 3.2 points on WebWalkerQA-EN, without a pre-integrated search stack.
  • The mechanism is a persistent log of diagnoses, revisions, evidence, and outcomes that role-separated subagents reuse across sessions.
  • On seven internal tool-mediated scenarios at Tencent, the authors report an average 18.8-point accuracy gain.

Most agent frameworks throw away the reasoning behind every skill revision the moment they accept the next one, which means the next debugging pass starts blind. A new paper from WeChat AI, by Zhiwei Li and Yong Hu at Tencent, argues that just refusing to discard those diagnostic traces is enough to move public benchmarks by a lot.

The system is called SkillHone. It pairs each skill revision with evaluation-side evidence, then hands the structured history of diagnoses, revisions, evidence, and outcomes to role-separated subagents that propose the next change. On GAIA, it reportedly outperforms a commercially-backed deep-research agent by 15.8 points, and on WebWalkerQA-EN by 3.2 points, without a pre-integrated search stack of its own. In seven internal tool-mediated analysis scenarios at Tencent, the authors report an average 18.8-point accuracy gain.

Why this matters if you build agents: the popular framing this year has been that deep research needs a first-class retrieval pipeline stapled to the agent. SkillHone's claim is that a lot of what looks like a retrieval problem is actually a memory problem about the agent's own past mistakes and the rationale behind fixing them. If that holds up, a structured revision log becomes one of the cheapest architectural additions you can bolt onto whatever framework you already run.

The honest caveats are the usual ones for a single-lab benchmark paper. The competitor deep-research agent is not named in the abstract, the seven scenarios are Tencent's own internal eval, and the authors themselves flag that SkillHone currently evolves a single skill in isolation, with coordination across interdependent skills left as future work. Take the 15.8-point delta as reported, not settled.

What is worth watching is whether other teams reproduce the pattern with different base models and on blind evaluations. If they do, the interesting part of the agent stack shifts from which search API you wired up back to what your agent learned last time, and whether you kept it.