huggingface.co web signal

EvoPolicyGym: GPT-5.5 places top-two on all 16 RL tasks


TL;DR

  • EvoPolicyGym has a coding agent repeatedly edit an executable RL policy under a fixed 128-episode interaction budget across 16 Gymnasium-compatible environments.
  • GPT-5.5 obtains the highest Core16 aggregate rank score (0.891) with nine wins and top-two placement on every environment; Claude Opus 4.7 is second (0.750).
  • MiniMax-M3 (0.531) and DeepSeek-V4-Pro (0.359) each win one environment but land near the random-policy anchor on synthesis-dominant tasks, with relative held-out scores of 0.19 and 0.03 respectively.

A new benchmark paper on Hugging Face, EvoPolicyGym, tries to answer a question that has been drifting through the coding-agent literature all year: when a language model is asked to iteratively improve an executable policy under a real feedback budget, how much better is one model actually than another? The authors, from USTC, CUHK, Macau, Tsinghua, Zhejiang, Soochow, Brown and Shanghai Jiao Tong, call the task 'Autonomous Policy Evolution' and pin it down concretely: an agent, meaning a model running inside a coding harness, gets a live workspace, a Python policy entry point, and a fixed 128-episode interaction budget on each of sixteen Gymnasium-style environments.
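To make the budget constraint concrete, here is a minimal sketch of an episode-capped evaluation loop. It assumes only the standard Gymnasium `reset`/`step` signatures; `ToyEnv`, `policy`, and `BudgetedRunner` are hypothetical stand-ins invented for illustration and are not taken from the paper's harness.

```python
import random

EPISODE_BUDGET = 128  # fixed interaction budget per environment, as stated in the paper


class ToyEnv:
    """Hypothetical toy environment that mimics the Gymnasium reset/step signatures."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0
        self.rng = random.Random(0)

    def reset(self, seed=None):
        if seed is not None:
            self.rng = random.Random(seed)
        self.t = 0
        return self.rng.uniform(-1.0, 1.0), {}  # (observation, info)

    def step(self, action):
        self.t += 1
        obs = self.rng.uniform(-1.0, 1.0)
        reward = 1.0 if action == 1 else 0.0
        terminated = False
        truncated = self.t >= self.horizon  # episode ends after `horizon` steps
        return obs, reward, terminated, truncated, {}


def policy(obs):
    """Hypothetical policy entry point; the agent would rewrite this between runs."""
    return 1 if obs >= 0.0 else 0


class BudgetedRunner:
    """Runs episodes and refuses to exceed the episode budget."""

    def __init__(self, env, budget=EPISODE_BUDGET):
        self.env = env
        self.remaining = budget

    def run_episode(self, policy_fn, seed=None):
        if self.remaining <= 0:
            raise RuntimeError("episode budget exhausted")
        self.remaining -= 1
        obs, _ = self.env.reset(seed=seed)
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _ = self.env.step(policy_fn(obs))
            total += reward
            done = terminated or truncated
        return total


runner = BudgetedRunner(ToyEnv())
returns = [runner.run_episode(policy, seed=i) for i in range(EPISODE_BUDGET)]
```

The point of the wrapper is that every episode the agent spends on probing or debugging comes out of the same 128-episode pool, so a 129th call fails instead of silently extending the run.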

The headline result on the Core16 suite is that GPT-5.5 obtains the top aggregate rank score of 0.891, with nine first-place environments and top-two placement on all sixteen. Claude Opus 4.7 sits second at 0.750, with five wins, twelve top-two finishes, and the best MiniGrid family score of 0.938. MiniMax-M3 (0.531) and DeepSeek-V4-Pro (0.359) each win exactly one environment, HalfCheetah and Roundabout respectively. The uniform random-policy anchor scores 0.109. The authors read this as a coverage story rather than a win-count story: GPT-5.5 is the only entry that stays near the top on every task.
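The summary above does not spell out how the aggregate rank score is computed. One plausible reading, offered here as an assumption rather than the paper's definition, is a mean normalized rank over five entries (four agents plus the random anchor), where first place scores 1.0 and last place scores 0.0. Under that assumption, nine firsts and seven seconds reproduce the reported 0.891:

```python
def rank_score(ranks, n_entries):
    """Mean normalized rank: 1.0 for first place, 0.0 for last place.

    This normalization is an assumption made for illustration; the paper's
    exact formula is not given in the summary.
    """
    return sum((n_entries - r) / (n_entries - 1) for r in ranks) / len(ranks)


# GPT-5.5 per the reported results: nine first places, seven second places.
gpt_ranks = [1] * 9 + [2] * 7
score = rank_score(gpt_ranks, n_entries=5)
print(round(score, 3))  # prints 0.891
```

A score built this way rewards consistency: a single last-place finish costs as much as four first places gain over second, which fits the authors' coverage framing.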

What makes the paper more interesting than a straight leaderboard is the mechanism analysis. The authors split environments into synthesis-dominant (where a policy must build perception, memory, or planning machinery) and tuning-dominant (where a plausible controller already exists and improvement is about gains and thresholds). On synthesis-dominant tasks, GPT-5.5 and Claude Opus 4.7 turn structural edits into new validation bests 41% and 48% of the time respectively, while MiniMax-M3 and DeepSeek-V4-Pro land at 10% and 3% and, per the paper, 'solve none of the three locked-door MiniGrid tasks.' The relative held-out scores in that group tell the same story: 0.98 for GPT-5.5, 1.00 for Claude Opus 4.7, 0.19 for MiniMax-M3, and 0.03 for DeepSeek-V4-Pro.
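The conversion rate described above can be sketched as a simple pass over an edit log. The `Edit` record, its `kind` labels, and the `baseline` argument are hypothetical; the paper's actual trajectory format and edit classifier are not described in this summary.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    kind: str         # "structural" or "tuning" (hypothetical labels)
    val_score: float  # validation score measured after the edit


def structural_conversion_rate(edits, baseline):
    """Fraction of structural edits that set a new validation best."""
    best = baseline
    structural = 0
    converted = 0
    for edit in edits:
        is_new_best = edit.val_score > best
        if edit.kind == "structural":
            structural += 1
            if is_new_best:
                converted += 1
        if is_new_best:
            best = edit.val_score  # any edit type can raise the running best
    return converted / structural if structural else 0.0


# Illustrative log with made-up scores, not data from the paper.
log = [
    Edit("structural", 0.5),
    Edit("tuning", 0.6),
    Edit("structural", 0.4),
    Edit("structural", 0.9),
    Edit("tuning", 0.7),
]
rate = structural_conversion_rate(log, baseline=0.2)
```

In this made-up log, two of the three structural edits beat the running best, so the rate is two thirds. A low rate on synthesis-dominant tasks would mean the agent keeps restructuring the policy without ever landing on the machinery the task needs.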

The honest caveat is that this is a harness-and-model comparison, not a pure model comparison: GPT-5.5 runs through the Codex harness while the other three run through Claude Code, and the authors explicitly do not normalize token use, context management, or provider inference defaults. Token accounting is reported only as a diagnostic and excluded from the rank score, so the cost side of 'which agent converts budget into policy' is left open. What the paper does not give you is a comparison against a trained PPO or SAC baseline on the same held-out pools, and the authors note that the 128-episode budget is far below where standard RL methods converge.

For anyone building agent evaluations, the useful export here is the protocol itself: hidden validation checkpoint selection, held-out generalization, and trajectory-level diagnostics that separate 'the model got a high score' from 'the model discovered the right mechanism.'
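As a rough illustration of the first two protocol pieces, the sketch below selects a checkpoint using validation scores only and then reports the held-out score of the chosen checkpoint. The function name, the dictionary fields, and all scores are hypothetical and made up for the example.

```python
def select_and_report(checkpoints):
    """Pick the checkpoint with the best validation score and report its held-out score.

    Held-out scores are never consulted during selection.
    """
    chosen = max(checkpoints, key=lambda c: c["val"])
    return chosen["name"], chosen["heldout"]


# Illustrative checkpoints with made-up scores, not results from the paper.
checkpoints = [
    {"name": "policy_v1", "val": 0.42, "heldout": 0.40},
    {"name": "policy_v2", "val": 0.71, "heldout": 0.55},
    {"name": "policy_v3", "val": 0.68, "heldout": 0.66},
]
name, heldout = select_and_report(checkpoints)
```

Here `policy_v2` is selected even though `policy_v3` would have generalized better. That gap between the validation pick and the best held-out checkpoint is what a held-out pool is meant to expose.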