arxiv.org web signal

New RSI metric picks tokens that matter for RLVR training

TL;DR

  • The Relative Surprisal Index combines a token's entropy with its selected probability to decide which tokens drive RLVR updates.
  • RSI-S filtering reportedly beat baseline GRPO on Qwen2.5-1.5B, 3B and 7B by 2.10 to 3.30 points on AIME and AMC math benchmarks.
  • Response lengths also fell by 108 to 265 tokens across the tested Qwen2.5 sizes, suggesting shorter outputs alongside the accuracy gains.

Two camps have been arguing about token-level RL for reasoning models. One says the updates that matter are on high-entropy tokens, the branching points where the model is genuinely uncertain. The other says the dangerous tokens are the low-probability ones, where the policy is about to do something unstable. A new arXiv preprint from Outongyi Lv and collaborators argues both camps are describing the same underlying signal, and proposes a single metric that captures it.

The metric is the Relative Surprisal Index, which the authors describe as coupling a token's entropy with the probability the policy assigned to the token it actually picked. The filter built on top of it, RSI-S, keeps tokens whose RSI sits in a stable middle band and drops both the redundant low-surprisal ones and the volatile high-surprisal outliers. That is the whole trick. Applied on top of standard GRPO, the paper reports avg@32 accuracy gains of 2.10 points on Qwen2.5-1.5B, 3.30 on 3B, and 2.19 on 7B, tested on AIME and AMC math benchmarks. Response lengths reportedly fell by 108 to 265 tokens at the same time, which is the more useful number if you are paying for inference.
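To make the band-filter idea concrete, here is a minimal sketch in plain Python. It is not the paper's formula: the summary does not give the exact RSI definition, so `relative_surprisal` below is a hypothetical stand-in that divides the sampled token's surprisal by the entropy of the distribution it was drawn from. The band limits `low` and `high`, the toy distributions, and every function name are likewise assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def relative_surprisal(probs, chosen, eps=1e-8):
    """HYPOTHETICAL stand-in for RSI, not the paper's definition:
    surprisal of the sampled token divided by the distribution's entropy."""
    surprisal = -math.log(max(probs[chosen], eps))
    return surprisal / (entropy(probs) + eps)

def band_mask(scores, low, high):
    """Keep tokens whose score lies in the middle band [low, high]."""
    return [low <= s <= high for s in scores]

def masked_token_loss(token_losses, mask):
    """Average per-token losses over the kept tokens only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Toy next-token distributions and the index of the token that was sampled.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident, picked the top token: low score
    [0.40, 0.30, 0.20, 0.10],  # genuine branching point: mid score
    [0.90, 0.05, 0.03, 0.02],  # confident, picked a rare token: high score
]
chosen = [0, 1, 3]

scores = [relative_surprisal(p, c) for p, c in zip(dists, chosen)]
mask = band_mask(scores, low=0.5, high=2.0)  # assumed band limits
loss = masked_token_loss([1.0, 2.0, 3.0], mask)  # placeholder per-token losses
```

With these toy inputs only the middle token survives the filter. In a real GRPO setup the same boolean mask would presumably be applied to the per-token policy-gradient loss, and the band limits would need tuning, which is exactly the sensitivity question the summary leaves open.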

The honest caveat is that this is a preprint evaluated only on math benchmarks. The summary does not say how RSI-S behaves on code, tool use, or longer-horizon reasoning, or how sensitive the results are to the filter's threshold or to entropy calibration in the base model. The competing token-selection papers in this space keep beating each other by two to three points, so treat the specific deltas as a direction of travel rather than a settled ranking.

What is worth watching, if you run RLVR at all, is whether a single information-theoretic metric can quietly replace the growing pile of heuristic token filters people have been bolting onto GRPO. If it holds up outside math, that is a real simplification.

Shared on Bluesky by 2 AI experts