ACOER reportedly cuts reasoning-model tokens by over 60% without losing accuracy
TL;DR
- ACOER reportedly reduces token generation by over 60% while improving overall accuracy compared to the base model on math reasoning benchmarks.
- The authors argue that GRPO's group normalization causes 'reward collapse' when incorrect answers receive continuous length penalties.
- The method isolates brevity bonuses to correct completions and adds dynamic budget normalization plus control-loop penalty adjustments.
Training reasoning models to be shorter without making them dumber is harder than it sounds, and a new paper on arXiv from Jungseob Lee and colleagues argues that a popular shortcut for doing it is structurally broken.
The setup is familiar to anyone who has watched the long-chain-of-thought wave: take a base reasoning model, train it with Group Relative Policy Optimization (GRPO), and bolt on a reward that penalizes long answers so the model learns to be terse. The authors report that this combination frequently triggers what they call reward collapse, where the model's reasoning capability degrades severely. Their diagnosis is mechanical rather than vibes-based. GRPO scores each completion relative to the other completions sampled for the same prompt, normalizing rewards within that group to get advantages. When wrong answers keep getting hit with a continuous length penalty, the resulting advantages diverge in a way that destabilizes optimization. In their words, "methods penalizing the length of incorrect answers are structurally prone to collapse under sustained optimization."
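To make the group-normalization step concrete, here is a minimal sketch in plain Python. It uses the standard mean-and-standard-deviation normalization usually associated with GRPO; the reward function, the `budget` and `alpha` parameters, and the sample lengths are hypothetical and are not taken from the paper. It illustrates one way length-only differences among wrong answers can turn into large advantages, not the authors' own analysis.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def length_penalized_reward(correct, length, budget=1000, alpha=0.5):
    """Hypothetical reward: 1 for correct, 0 for wrong, minus a length penalty applied to both."""
    base = 1.0 if correct else 0.0
    return base - alpha * length / budget

# A group in which every sampled answer is wrong and only the length differs.
lengths = [200, 400, 800, 1000]
rewards = [length_penalized_reward(False, n) for n in lengths]
advantages = group_advantages(rewards)

for n, r, a in zip(lengths, rewards, advantages):
    print(f"len={n:5d}  reward={r:+.2f}  advantage={a:+.3f}")
```

In this toy group the raw rewards differ by only a few tenths, but normalization rescales them to unit spread, so the shortest wrong answer ends up with a clearly positive advantage even though nothing in the group was correct.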
Their fix, which they call ACOER (Adaptive Correct-Only Efficiency Reward), does two things at once. First, it hands out brevity bonuses only for completions that were actually correct, which removes the divergent-advantage trap. Second, because correct-only rewards on their own can push the model into what the paper calls a "stochastic collapse driven by response over-compression", it adds dynamic budget normalization and a control-loop adjustment on the penalty term to keep things stable. The headline result, across what the paper describes as diverse mathematical reasoning benchmarks, is that ACOER "improves overall accuracy compared to the base model while reducing token generation by over 60%."
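The three ingredients named above can be sketched as follows. This is an illustrative guess at the general shape, not the paper's formulas, which the abstract does not give: the function names, the choice of the mean correct length as the budget, the proportional controller, and every constant are assumptions made for this example.

```python
from statistics import mean

def dynamic_budget(correct_lengths, fallback):
    """Hypothetical budget: mean length of the correct completions in the group."""
    return mean(correct_lengths) if correct_lengths else fallback

def correct_only_reward(correct, length, budget, lam):
    """Wrong answers get a flat 0; only correct answers can earn a brevity bonus."""
    if not correct:
        return 0.0
    saved = max(0.0, 1.0 - length / budget)  # fraction of the budget left unused
    return 1.0 + lam * saved

def update_lambda(lam, accuracy, target_accuracy, gain=0.5, lam_max=1.0):
    """Hypothetical proportional controller: ease brevity pressure when accuracy drops."""
    lam = lam + gain * (accuracy - target_accuracy)
    return min(lam_max, max(0.0, lam))

# One toy group of (is_correct, token_length) samples for a single prompt.
group = [(True, 400), (True, 600), (False, 150), (False, 900)]
lam = 0.5

budget = dynamic_budget([n for ok, n in group if ok], fallback=1000)
rewards = [correct_only_reward(ok, n, budget, lam) for ok, n in group]
accuracy = sum(ok for ok, _ in group) / len(group)
lam = update_lambda(lam, accuracy, target_accuracy=0.8)

print("budget:", budget)
print("rewards:", rewards)
print("next lambda:", lam)
```

The point of the sketch is the asymmetry: a short wrong answer and a long wrong answer receive the same reward, so length alone cannot separate them, while the coefficient on the brevity bonus shrinks whenever measured accuracy falls below the target.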
The honest caveat is that only the abstract is publicly visible here, and it stays at the level of "mathematical reasoning benchmarks": it does not name the specific benchmarks, the base model, or the length-penalty baselines used for comparison. The over-60% figure should therefore be read as the authors' claim on their own evaluation suite rather than a settled result. The paper is short, at thirteen pages with three figures and seven tables, and the code is on GitHub, which is the part most practitioners will care about.
If the diagnosis holds up under independent replication, the interesting downstream effect is on serving costs. A real 60-plus percent cut in tokens generated, without an accuracy hit, is the kind of efficiency win that compounds quickly for anyone running reasoning models in production.
Originally reported by arxiv.org
Original headline: Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards