Distribution-wise rewards cut reward hacking on SiT and EDM2
TL;DR
- The paper claims conventional sample-wise rewards drive reward hacking that hurts image diversity, and proposes a distribution-wise batch reward as the fix.
- Reported FID-50K falls from 8.30 to 5.77 on SiT and from 3.74 to 3.52 on EDM2 under distribution-wise rewards.
- A subset-replace strategy is introduced to keep the batch-level reward tractable by updating only a small subset of the reference set.
For anyone poking at reinforcement learning on top of image generators, the failure mode is familiar. Score the model per sample and it happily collapses toward whatever narrow slice the reward function likes, losing diversity and picking up visual artifacts along the way. A new arXiv paper, Optimizing Visual Generative Models via Distribution-wise Rewards, argues the fix is to stop evaluating samples one at a time.
The authors, Ruihang Li, Mengde Xu, Shuyang Gu, Leigang Qu, Fuli Feng, Han Hu and Wenjie Wang, are direct about the failure they are targeting: conventional sample-wise rewards produce "reward hacking that degrades image diversity and introduces visual anomalies." Their alternative is a distribution-wise reward that accounts for the data distribution of a batch of samples, which they say mitigates the mode collapse problem that occurs when all samples optimize towards the same direction independently.
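To make the contrast concrete, here is a toy sketch, not the paper's reward. It assumes a hypothetical per-sample scorer that peaks at a single point, and it stands in for the distribution-wise reward with a negative Fréchet distance between diagonal Gaussians fitted to 2-D "features". Every name, the reference statistics and the batch sizes are illustrative assumptions.

```python
import random
import statistics

def sample_wise_reward(x):
    # Hypothetical per-sample reward: peaks at the origin, the one "mode" the scorer likes.
    return -sum(v * v for v in x)

def diag_gaussian_stats(batch):
    # Per-dimension mean and (population) standard deviation of a batch of feature vectors.
    dims = list(zip(*batch))
    return [statistics.fmean(d) for d in dims], [statistics.pstdev(d) for d in dims]

def distribution_wise_reward(batch, ref_mean, ref_std):
    # Negative Frechet distance between diagonal Gaussians fitted to batch and reference.
    mean, std = diag_gaussian_stats(batch)
    dist = sum((m - rm) ** 2 + (s - rs) ** 2
               for m, rm, s, rs in zip(mean, ref_mean, std, ref_std))
    return -dist

rng = random.Random(0)
# A batch that covers the assumed data distribution, and one that collapsed onto the mode.
diverse = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(512)]
collapsed = [[rng.gauss(0, 0.05), rng.gauss(0, 0.05)] for _ in range(512)]
ref_mean, ref_std = [0.0, 0.0], [1.0, 1.0]  # assumed statistics of the real data

for name, batch in [("diverse", diverse), ("collapsed", collapsed)]:
    per_sample = statistics.fmean([sample_wise_reward(x) for x in batch])
    per_batch = distribution_wise_reward(batch, ref_mean, ref_std)
    print(f"{name:9s} sample-wise={per_sample:.3f} distribution-wise={per_batch:.3f}")
```

In this toy setup the collapsed batch wins on the averaged per-sample score while losing on the batch-level score, which is the failure pattern the authors describe.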
The concrete gains are on FID-50K, where lower is better. On the SiT backbone, the paper reports the score dropping from 8.30 to 5.77 under distribution-wise rewards, a relative reduction of roughly 30%. On the stronger EDM2 model, where there is less headroom, it moves from 3.74 to 3.52, or about 6%. To keep the batch-level reward tractable, the paper introduces a subset-replace strategy that provides reward signals by updating only a small subset of a generated reference set. A separate contribution applies RL to optimize post-hoc model merging coefficients, aimed at the train-inference inconsistency that arises when standard RL practice relies on stochastic differential equation (SDE) sampling.
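One way to read the subset-replace idea is sketched below. This is an interpretation based only on the description above, not the authors' implementation. It assumes the reward for a small group of new samples is the drop in a batch-level distance when those samples are swapped into a fixed reference set. The distance here is a diagonal-Gaussian Fréchet distance on toy 2-D features, and the function names, set sizes and target statistics are all hypothetical.

```python
import math
import random

def frechet_diag(s1, s2, n, tgt_mean, tgt_std):
    # Frechet distance between a diagonal Gaussian (from running sums) and a target.
    total = 0.0
    for a, b, tm, ts in zip(s1, s2, tgt_mean, tgt_std):
        mean = a / n
        var = max(b / n - mean * mean, 0.0)
        total += (mean - tm) ** 2 + (math.sqrt(var) - ts) ** 2
    return total

def subset_replace_reward(ref, new_feats, idx, tgt_mean, tgt_std):
    # Reward = drop in distance when ref[idx] is swapped for new_feats; ref is not mutated.
    n, d = len(ref), len(ref[0])
    # Recomputed here for clarity; a real loop would cache these running sums.
    s1 = [sum(f[j] for f in ref) for j in range(d)]
    s2 = [sum(f[j] ** 2 for f in ref) for j in range(d)]
    before = frechet_diag(s1, s2, n, tgt_mean, tgt_std)
    for i, new in zip(idx, new_feats):
        for j in range(d):  # only the swapped entries touch the statistics
            s1[j] += new[j] - ref[i][j]
            s2[j] += new[j] ** 2 - ref[i][j] ** 2
    after = frechet_diag(s1, s2, n, tgt_mean, tgt_std)
    return before - after

rng = random.Random(0)
tgt_mean, tgt_std = [0.0, 0.0], [1.0, 1.0]  # assumed real-data statistics
# Reference set generated by a model that is already too narrow (std 0.5 instead of 1.0).
ref = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(1024)]
idx = rng.sample(range(len(ref)), 64)  # the small subset to replace

diverse_new = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(64)]
collapsed_new = [[rng.gauss(0, 0.05), rng.gauss(0, 0.05)] for _ in range(64)]

r_diverse = subset_replace_reward(ref, diverse_new, idx, tgt_mean, tgt_std)
r_collapsed = subset_replace_reward(ref, collapsed_new, idx, tgt_mean, tgt_std)
print(f"diverse subset reward:   {r_diverse:+.4f}")
print(f"collapsed subset reward: {r_collapsed:+.4f}")
```

Because only the swapped entries change the running sums, each reward evaluation costs work proportional to the subset size rather than the full reference set once the sums are cached. That is the tractability argument in miniature, under the assumptions stated above.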
The honest caveat is scope. These are FID gains on the SiT and EDM2 backbones, not results on text-to-image generation with prompts and human preference rewards, which is where most of the actual product pressure sits today. The abstract does not settle whether the required batch size remains practical at scale, or how the distribution-wise approach behaves when the reward signal is a noisy learned preference model rather than a clean statistical distance.
Still, the direction is worth watching. If per-sample rewards are the reason RL post-training has been a landmine for image models, teams building on SiT, EDM2 and similar backbones now have a concrete recipe to try, one that aims to keep diversity intact while chasing reward. It is the kind of small structural change that can quietly unblock a lot of downstream work.
Original headline: ICML 2026: Batch-Level Rewards Cure Reward Hacking in Image Generation RL