iRDM Post-Trains FLUX.2 To One Step, Beats 4-Step Teacher
TL;DR
- iRDM reports 1.30 on the SWr14 sliced-Wasserstein score on ImageNet-256, down from the prior one-step state of the art of 2.05; real data scores 1.0.
- A one-step FLUX.2 [klein] post-trained with iRDM scores 0.826 on GenEval vs 0.794 for the four-step teacher, in about 90 H200 GPU-hours.
- Generated batch sizes above 2048 and matching against 14 frozen encoders are the two design choices the authors credit for the result.
The interesting result in this Hugging Face paper is not the leaderboard number; it is that a one-step image generator apparently beats its four-step teacher on the teacher's own turf. Researchers at EPFL and Valeo.ai propose Representation Distribution Matching (RDM), a training objective that aligns the distribution of features from generated images with the distribution of features from real images under frozen pretrained encoders, without an online teacher, adversary, or trajectory simulation.
The specifics are worth pinning down. On ImageNet-256, the improved version they call iRDM lands at 1.30 on SWr14, a sliced-Wasserstein score they define over 14 frozen encoders, 4 of which are held out from training to resist gaming. Real validation data scores 1.0 by construction, and they report the prior one-step state of the art at 2.05. PickScore, a human-preference proxy the objective never optimizes, prefers iRDM samples over the prior best one-step generator on 71.2% of matched pairs, and prefers them over real photos 63.6% of the time. For text-to-image, they post-train the four-step FLUX.2 [klein] into a one-step generator and report 0.826 on GenEval against 0.794 for the four-step version, and 22.76 vs 22.58 on PickScore, in about 90 H200 GPU-hours.
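For readers unfamiliar with the metric family, a sliced-Wasserstein distance between two sets of encoder features can be sketched as below. This is a generic Monte Carlo estimator, not the paper's SWr14 definition, which is not spelled out here. The function names, the number of projections, and the real-versus-real normalization in `relative_score` (one plausible way to make real data score 1.0) are all assumptions made for illustration.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between two feature sets.

    x, y: arrays of shape (n, d) with the same number of rows.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random unit directions onto which the features are projected.
    dirs = rng.normal(size=(d, n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # In one dimension, optimal transport between equal-sized samples
    # simply pairs the sorted values.
    px = np.sort(x @ dirs, axis=0)
    py = np.sort(y @ dirs, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))

def relative_score(gen, real_a, real_b, **kwargs):
    # Hypothetical normalization: generated-to-real distance divided by the
    # distance between two independent real splits, so real data scores 1.0.
    num = sliced_wasserstein(gen, real_a, **kwargs)
    den = sliced_wasserstein(real_b, real_a, **kwargs)
    return num / den
```

Under this assumed normalization, passing a second real split as `gen` returns exactly 1.0, and a generator whose features drift from the real distribution scores above 1.0. A multi-encoder score would repeat this per encoder and aggregate, though the aggregation rule is not given in the reporting.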
The mechanics are the unusual part. They rehabilitate the classical Maximum Mean Discrepancy, an objective people had mostly written off for image generation, and estimate it with exact within-batch repulsion plus a Nyström-approximated reference mean. The generated batch size turns out to be the operative knob, with an optimum above 2048, far beyond customary batch sizes. And because any single encoder can be gamed, with its score driven below the real-data score while the images stay visibly fake, they match against a balanced battery of encoders and evaluate on encoders the training loss never touched.
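To make the two terms concrete, here is a minimal NumPy sketch of an MMD-style generator loss on a single encoder's features: an exact repulsion term over the generated batch, and an attraction term toward the real data whose kernel mean is approximated with Nyström landmarks. It is an illustration under stated assumptions, not the authors' estimator. The Gaussian kernel, the bandwidth, the ridge term, the random landmark choice, and every function name are hypothetical.

```python
import numpy as np

def rbf(a, b, bandwidth=1.0):
    # Gaussian kernel matrix between the rows of a and the rows of b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def nystrom_reference_weights(real, landmarks, bandwidth=1.0, ridge=1e-6):
    # Precompute alpha so that mean_y k(x, y) ~= k(x, landmarks) @ alpha.
    # This compresses the whole real dataset into len(landmarks) weights.
    k_zz = rbf(landmarks, landmarks, bandwidth)
    m = rbf(landmarks, real, bandwidth).mean(axis=1)
    return np.linalg.solve(k_zz + ridge * np.eye(len(landmarks)), m)

def mmd_generator_loss(gen, landmarks, alpha, bandwidth=1.0):
    # Squared MMD up to the real-real term, which is constant
    # with respect to the generator.
    n = len(gen)
    k_gg = rbf(gen, gen, bandwidth)
    # Exact within-batch repulsion: mean kernel value over distinct pairs.
    repulsion = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    # Attraction toward the Nystrom-approximated real kernel mean.
    attraction = (rbf(gen, landmarks, bandwidth) @ alpha).mean()
    return repulsion - 2.0 * attraction
```

The repulsion term averages over all distinct pairs in the generated batch, so its variance falls as the batch grows, which is at least consistent with the reported preference for very large generated batches. In a real training loop the same computation would be written in an autodiff framework and summed over the training encoders.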
The honest caveats: the SWr14 real-data floor is 1.0, so 1.30 is closer, not solved. The gains against FLUX.2 [klein] are on GenEval and PickScore, both proxies rather than end-user studies. What the reporting does not give you is inference-time cost on production hardware, released code or weights, or whether the recipe holds on larger FLUX variants or on video. If it generalizes, the payoff is real: text-to-image models that are genuinely one step at serve time, at a fraction of the usual quality penalty, for anyone who does not want to pay for four passes per sample.
Originally reported by huggingface.co
Original headline: HF Paper: 'Representation Distribution Matching' Cuts One-Step Visual Generation Gap With Multi-Step Diffusion