FID Scores Carry Hidden Training-Seed Randomness, Study Finds
TL;DR
- Retraining a model with a different seed moves its FID score 3.2x more than resampling from a fixed trained network.
- The FID coefficient of variation (CoV) stays within a 1-2% band even as compute or model size increases.
- The authors recommend treating any FID gap smaller than the measured CoV of roughly 1.3% as inconclusive and requiring multi-seed error bars.
Fréchet Inception Distance is the number that decides which image generation model wins, which paper gets accepted, and which lab claims the state of the art. A new paper on arXiv by Nicolas Dufour, Alexei A. Efros, and Patrick Pérez argues that this number has a serious randomness problem baked into how models are trained, and that the field has been largely ignoring it.
The team trained hundreds of networks on ImageNet 256x256, treating FID not as a fixed score but as a random variable across different training and generation seeds. The headline finding: retraining the same model with the same recipe but a different seed moves FID 3.2x more than simply resampling from a fixed trained network. Three factors drive that gap: random initialization, data ordering, and per-step Gaussian noise in the flow-matching loss. Scaling does not rescue the situation. Increasing compute or model size barely tightens the spread, and the FID coefficient of variation (the standard deviation divided by the mean) stays inside a 1-2% band regardless.
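As a rough illustration of the two statistics quoted above, the sketch below computes a coefficient of variation and a spread ratio from per-seed FID scores. It is not the authors' code: the function names are hypothetical, the input numbers are invented placeholders rather than results from the paper, and expressing the "3.2x" comparison as a ratio of standard deviations is an assumption, since the article does not say how the paper defines it.

```python
import statistics


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean, as a fraction (0.013 = 1.3%)."""
    return statistics.stdev(scores) / statistics.fmean(scores)


def spread_ratio(train_seed_scores, sample_seed_scores):
    """Ratio of the spread across training seeds to the spread across sampling seeds.

    Assumption: "spread" is measured as the sample standard deviation.
    """
    return statistics.stdev(train_seed_scores) / statistics.stdev(sample_seed_scores)


# Invented placeholder FID values, for illustration only (not from the paper).
retrained = [10.0, 10.2, 9.8, 10.1, 9.9]       # same recipe, different training seeds
resampled = [10.00, 10.05, 9.95, 10.02, 9.98]  # one trained network, different sampling seeds

print(f"CoV across training seeds: {coefficient_of_variation(retrained):.2%}")
print(f"Spread ratio (train / sample): {spread_ratio(retrained, resampled):.1f}x")
```

Using the sample standard deviation (`statistics.stdev`) rather than the population version is a deliberate choice here, because a handful of seeds is a small sample of all possible training runs.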
The practical implication is uncomfortable. A lucky training seed can reach the same FID with as little as half the compute of an unlucky one. That means leaderboard gaps that look like genuine capability differences could be, at least in part, the model equivalent of winning a lottery draw. The authors recommend treating any FID gap below the empirically measured coefficient of variation of roughly 1.3% as inconclusive, and reporting an error bar over several training seeds rather than a single best number. Classifier-free-guidance optimization halves the spread, though it changes which seeds perform best.
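The recommendation can be sketched as a simple check. The code below is a minimal illustration under stated assumptions, not the paper's protocol: the helper names are hypothetical, the gap is normalized by the average of the two scores (the article does not specify the normalization), and the 1.3% default is the approximate figure quoted above for ImageNet 256x256.

```python
import statistics

COV_THRESHOLD = 0.013  # roughly 1.3%, as quoted in the article


def relative_gap(fid_a, fid_b):
    """Absolute FID difference divided by the average of the two scores (assumed normalization)."""
    return abs(fid_a - fid_b) / ((fid_a + fid_b) / 2)


def is_inconclusive(fid_a, fid_b, threshold=COV_THRESHOLD):
    """True when the gap is smaller than the seed-to-seed coefficient of variation."""
    return relative_gap(fid_a, fid_b) < threshold


def mean_and_error_bar(per_seed_scores):
    """Mean and sample standard deviation over several training seeds."""
    return statistics.fmean(per_seed_scores), statistics.stdev(per_seed_scores)


# Invented placeholder scores, for illustration only (not from the paper).
mean, err = mean_and_error_bar([10.0, 10.2, 9.8])
print(f"FID = {mean:.2f} +/- {err:.2f} over 3 seeds")
print("10.0 vs 10.1 inconclusive?", is_inconclusive(10.0, 10.1))
```

A stricter comparison would test the two models' multi-seed distributions against each other rather than two single scores; the threshold check is only the quickest reading of the advice.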
The honest caveat is that these findings come from one setting, ImageNet 256x256, and the paper does not clearly say how many training seeds are sufficient in practice for reliable error bars, or whether other common metrics face the same problem. For teams without large compute budgets, the multi-seed protocol may be aspirational rather than immediately practical.
For practitioners and reviewers, the 1.3% threshold is the most immediately actionable output: if the gap between two models is smaller than that, the ranking is on shaky ground, and any claim built on it deserves scrutiny.
Paper links: kyutai.org/fid-lottery/ arxiv.org/abs/2606.20536
Originally reported by arxiv.org
Original headline: The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation