TD Objective Improves Diffusion Model FID at Few Sampling Steps
TL;DR
- The paper introduces a temporal difference objective that penalises inconsistency across the full denoising trajectory rather than only at adjacent time steps.
- It reframes diffusion as a Markov reward process and denoising as a policy evaluation problem, unifying discrete-time and continuous-time formulations.
- Reported FID gains are strongest when the number of sampling steps is small, the regime where few-step samplers and low-compute serving live.
A paper posted to arXiv on June 13 takes one of the most useful ideas in reinforcement learning, temporal difference (TD) learning, and uses it as an extra training signal for diffusion models. The authors argue that standard diffusion training only checks predictions at individual time steps or adjacent pairs, which leaves the model free to drift in inconsistent ways across the full denoising trajectory. Their proposal is a TD objective that penalises that drift.
The authors treat the diffusion process as a Markov reward process and the denoising network as performing policy evaluation, the well-studied reinforcement learning problem of estimating long-run value. That reframing is what lets them write down a single TD-style loss that works for both discrete-time and continuous-time diffusion formulations, rather than needing a separate trick for each.
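For readers who have not met policy evaluation before, a minimal sketch of generic TD(0) on a toy chain-shaped Markov reward process may help. This is the textbook update, not the paper's objective: the chain, the constant reward of -1, and names such as `td0_policy_evaluation` and `reward_fn` are illustrative assumptions, and the paper's mapping from denoising steps to states is not reproduced here.

```python
import random


def td0_policy_evaluation(num_states, reward_fn, gamma=1.0, alpha=0.1,
                          episodes=2000, seed=0):
    """Estimate state values of a chain-shaped Markov reward process with TD(0).

    States 0..num_states-1 each move deterministically to the next state;
    index num_states is terminal and its value stays at 0.
    """
    rng = random.Random(seed)
    values = [0.0] * (num_states + 1)
    for _ in range(episodes):
        state = rng.randrange(num_states)  # start anywhere on the chain
        while state < num_states:
            next_state = state + 1
            reward = reward_fn(state)
            # TD target: immediate reward plus discounted estimate of the next state.
            td_target = reward + gamma * values[next_state]
            # Move the current estimate a fraction alpha toward the target.
            values[state] += alpha * (td_target - values[state])
            state = next_state
    return values


# Toy setup (assumed for illustration): 5 steps, reward -1 per step.
values = td0_policy_evaluation(num_states=5, reward_fn=lambda s: -1.0)
```

Each update pulls a state's estimate toward the reward plus the estimate of the state that follows it, which is the sense in which TD methods tie predictions together along a trajectory. On this toy chain the estimates should settle near the negated number of steps remaining.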
The practical significance lies in the sampling budget. The biggest cost of using a diffusion model at inference is the number of denoising steps you have to run, and a lot of recent work has been chasing few-step samplers that can produce a decent image in a handful of passes instead of dozens. The authors report that their TD training improves FID, with the gains strongest when the number of sampling steps is small. If that holds up under independent replication, a checkpoint trained this way could be served with fewer steps, and therefore more cheaply.
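The cost argument is simple enough to sketch: in a basic sampler, each denoising step is one forward pass of the network, so the step count sets the inference bill. The toy loop below uses a hypothetical `toy_denoiser` stub and an arbitrary interpolation update, neither of which comes from the paper, purely to count calls.

```python
def count_denoiser_calls(num_steps):
    """Run a toy one-call-per-step sampler and count denoiser evaluations."""
    calls = 0

    def toy_denoiser(x, t):
        # Hypothetical stand-in for a neural network forward pass.
        nonlocal calls
        calls += 1
        return x * (1.0 - t)  # arbitrary placeholder "clean sample" prediction

    x = 1.0  # scalar stand-in for a noisy sample
    for i in range(num_steps):
        t = 1.0 - i / num_steps             # current noise level, from 1 toward 0
        t_next = 1.0 - (i + 1) / num_steps  # noise level after this step
        x0_hat = toy_denoiser(x, t)
        # Move part of the way toward the predicted clean sample.
        x = x0_hat + (t_next / t) * (x - x0_hat)
    return calls
```

Under these assumptions a 4-step run makes 4 denoiser calls and a 50-step run makes 50, so inference cost falls roughly in proportion to the steps removed. Samplers that evaluate the network more than once per step would scale differently.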
The honest caveat is what the paper does not hand a casual reader. The abstract describes the FID improvements as significant, but the public page does not put numbers next to specific datasets or baselines. The authors also include ablations on pairwise loss reweighting, regularization weight, and one-step stride, which suggests the recipe is sensitive to tuning rather than fully drop-in, despite the framing. There is also no head-to-head shown against the now-large family of consistency-model approaches that target the same few-step regime.
If the method really is a general drop-in, the people who benefit most are teams running diffusion models in latency- or cost-constrained settings, anywhere shaving sampling steps actually moves the needle. That is the part worth watching as the code and any follow-up benchmarks land.
Shared on Bluesky by 2 AI experts
- Temporal Difference Learning for Diffusion Models (ICML 2026) arxiv.org/abs/2606.15048 By Yangchen Pan (my former PhD student) and co-authors. It reformulates diffusion training as a Markov reward process and introduce…
Originally reported by arxiv.org
Original headline: Temporal Difference Learning for Diffusion Models