huggingface.co web signal

HKUST-Baidu paper reframes Self-Flow as data augmentation

TL;DR

  • Paper argues that the gain from Self-Flow's dual-timestep training of diffusion transformers comes from data augmentation along the noise dimension, not from cleaner-to-noisier token interaction.
  • Attention Separation blocks cross-noise-level token attention, yet FID improves from 25.19 to 25.06 (lower is better) and IS rises from 66.75 to 72.94 at 800K iterations on SiT-B.
  • Their SiT-XL/2 recipe reaches FID 1.44 on ImageNet 256x256 in 4M steps, versus vanilla SiT-XL/2 at FID 2.06 with 7M steps.

A new paper from HKUST, Zhejiang University of Technology, and Baidu asks why a specific tweak to diffusion transformer training actually works. The prior story, from Self-Flow, was that mixing tokens at two noise levels in the same image forces the model to use cleaner tokens to help denoise noisier ones, and that this token-to-token guidance is what produces the gains over the older SRA baseline. According to the Hugging Face paper page, the authors argue that this self-supervision explanation is not what carries the gains.

Their diagnostic is a technique they call Attention Separation. It keeps Self-Flow's dual-noise input but blocks tokens at one noise level from attending to tokens at the other. If interaction were doing the work, blocking it should hurt. It doesn't. In their controlled ablations on ImageNet 256x256 with SiT-B, cutting the interaction actually nudges the numbers the right way, reducing FID from 25.19 to 25.06 and increasing IS from 66.75 to 72.94 at 800K training iterations. Their reading is that the real gain comes from exposing the model to more noise-state variants of the same image, which is data augmentation dressed up as self-supervision.
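The blocking step described above can be illustrated with a block attention mask. What follows is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `separation_mask`, `masked_attention`, and `group_ids` are hypothetical, the attention is single-head, and the token-to-group assignment is made up for the example.

```python
import numpy as np

def separation_mask(group_ids: np.ndarray) -> np.ndarray:
    """allowed[i, j] is True only when tokens i and j share a noise level."""
    return group_ids[:, None] == group_ids[None, :]

def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention with disallowed pairs masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so they receive exactly zero weight after softmax.
    scores = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row keeps at least
    # its own diagonal entry, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_tokens, dim = 8, 16
# Hypothetical assignment: 0 = tokens at the first noise level, 1 = the second.
group_ids = np.array([0, 0, 1, 0, 1, 1, 0, 1])
q, k, v = (rng.standard_normal((num_tokens, dim)) for _ in range(3))

allowed = separation_mask(group_ids)
out, weights = masked_attention(q, k, v, allowed)
```

With this mask, each noise-level group is processed as if the other group were absent from the sequence, so any remaining benefit of the dual-noise input cannot come from cross-level attention.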

This matters for anyone training these models because representation alignment is one of the standard tricks for speeding up diffusion transformer training, and until now the leading self-alignment recipes came bundled with a mechanistic story about token interactions. If the paper is right, you can drop that story, keep the noise-level augmentation, and get a simpler training loop. Combined with Attention Separation, their SiT-XL/2 recipe reaches FID 1.44 on ImageNet 256x256 in 4M steps, versus the vanilla SiT-XL/2 baseline's 2.06 at 7M steps, and matches REPA's FID 2.08 on 512x512 without relying on an external pretrained encoder like DINOv2.
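To put the reported SiT-XL/2 numbers on a common scale, here is a small arithmetic sketch. It uses only the figures quoted above; the helper `relative_change` is a hypothetical name, and the comparison assumes that step counts are a fair proxy for training cost, which the summary does not confirm.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before

# Figures as quoted above for ImageNet 256x256.
fid_change = relative_change(2.06, 1.44)   # vanilla SiT-XL/2 vs. the paper's recipe
step_change = relative_change(7e6, 4e6)    # 7M steps vs. 4M steps

print(f"FID change:  {fid_change:+.1%}")
print(f"Step change: {step_change:+.1%}")
```

On these numbers the recipe reports roughly 30% lower FID with roughly 43% fewer training steps than the vanilla baseline.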

The honest caveat is that this is an interpretation paper first, and the validation is on class-conditioned ImageNet with SiT backbones. Self-Flow's original scope reportedly stretched to text-to-image, text-to-video, and text-to-audio, and this paper does not retest that. There is also a practical gotcha in the ablations: at mask ratio 0.50, Attention Separation on its own pushes FID at 800K iterations to 38.19 unless you mix in full-image single-timestep samples, so the design still needs tuning. What the reporting does not give you is a head-to-head against the strongest external encoders at larger scale, so take the parity claim as a same-condition result rather than a general one.
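One way to picture the mixing described in that gotcha is a per-sample choice between a dual-timestep layout and a full-image single-timestep layout. The sketch below is an assumption-heavy illustration, not the paper's sampler: `sample_token_timesteps` and `p_full` are hypothetical names, timesteps are drawn uniformly from [0, 1), and the summary does not say how often full-image samples are mixed in.

```python
import numpy as np

def sample_token_timesteps(num_tokens, mask_ratio, p_full, rng):
    """Return per-token timesteps and a 0/1 noise-level group id per token.

    With probability p_full the whole image shares one timestep (all group ids 0).
    Otherwise round(mask_ratio * num_tokens) randomly chosen tokens get a second,
    independently drawn timestep (group id 1).
    """
    t_first, t_second = rng.random(2)  # assumption: uniform timesteps in [0, 1)
    group_ids = np.zeros(num_tokens, dtype=np.int64)
    if rng.random() >= p_full:
        num_second = int(round(mask_ratio * num_tokens))
        chosen = rng.choice(num_tokens, size=num_second, replace=False)
        group_ids[chosen] = 1
    timesteps = np.where(group_ids == 1, t_second, t_first)
    return timesteps, group_ids

rng = np.random.default_rng(0)
# Hypothetical settings: 16 tokens, mask ratio 0.50, 25% full-image samples.
timesteps, group_ids = sample_token_timesteps(16, 0.50, 0.25, rng)
```

The returned `group_ids` are what an attention mask would consume when separating the two noise levels; for full-image samples every token lands in the same group, so attention is unrestricted.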

The upside worth watching is for teams training DiTs on modest budgets: no external encoder dependency, no need to buy the self-supervision explanation to reproduce the gains, and a small augmentation you can slot into an existing self-alignment codebase.