HKUST's LISA Regularization Speeds Up ControlNet Training 2.78x
TL;DR
- LISA adds a lightweight auxiliary loss during ControlNet-style training that explicitly aligns the side network's features with an approximated likelihood score, requiring no external encoders.
- LISA achieves more than 2.78x faster convergence on ControlNet with zero added inference cost, since the auxiliary decoder is dropped entirely at test time.
- Applied to pose-guided video generation, LISA improved pose accuracy from 30.22% to 57.00% and reduced Frechet Video Distance from 10.57 to 7.85 at 5K training iterations.
The problem LISA solves is surprisingly basic: nobody knew quite why ControlNet works as well as it does. The dual-branch design -- a frozen pretrained diffusion model as the main network, a trainable side network that encodes visual conditions like pose maps or depth maps -- has been the dominant approach to controllable image generation for years. But the side network's role was, according to the researchers, underexplored. It just worked, without a principled account of what it was doing.
Researchers from HKUST and Huawei Research, writing in a paper posted on Hugging Face, now offer that account. In their score-based analysis, the frozen main network predicts the unconditional score, and the side network implicitly provides the residual: the likelihood score, which steers generation toward the actual condition. The standard training objective supervises only the final output, leaving this role of the side network implicit and, the authors argue, harder to learn efficiently.
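In standard score-based notation, this decomposition is Bayes' rule applied to the log-density of a noisy sample. The symbols below ($x_t$ for the noisy sample at timestep $t$, $c$ for the visual condition) are chosen here for illustration and may not match the paper's notation:

$$
\nabla_{x_t} \log p_t(x_t \mid c) \;=\; \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{unconditional score (frozen main network)}} \;+\; \underbrace{\nabla_{x_t} \log p_t(c \mid x_t)}_{\text{likelihood score (side network)}}
$$

The identity follows from $\log p_t(x_t \mid c) = \log p_t(x_t) + \log p_t(c \mid x_t) - \log p(c)$, where the last term does not depend on $x_t$ and so vanishes under the gradient.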
LISA fixes that with a small auxiliary loss. During training, a lightweight decoder -- adding about 0.1% more parameters -- hooks into an intermediate layer of the side network and projects those features into the score space. An approximated likelihood score target is constructed using the frozen main network itself, so no external encoders are needed. At inference, the decoder is dropped entirely, so deployment cost is unchanged.
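The shape of such an auxiliary loss can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not say how the target is approximated, so the sketch assumes it is the residual between the true noise and the frozen main network's unconditional noise prediction, expressed in noise-prediction space. The linear decoder, the function names, and the `aux_weight` value are all hypothetical.

```python
import numpy as np


def approx_likelihood_score_target(noise, eps_uncond):
    """ASSUMPTION: the target is the part of the true noise that the frozen
    main network's unconditional prediction does not explain."""
    return noise - eps_uncond


def auxiliary_alignment_loss(side_features, decoder_weight, target):
    """Project intermediate side-network features with a linear decoder
    (a hypothetical stand-in for the lightweight decoder) and compare the
    projection to the target with mean squared error."""
    projected = side_features @ decoder_weight
    return float(np.mean((projected - target) ** 2))


def total_loss(denoising_loss, aux_loss, aux_weight=0.1):
    """Usual output loss plus the weighted auxiliary term.
    The weight value is hypothetical."""
    return denoising_loss + aux_weight * aux_loss


# Toy tensors standing in for real network outputs.
rng = np.random.default_rng(0)
batch, feat_dim, latent_dim = 4, 16, 8
noise = rng.normal(size=(batch, latent_dim))         # noise added to the clean latent
eps_uncond = rng.normal(size=(batch, latent_dim))    # frozen main network prediction (stand-in)
side_features = rng.normal(size=(batch, feat_dim))   # intermediate side-network features (stand-in)
decoder_weight = rng.normal(size=(feat_dim, latent_dim)) * 0.1  # training-only decoder

target = approx_likelihood_score_target(noise, eps_uncond)
aux = auxiliary_alignment_loss(side_features, decoder_weight, target)
print(total_loss(denoising_loss=1.0, aux_loss=aux))
```

Because `decoder_weight` appears only in the loss, discarding it after training leaves the side network's forward pass untouched, which is why a scheme of this kind adds nothing at inference.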
Results are reported across pose, depth, segmentation, and video tasks on multiple architectures. In early training for pose-conditioned generation, PCK (Percentage of Correct Keypoints, a keypoint accuracy metric) improved from 19.38 to 83.02 when LISA was applied to ControlNet. The authors report more than 2.78x faster convergence overall, and that LISA at 4K training iterations outperforms vanilla ControlNet at 10K. For pose-guided video generation, LISA reduced Frechet Video Distance from 10.57 to 7.85 and improved pose accuracy from 30.22% to 57.00% at 5K iterations. Training adds about 0.2 seconds per iteration on 8 H20 GPUs, and model size grows from 364.2M to 364.6M parameters.
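A quick arithmetic check on the reported figures, using only the numbers quoted above (the helper name is made up for this sketch). Note that the 10K-versus-4K iteration ratio is a separate comparison from the 2.78x convergence figure, whose derivation is not given here.

```python
def relative_change_pct(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Model size in millions of parameters, without and with the training-only decoder.
param_overhead_pct = relative_change_pct(364.2, 364.6)

# Frechet Video Distance at 5K iterations (lower is better).
fvd_change_pct = relative_change_pct(10.57, 7.85)

# Pose accuracy at 5K iterations, in percentage points.
pose_gain_points = 57.00 - 30.22

# LISA at 4K iterations is reported to beat vanilla ControlNet at 10K.
iteration_ratio = 10_000 / 4_000

print(f"parameter overhead: {param_overhead_pct:.2f}%")
print(f"FVD change: {fvd_change_pct:.1f}%")
print(f"pose accuracy gain: {pose_gain_points:.2f} points")
print(f"iteration ratio: {iteration_ratio:.2f}x")
```

The parameter overhead comes out at roughly a tenth of a percent, consistent with the "about 0.1%" figure for the decoder.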
The honest caveat is that the experiments run at modest iteration counts, and whether the convergence advantage holds at much longer training horizons is not shown. The composability result -- where LISA's more disentangled features allow better combination of multiple conditions -- covers only two conditions at once. Still, for anyone running ControlNet-style adapter training, a method that meaningfully shortens those runs without touching inference is worth a look.
Originally reported by huggingface.co
Original headline: LISA: Likelihood Score Alignment Regularization Accelerates Training and Improves Quality Across Controllable Image and Video Generation Tasks