NVIDIA and Tsinghua Open 2-Step Streaming Video Distillation

TL;DR

  • Causal-rCM distills video diffusion to 1-2 sampling steps with a VBench-T2V score of 84.63 on Wan2.1-1.3B.
  • The teacher-forcing consistency model stage converges 10 times faster than discrete-time consistency models.
  • The full training recipe is released publicly and has been applied to NVIDIA's Cosmos 3 world model for autonomous driving.

Video diffusion models typically require dozens of denoising steps to produce coherent output, which makes real-time or interactive generation impractical. A paper posted to arXiv on June 24, 2026, by researchers at Tsinghua University, NVIDIA, and UT Austin introduces Causal-rCM, which the authors describe as "a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training" targeting autoregressive streaming video.

The core technical move is extending the earlier rCM framework, which targeted bidirectional rather than autoregressive video models, to causal diffusion transformers that generate video frame by frame. The approach uses a three-stage pipeline: teacher-forcing consistency model adaptation, teacher-forcing distillation via a custom-mask FlashAttention-2 JVP kernel, and self-forcing distribution matching as a refinement step. According to the paper, the teacher-forcing stage converges 10 times faster than discrete-time consistency models.
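
To make the causal constraint concrete, here is a minimal sketch of a frame-wise block-causal attention mask in plain Python. It illustrates the general idea only and is not the paper's kernel: the function name `block_causal_mask` and the parameters `num_frames` and `tokens_per_frame` are hypothetical, and the actual mask used in Causal-rCM's teacher-forcing stages may well differ.

```python
def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens are grouped into frames of equal size. A token sees every token in
    its own frame and in all earlier frames, and nothing in later frames.
    This is a generic block-causal pattern, assumed here for illustration.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Allow attention only to keys whose frame index is not in the future.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Illustrative sizes only: 3 frames with 2 tokens each.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Run as a script, this prints a staircase of allowed positions: the first two rows read `xx....`, the next two `xxxx..`, and the last two `xxxxxx`. A bidirectional model would fill the whole grid, which is the difference a causal recipe has to accommodate.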

The paper reports concrete numbers. The distilled Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 using only 1 or 2 sampling steps. In 1-step frame-wise mode it runs at 15.9 FPS with a first-chunk latency of 0.40 seconds, and in 2-step chunk-wise mode it reaches 22.2 FPS. Training relies entirely on synthetic data generated by the larger Wan2.1-14B teacher model. The team also applied Causal-rCM to Cosmos 3, NVIDIA's multimodal world foundation model, enabling action-conditioned streaming generation for autonomous driving scenarios.
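
The throughput figures translate into per-frame time budgets with simple arithmetic. The sketch below is a back-of-the-envelope helper, not anything from the paper: `frame_budget_ms` is a hypothetical name, and the only inputs are the FPS values reported above.

```python
def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time per generated frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # FPS values as reported for the distilled Wan2.1-1.3B model.
    reported = {"1-step frame-wise": 15.9, "2-step chunk-wise": 22.2}
    for mode, fps in reported.items():
        print(f"{mode}: {frame_budget_ms(fps):.1f} ms per frame")
```

At the reported rates this works out to roughly 62.9 ms per frame in 1-step frame-wise mode and 45.0 ms per frame in 2-step chunk-wise mode. These are steady-state averages. The 0.40-second first-chunk latency is a separate startup figure and is not derived from them.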

The main caveat is that the FPS and latency numbers do not specify the GPU hardware they were measured on, which matters considerably for assessing deployment feasibility. Training only on synthetic data from a 14-billion-parameter teacher also leaves open whether the quality ceiling reflects the teacher's own biases rather than the distillation method. The paper describes the Cosmos 3 autonomous driving application qualitatively but does not report quantitative results for it.

What makes this worth tracking is the public recipe. Teams with access to a capable teacher model can now attempt to apply the same pipeline to their own base models, and that kind of openness tends to determine how quickly a capability spreads beyond the labs that first demonstrated it.