LiveEdit Hits 12.66 FPS on Streaming Diffusion Video Edits
TL;DR
- LiveEdit reports 12.66 FPS for streaming video editing via a three-stage distillation pipeline from a bidirectional foundation model to a unidirectional streaming editor.
- An AR-oriented mask cache reuses region-related computation across frames, reducing redundant processing during inference.
- The authors establish a dedicated benchmark for streaming video editing and claim state-of-the-art visual quality among streaming baselines.
Real-time diffusion video editing has been the obvious next step for a while. The obvious obstacle is that diffusion is slow, and editing video adds the constraint that the parts you did not ask to change must actually stay put. A new arXiv preprint called LiveEdit, posted on 25 June by Xinyu Wang and collaborators, claims to push streaming video editing to 12.66 FPS, which the authors frame as fast enough for interactive and augmented reality applications.
The approach is a three-stage distillation pipeline. The authors take a powerful bidirectional foundation model, the kind that can see the whole clip when it edits, and progressively compress its capability into an efficient unidirectional streaming editor that processes video frame by frame as it arrives. To squeeze runtime further they introduce what they call an AR-oriented mask cache that reuses region-related computation across frames, so the system is not redoing work on regions the user did not ask to change. The paper also ships a dedicated benchmark for streaming video editing and reports state-of-the-art visual quality on it against streaming baselines.
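To make the caching idea concrete, here is a toy sketch of region-level reuse across frames. It is not the paper's AR-oriented mask cache, whose internals the abstract does not spell out. The `RegionCache` class, the tile representation, and the `fake_denoise` stand-in are all hypothetical names invented for illustration. The sketch assumes a frame is split into fixed tiles and that a tile outside the edit mask can reuse its previous result whenever its input is unchanged.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a patch of pixels or latents.
Tile = Tuple[int, ...]


class RegionCache:
    """Toy per-tile cache; an illustration, not LiveEdit's actual mechanism."""

    def __init__(self, compute: Callable[[Tile], Tile]) -> None:
        self.compute = compute
        # tile index -> (input tile last seen, output computed for it)
        self.store: Dict[int, Tuple[Tile, Tile]] = {}
        self.calls = 0  # how many times the expensive function actually ran

    def process_frame(self, tiles: List[Tile], edit_mask: List[bool]) -> List[Tile]:
        outputs: List[Tile] = []
        for index, (tile, is_edited) in enumerate(zip(tiles, edit_mask)):
            cached = self.store.get(index)
            # Reuse only when the tile is outside the edit region
            # and its input matches what was cached.
            if not is_edited and cached is not None and cached[0] == tile:
                outputs.append(cached[1])
                continue
            self.calls += 1
            result = self.compute(tile)
            self.store[index] = (tile, result)
            outputs.append(result)
        return outputs


def fake_denoise(tile: Tile) -> Tile:
    """Placeholder for an expensive per-region model call."""
    return tuple(255 - value for value in tile)


cache = RegionCache(fake_denoise)
mask = [True, False, False, False]  # only tile 0 is being edited

frame_a = [(1, 2), (3, 4), (5, 6), (7, 8)]
frame_b = [(9, 9), (3, 4), (5, 6), (7, 8)]  # background tiles unchanged

out_a = cache.process_frame(frame_a, mask)
calls_after_a = cache.calls  # cold cache: every tile is computed
out_b = cache.process_frame(frame_b, mask)
calls_after_b = cache.calls  # only the edited tile is recomputed
```

In this toy, the second frame costs one model call instead of four because three tiles sit outside the mask and did not change. A real system would have to handle camera motion and features that depend on neighbouring regions, which exact-match caching like this does not.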
Why this matters if you are not training video models: streaming edits unlock a different set of products from clip-at-a-time editing. Live virtual-production touch-ups, in-headset AR effects that adapt to what the camera sees, and edit-while-recording creator tools all want the same shape of capability the authors are describing. Two of the problems the paper opens with, keeping backgrounds and non-edited regions stable over time, are exactly the things that have made earlier real-time attempts feel unusable.
The honest caveats are the ones the abstract does not address. 12.66 FPS is below the smoothness threshold most viewers associate with live video, the comparison baselines are unnamed in the abstract, and the benchmark is the authors' own, so external replication is still open. What the write-up also does not give you is the hardware the number was measured on, the resolution, or how the system holds up across the long clips its content-preservation pitch is built around.
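For a sense of scale, the reported rate can be turned into a per-frame time budget. The arithmetic below is a quick sketch, not a figure from the paper. The 24 and 30 FPS comparison points are common video frame rates chosen here as assumptions, and `frame_budget_ms` is a helper name made up for this example.

```python
REPORTED_FPS = 12.66  # figure reported for LiveEdit


def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame at a given frame rate."""
    return 1000.0 / fps


reported_ms = frame_budget_ms(REPORTED_FPS)

# Comparison targets are assumptions, not numbers from the paper.
speedup_needed = {}
for target_fps in (24.0, 30.0):
    speedup_needed[target_fps] = target_fps / REPORTED_FPS
    print(
        f"{target_fps:.0f} FPS needs {frame_budget_ms(target_fps):.1f} ms/frame, "
        f"a {speedup_needed[target_fps]:.2f}x speedup over {reported_ms:.1f} ms/frame"
    )
```

At 12.66 FPS each frame takes about 79 ms, so matching 30 FPS would need a further speedup of roughly 2.4x. That assumes everything else stays equal, which is hard to judge while the hardware and resolution are unstated.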
Still, the direction is the part to watch. If frame-by-frame streaming edits are now possible at all without melting the visuals, the rest of the stack (better teacher models, lighter caches, smarter region tracking) has a clear path.
Originally reported in the LiveEdit paper.
Original headline: LiveEdit Achieves 12.66 FPS Real-Time Streaming Video Editing via Diffusion Distillation (ECCV 2026)