huggingface.co web signal

Vera: Caltech and Netflix Propose Compositing-First Video Editing

TL;DR

  • Vera generates an edit layer and alpha matte composited onto the source video, so unchanged pixels are never regenerated by the model.
  • Vera-1.3B surpasses the strongest open-source baseline by 3.5 dB PSNR on background change and 6.3 dB on object addition.
  • The model uses three interacting DiTs, totaling 3.9B parameters in the 1.3B variant, and was trained on 486K frames of layered video data.

Most diffusion-based video editors share a core flaw: they regenerate every pixel, so regions that were never supposed to change, such as a character's face or the background scene, often drift anyway. Vera, a framework from researchers at Caltech and Netflix, takes a structurally different approach. Instead of regenerating the full clip, Vera produces an edit layer and an alpha matte that are composited onto the original source video, so the pixels that should not change are never touched.
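The compositing step itself is ordinary alpha blending. Here is a minimal sketch, assuming float frames in [0, 1] and a single-channel matte; the function name `composite` and the array shapes are illustrative and not taken from the paper or its code:

```python
import numpy as np

def composite(source, edit, alpha):
    """Blend an edit layer onto source frames using a per-pixel alpha matte.

    source, edit: float arrays of shape (T, H, W, 3), values in [0, 1]
    alpha:        float array of shape (T, H, W, 1); 0 keeps source, 1 takes edit
    (Shapes and value ranges are assumptions made for this sketch.)
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * edit + (1.0 - alpha) * source

# Toy clip: 2 frames of 4x4 pixels, with the edit confined to the top-left 2x2 block.
rng = np.random.default_rng(0)
source = rng.random((2, 4, 4, 3))
edit = rng.random((2, 4, 4, 3))
alpha = np.zeros((2, 4, 4, 1))
alpha[:, :2, :2, :] = 1.0

out = composite(source, edit, alpha)

# Wherever the matte is zero, the output is the source, bit for bit.
unchanged = alpha[..., 0] == 0
print(np.array_equal(out[unchanged], source[unchanged]))  # True
```

Because the blend multiplies the edit by zero outside the matte, preservation in those regions follows from the arithmetic rather than from anything the model learns.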

The architecture behind this is a Mixture-of-Transformers (MoT) design, with three separate diffusion transformers (DiTs), one each for the edit layer, the alpha matte, and a composite video, interacting through joint self-attention. All three DiTs are initialized from the Wan2.1 text-to-video model. The 1.3B variant yields 3.9B parameters in total; the 14B variant yields 42B. Training used 486K frames of layered data at 832x480 resolution, curated from synthetic composites, real-world video sources, and scenes with complex visual effects including shadows and reflections.
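The parameter totals follow from using three copies of the backbone: 3 × 1.3B = 3.9B and 3 × 14B = 42B. The joint self-attention idea can be sketched generically: each branch keeps its own projection weights, but attention runs over the concatenated tokens of all three. The code below is a single-head toy under that assumption, not Vera's actual layer. The function names, the dimensions, and the lack of an output projection are all choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(streams, weights):
    """Single-head attention shared across several token streams.

    streams: list of (N_i, D) arrays, one per branch (edit, alpha, composite)
    weights: list of (Wq, Wk, Wv) tuples of (D, D) arrays, one tuple per branch
    Returns a list of (N_i, D) arrays, one per branch.
    """
    # Each branch projects its tokens with its own weights.
    q = np.concatenate([x @ wq for x, (wq, _, _) in zip(streams, weights)])
    k = np.concatenate([x @ wk for x, (_, wk, _) in zip(streams, weights)])
    v = np.concatenate([x @ wv for x, (_, _, wv) in zip(streams, weights)])

    # Attention is computed over all branches' tokens together.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v

    # Split the result back into per-branch outputs.
    boundaries = np.cumsum([len(x) for x in streams])[:-1]
    return np.split(out, boundaries)

rng = np.random.default_rng(0)
D = 8
streams = [rng.standard_normal((5, D)) for _ in range(3)]
weights = [
    tuple(rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    for _ in range(3)
]
edit_out, alpha_out, comp_out = joint_self_attention(streams, weights)
print(edit_out.shape, alpha_out.shape, comp_out.shape)  # (5, 8) (5, 8) (5, 8)
```

The point of the shared attention matrix is that tokens in the edit branch can attend to tokens in the alpha and composite branches, which is one way the three outputs could stay mutually consistent.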

The content-preservation gains over existing open-source video editors are large. According to the paper, Vera-1.3B surpasses the strongest baseline by 3.5 dB PSNR on background change and 6.3 dB on object addition, while reducing structural error and perceptual distance by more than half. Vera-14B extends these margins to 4.5 dB and 7.1 dB, respectively. In a human preference study with 19 annotators and 513 valid trials, annotators preferred Vera-1.3B over all five baseline models on both content preservation and instruction compliance.
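To put the decibel figures in perspective: PSNR is a log-scale measure of mean squared error, so a gain of g dB on a single comparison corresponds to an MSE lower by a factor of 10^(g/10). The sketch below uses the standard PSNR definition with an assumed peak value of 1.0. The helper names are illustrative, and since benchmark PSNR is typically averaged over many clips, the ratios are only a rough guide.

```python
import math
import numpy as np

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, peak]."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (db_gain / 10.0)

for gain in (3.5, 6.3, 4.5, 7.1):
    print(f"{gain} dB -> MSE lower by about {mse_ratio(gain):.2f}x")
# 3.5 dB -> MSE lower by about 2.24x
# 6.3 dB -> MSE lower by about 4.27x
# 4.5 dB -> MSE lower by about 2.82x
# 7.1 dB -> MSE lower by about 5.13x
```

Read this way, the reported 6.3 dB margin on object addition means pixel-level error in preserved regions is roughly a quarter of the strongest baseline's.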

The honest caveat is inference cost. Vera-1.3B takes roughly 8.3 minutes per clip on a single A100 with 21.8 GB peak VRAM, around three times slower than VACE, its closest baseline. The authors note this overhead can be reduced with kernel fusion and sequence parallelism. The evaluation is also limited to two task types, object addition and background replacement; the paper notes that extending to relighting, complex visual effects, and other operations will require new layered training data that does not yet exist.

The more durable point is format. The edit layer and alpha matte Vera produces are standard compositing assets, compatible with iterative post-production workflows. A VFX artist can take them and refine further in any compositing tool rather than treating the model output as a final, opaque render. For studios where a single unintended pixel change can render an edit unusable, that separation between what the model generates and what it leaves alone is the thing worth watching.