MVTrack4Gen Lifts 4D Video Consistency via Attention Supervision
TL;DR
- Intermediate attention layers in video diffusion transformers (around layer 18 in the architectures studied) naturally encode both temporal and cross-view correspondence cues, even without explicit 3D training.
- Adding a point-tracking head and correspondence loss during training cuts MEt3R geometric error from 3.660 to 1.858 on the DAVIS benchmark with the ReCamMaster backbone.
- The training framework generalizes across two different video diffusion backbones, ReCamMaster and Redirector, without requiring 3D reconstruction at inference time.
Most approaches to generating novel-view video from monocular footage fall into two camps: those that lean on explicit 3D reconstruction to enforce geometric consistency, and those that skip reconstruction and condition only on camera parameters for visual quality. The first camp struggles with dynamic objects, where off-the-shelf reconstruction modules break down. The second gets cleaner images but loses geometric and motion consistency across views. According to the paper on Hugging Face, researchers from KAIST AI and Sony AI found a way to get most of what the reconstruction camp promises without the reconstruction step at inference.
The key observation is that specific intermediate attention layers in video diffusion transformers -- around layer 18 in the architectures studied -- already attend to geometrically corresponding regions across views and over time, even in models never explicitly trained for 3D correspondence. When these attention weights misalign, motion inconsistency follows. Rather than retrain from scratch or insert a 3D module, the team built a training framework called MVTrack4Gen that adds a multi-view point-tracking head at that layer and supervises it with two complementary objectives: a tracking loss on predicted point positions, and a correspondence loss directly on the attention weight matrices.
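To make the two objectives concrete, here is a minimal sketch of how a tracking loss and an attention correspondence loss might be combined for a single attention map. It is not the authors' implementation. The soft-argmax readout, the L1 tracking loss, the cross-entropy correspondence loss, the weight `lam`, and every function name below are assumptions made for illustration; the source only states that one loss acts on predicted point positions and the other on attention weights.

```python
import math

def softmax(row):
    """Turn one row of attention logits into a probability distribution."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax(attn_row, key_positions):
    """Expected (x, y) position under the attention distribution.
    Assumption: stands in for the learned point-tracking head."""
    x = sum(w * p[0] for w, p in zip(attn_row, key_positions))
    y = sum(w * p[1] for w, p in zip(attn_row, key_positions))
    return (x, y)

def tracking_loss(pred_points, gt_points):
    """Mean L1 distance between predicted and ground-truth points (assumed form)."""
    total = sum(abs(px - gx) + abs(py - gy)
                for (px, py), (gx, gy) in zip(pred_points, gt_points))
    return total / len(pred_points)

def correspondence_loss(attn, gt_index):
    """Cross-entropy that pushes each query's attention toward the key token
    it geometrically corresponds to (assumed form)."""
    eps = 1e-12
    return -sum(math.log(row[j] + eps) for row, j in zip(attn, gt_index)) / len(attn)

def total_loss(logits, key_positions, gt_index, lam=1.0):
    """Tracking loss plus lam times correspondence loss; lam is hypothetical."""
    attn = [softmax(row) for row in logits]
    pred = [soft_argmax(row, key_positions) for row in attn]
    gt_points = [key_positions[j] for j in gt_index]
    return tracking_loss(pred, gt_points) + lam * correspondence_loss(attn, gt_index)

# Toy setup: two query tokens in one view, three key tokens in another view.
key_positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
gt_index = [1, 2]  # query 0 corresponds to key 1, query 1 to key 2

aligned = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]     # attention peaks on the true match
misaligned = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # attention peaks on the wrong token

print("aligned loss:   ", total_loss(aligned, key_positions, gt_index))
print("misaligned loss:", total_loss(misaligned, key_positions, gt_index))
```

In this toy case the misaligned attention map yields a higher loss than the aligned one. That is the behavior a supervision signal of this kind needs: attention that drifts away from the true correspondence is penalized both through the predicted point position and directly on the attention weights.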
On the DAVIS benchmark with the ReCamMaster backbone, the geometric consistency metric MEt3R dropped from 3.660 to 1.858, a reduction of roughly 49%. For dynamic objects specifically, MEt3R_dynamic fell from 0.113 to 0.100 on ReCamMaster and from 0.086 to 0.073 on the Redirector backbone. The method also improved PSNR and LPIPS on the iPhone dataset. The gains held on two different backbone models, which suggests the layer-18 correspondence property is not a quirk of one particular architecture.
The honest caveat is that not every metric moves in the right direction: SSIM on the iPhone dataset fell from 0.338 to 0.270 with the ReCamMaster backbone, and the training pipeline requires 4 NVIDIA H100 GPUs for 13,000 iterations -- not a casual fine-tune. The reported results also cover two specific benchmarks; how the approach handles scenes with very dense motion or heavy occlusion is not addressed in the material available.
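The before-and-after figures quoted above sit on very different scales, so it can help to express each as a relative change. The short sketch below does only that arithmetic on the numbers reported in this article. The helper name `relative_change` and the dictionary labels are illustrative choices, not anything taken from the paper.

```python
def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

# (before, after) pairs as quoted in this article.
# Lower is better for MEt3R and MEt3R_dynamic; higher is better for SSIM.
reported = {
    "MEt3R (DAVIS, ReCamMaster)": (3.660, 1.858),
    "MEt3R_dynamic (ReCamMaster)": (0.113, 0.100),
    "MEt3R_dynamic (Redirector)": (0.086, 0.073),
    "SSIM (iPhone, ReCamMaster)": (0.338, 0.270),
}

for name, (before, after) in reported.items():
    print(f"{name}: {before} -> {after} ({relative_change(before, after):+.1f}%)")
```

Read this way, the headline MEt3R result is close to a halving of the error and the dynamic-object gains are in the low-to-mid teens in percentage terms. The SSIM regression is about a fifth of the baseline value, which is large enough to take seriously.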
For anyone building or deploying video diffusion systems, the more durable finding may be the architectural one: that correspondence structure is already encoded in intermediate attention layers before any explicit 3D supervision is applied. If that property holds broadly as new backbone architectures emerge, it opens a low-overhead path toward geometric consistency in 4D video generation -- the training objective does the work, and inference stays lean.
Originally reported by huggingface.co
Original headline: MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision Achieves SOTA Consistency on 4D Video Generation Without 3D Reconstruction at Inference Time