huggingface.co web signal

HKUST's DomainShuttle Posts 18.7% Cross-Domain Video Score Gain

TL;DR

  • DomainShuttle reports an 18.7% improvement in Cross-Domain Score over prior state-of-the-art subject-driven video generation methods.
  • Three components -- Domain-MoT, Video-Reference DualRoPE, and Cross-Pair Consistent Loss -- decouple subject features from domain-specific attributes.
  • Training took approximately 30,000 GPU-hours across two stages: first on a 200K image dataset, then on a 750K video dataset.

Most subject-driven video generation research has treated fidelity and creative flexibility as competing goals. According to researchers at the Hong Kong University of Science and Technology, existing methods focus primarily on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability when a prompt calls for novel styles, semantic combinations, or domain attributes.

Their new framework, DomainShuttle, proposes joint optimization of subject consistency and generative flexibility through three components. Domain-MoT (Mixture-of-Transformers) decouples video and reference features into independent processing branches and introduces a domain-aware AdaLN mechanism that explicitly injects domain attributes into the reference image branch rather than modulating both branches the same way. A Video-Reference DualRoPE scheme assigns reference image tokens their own positional encoding space, separate from video tokens, enabling more precise spatial modeling of individual subjects. A Cross-Pair Consistent Loss trains the model simultaneously against two different sets of reference images of the same subject, pushing it to learn intrinsic features that persist across viewpoints and lighting rather than copying surface details from a single frame.
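The positional split behind Video-Reference DualRoPE can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes plain 1-D rotary embeddings (video models typically use a 3-D time/height/width variant), and the function names and the ref_offset parameter are hypothetical, chosen only to show reference tokens being indexed in a range that never overlaps the video tokens.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a 1-D rotary position embedding to one token vector.

    Each consecutive pair of dimensions is rotated by an angle that
    grows with the token position and shrinks with the pair index.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dual_rope_positions(num_video_tokens, num_ref_tokens, ref_offset):
    """Return separate position indices for video and reference tokens.

    Assumption (hypothetical scheme): reference tokens start at
    ref_offset, which must lie beyond the last video position so the
    two index ranges are disjoint.
    """
    if ref_offset < num_video_tokens:
        raise ValueError("ref_offset must not overlap video positions")
    video_pos = list(range(num_video_tokens))
    ref_pos = [ref_offset + j for j in range(num_ref_tokens)]
    return video_pos, ref_pos


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    video_pos, ref_pos = dual_rope_positions(4, 2, ref_offset=1000)
    print("video positions:", video_pos)
    print("reference positions:", ref_pos)

    # Rotate a query at a video position and a key at a reference position.
    q = rope_rotate([1.0, 0.0, 0.5, -0.5], video_pos[0])
    k = rope_rotate([0.2, 0.3, -0.1, 0.4], ref_pos[0])
    print("video-to-reference score:", dot(q, k))
```

Because rotary attention scores depend only on the difference between two positions, placing reference tokens in a disjoint range keeps them from coinciding with any video-token position. How DomainShuttle actually constructs its second positional space is not described in this summary.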

The team built the model on the Wan2.1-14B-T2V and Wan2.2-14B-T2V base models, training on a 200K image dataset in the first stage and a 750K video dataset in the second, at a total reported cost of approximately 30,000 GPU-hours. On a test set of 110 in-domain and 110 cross-domain samples, with baselines including Kling 1.6, Phantom, and VACE, DomainShuttle records an 18.7% improvement in Cross-Domain Score over the best prior methods.

One caveat is that the cross-domain evaluation metrics -- CD-Score, Qwen-Score, NANO-CLIP, and Qwen-CLIP -- were designed specifically for this paper, so what they actually capture has not yet been independently validated. The paper also does not report inference latency or memory requirements for the backbone models, which matters for anyone evaluating whether the approach is deployable at scale.
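Although the paper gives no memory figures, a floor on weight storage follows from arithmetic alone. The sketch below assumes that the "14B" in the base model names denotes roughly 14 billion parameters held densely in memory. It ignores activations, text encoders, the VAE, and framework overhead, so real requirements would be higher. The function name is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 14 billion parameters, taken from the model names.
ASSUMED_PARAMS = 14e9

if __name__ == "__main__":
    for precision, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
        gb = weight_memory_gb(ASSUMED_PARAMS, nbytes)
        print(f"{precision}: about {gb:.0f} GB for weights only")
```

Under these assumptions, half-precision weights alone come to about 28 GB, which is a lower bound rather than a deployment estimate.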

The researchers cite advertising, creative design, and AI filmmaking as target application areas. If the results generalize, the most concrete operational benefit would be running a single model that handles both standard in-domain personalization and cross-domain style transfer, rather than maintaining a separate system for each use case.