TryOnCrafter Lets Shoppers View Clothing From Any Camera Angle
TL;DR
- TryOnCrafter defines a new task, Camera-controllable Video Virtual Try-on (CaM-VVT), decoupling synthesis from fixed source video camera trajectories.
- The system builds a clothed 3D Gaussian Splatting avatar animated via SMPL-X sequences as a geometric anchor for novel-viewpoint video generation.
- On the 180-sample ViViD-S benchmark, TryOnCrafter achieves a paired VFID_I of 9.6085 (lower is better), beating the prior best, DreamVVT, at 11.0180.
Every video virtual try-on system built to date shares one quiet constraint: it can only show the garment from whatever angle the original camera recorded. Walk around the model, check the back panel of a jacket? Not possible -- until now, according to a new paper from researchers at the Institute of Computing Technology (Chinese Academy of Sciences), the University of Chinese Academy of Sciences, Xiamen University, and Alibaba Group. They name the new task Camera-controllable Video Virtual Try-on (CaM-VVT), and their system is called TryOnCrafter.
The core move is to stop treating video generation as a purely pixel-space problem and build an explicit 4D model of the clothed person first. TryOnCrafter constructs what the authors call a Renderable 4D Try-on Proxy: a clothed 3D Gaussian Splatting avatar dressed in the target garment, animated via SMPL-X motion sequences estimated from the source video, and embedded into a reconstructed background point cloud in a shared world coordinate space. That proxy then serves as the structural anchor for a video Diffusion Transformer built on the Wan2.1-I2V-14B foundation model, which synthesizes photorealistic output video for any requested camera trajectory.
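To make the "shared world coordinate space" idea concrete, here is a minimal sketch of why an explicit proxy allows free camera control: once avatar and background points live in one world frame, any camera pose can project them. This is not the paper's renderer. It uses a plain pinhole projection rather than Gaussian splatting, and the function names, focal length, and image center are all assumptions made for illustration.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Camera axes (right, up, forward) for a camera at `eye` facing `target`.

    Assumes the viewing direction is not parallel to `up`.
    """
    forward = normalize(sub(target, eye))
    right = normalize(cross(forward, up))
    true_up = cross(right, forward)
    return right, true_up, forward

def project(point, eye, axes, focal=500.0, cx=256.0, cy=256.0):
    """Pinhole projection of a world-space point to (u, v, depth).

    Returns None when the point lies behind the camera.
    focal, cx, cy are illustrative values, not taken from the paper.
    """
    right, true_up, forward = axes
    rel = sub(point, eye)
    x, y, z = dot(rel, right), dot(rel, true_up), dot(rel, forward)
    if z <= 0:
        return None
    return (cx + focal * x / z, cy - focal * y / z, z)

# Hypothetical proxy content: two "avatar" points and one "background" point.
world_points = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 2.0]]
subject = [0.0, 0.0, 0.0]
for eye in ([0.0, 0.0, -3.0], [3.0, 0.0, 0.0]):  # front view, then side view
    axes = look_at(eye, subject)
    print(eye, [project(p, eye, axes) for p in world_points])
```

The same three world points yield different image coordinates for each camera, which is the property a pixel-space-only pipeline lacks: it has no scene to re-project when the viewpoint changes.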
The paper establishes a new benchmark -- CaM-VVTBench -- with a training set of approximately 60K video clips and a test set of 96 samples across six predefined camera motion types: tilt up, tilt down, zoom in, zoom out, orbit left, and orbit right. On the existing ViViD-S benchmark (180 test samples), TryOnCrafter achieves a paired VFID_I of 9.6085, compared to 11.0180 for the prior best (DreamVVT); lower is better for this metric, so the score is about 12.8% lower than DreamVVT's. The framework also enables downstream applications the paper demonstrates: human relocalization, "bullet time" effects where a static pose is rendered from a moving camera, and 360-degree orbital viewing that existing frameworks could not support.
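The six camera motion types can be pictured as simple per-frame camera schedules. The sketch below is one hypothetical parameterization, not the benchmark's actual definition: the angle range, zoom factor, orbit radius, left/right sign convention, and the `camera_trajectory` function itself are all assumptions, with the subject placed at the world origin.

```python
import math

MOTIONS = ("tilt_up", "tilt_down", "zoom_in", "zoom_out",
           "orbit_left", "orbit_right")

def camera_trajectory(motion, n_frames=5, radius=3.0, max_angle_deg=30.0,
                      base_focal=500.0, max_zoom=1.5):
    """Return one camera dict per frame for a named motion.

    Each dict holds `eye` (x, y, z), `pitch_deg` (rotation about the
    camera's horizontal axis), and `focal` (pixels). All defaults are
    illustrative assumptions, not values from the paper.
    """
    if motion not in MOTIONS:
        raise ValueError(f"unknown motion: {motion}")
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1) if n_frames > 1 else 0.0  # progress in [0, 1]
        eye, pitch, focal = (0.0, 0.0, radius), 0.0, base_focal
        if motion.startswith("orbit"):
            # Move the camera on a horizontal circle around the subject.
            sign = -1.0 if motion == "orbit_left" else 1.0
            a = math.radians(sign * max_angle_deg * t)
            eye = (radius * math.sin(a), 0.0, radius * math.cos(a))
        elif motion.startswith("tilt"):
            # Keep the position fixed and rotate the view up or down.
            sign = 1.0 if motion == "tilt_up" else -1.0
            pitch = sign * max_angle_deg * t
        else:
            # Zoom by scaling focal length; the camera does not move.
            scale = max_zoom ** t
            focal = base_focal * scale if motion == "zoom_in" else base_focal / scale
        frames.append({"eye": eye, "pitch_deg": pitch, "focal": focal})
    return frames

for name in MOTIONS:
    print(name, camera_trajectory(name, n_frames=3)[-1])
```

Under this parameterization, holding the body pose fixed while stepping through an orbit schedule would correspond to the "bullet time" effect, and extending the orbit angle to 360 degrees to full orbital viewing.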
The authors name two caveats themselves. First, extreme viewpoint transitions surface inaccuracies in the SMPL-X body model, occasionally producing misaligned hand poses or geometric inconsistencies. Second, the iterative DiT denoising process carries inference costs the paper explicitly describes as "hindering real-time interaction for trajectory edits." The paper does not report wall-clock inference time per video, nor does it indicate whether model weights or code will be released.
For fashion and e-commerce practitioners, Alibaba's co-authorship suggests at least one large platform is tracking this closely. If inference costs come down -- DiT efficiency has been moving quickly -- the gap between try-on from one fixed angle and try-on from any angle closes in a meaningful way for shoppers who want to see the back of a coat before buying it.
Originally reported by huggingface.co
Original headline: TryOnCrafter: First Camera-Controllable Video Virtual Try-On via Renderable 4D Proxy From Chinese Academy of Sciences and Alibaba