
ViDiHand uses video diffusion for 4D hand reconstruction

TL;DR

  • ViDiHand adapts a pretrained video diffusion model to reconstruct 4D two-hand pose from egocentric video using a hand-overlay rendering objective.
  • The pipeline operates on full frames with no detector, no motion infiller, and no test-time optimization, and is evaluated on ARCTIC, HOT3D, and HOI4D.
  • The authors frame the result as a route to scalable in-the-wild data collection for embodied AI, leaning on priors learned at internet-scale video.

A new preprint posted to Hugging Face by researchers at Nanyang Technological University and Shanghai Jiao Tong University makes a claim that sounds like a stretch until you read the abstract: the place to look for better hand tracking from monocular video is not a more carefully engineered hand pipeline, but the representations a video diffusion model has already learned while being trained to generate coherent video.

The system is called ViDiHand. It adapts a pretrained video diffusion backbone with what the authors describe as a "hand-overlay rendering objective" that specializes features for hands while preserving the model's world priors, then trains a decoder to recover metric-scale pose from those adapted features. The framing the authors use for the pipeline is striking: "no detector, no infiller, and no test-time optimization." That matters because the existing literature in this corner of computer vision is mostly a chain of those exact components, and each is a place where occlusion or hand-object contact tends to break things.

The motivating intuition is straightforward once stated. Video generative models trained to synthesize coherent video at internet scale must, the authors argue, implicitly acquire motion dynamics, occlusion reasoning, and hand-object interaction. Specialized hand-pose pipelines, by contrast, learn from scarce hand-pose annotations, which the authors call "a narrow signal insufficient to model" those dynamics. ViDiHand is the bet that reusing the bigger model's priors beats building a smaller specialist from scratch. On three egocentric benchmarks, ARCTIC, HOT3D, and HOI4D, the paper reports that ViDiHand "substantially outperforms prior methods."

The honest caveat is that this is a single preprint, not yet independently reproduced, and the reported numbers come from the authors' own evaluation protocol. Inference is also not cheap: the writeup lists 5.5 fps on four A100 GPUs, which is firmly offline annotation territory rather than real-time input for an AR headset or a teleoperated hand. The reporting also doesn't say how the method behaves outside egocentric video or how robust the win is to distribution shift beyond the three named datasets.
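To make the throughput figure concrete, here is a back-of-the-envelope sketch of what 5.5 fps on four A100 GPUs would mean for annotating recorded footage. The 5.5 fps and four-GPU numbers come from the writeup; everything else is an assumption for illustration: a 30 fps source video, every frame being processed, and 5.5 fps being the combined throughput of all four GPUs rather than a per-GPU rate. The function name annotation_cost is hypothetical.

```python
def annotation_cost(video_seconds, source_fps=30.0, throughput_fps=5.5, num_gpus=4):
    """Estimate offline annotation cost for a recorded clip.

    Assumptions (not from the paper): source_fps is 30, every frame is
    processed, and throughput_fps is the aggregate rate across num_gpus.
    """
    frames = video_seconds * source_fps
    wall_seconds = frames / throughput_fps
    return {
        "frames": frames,
        "wall_hours": wall_seconds / 3600.0,
        # GPU-hours count every GPU as busy for the full wall-clock time.
        "gpu_hours": wall_seconds * num_gpus / 3600.0,
        # How many times slower than real time the pipeline runs.
        "slowdown": source_fps / throughput_fps,
    }


# One hour of footage under the assumptions above.
cost = annotation_cost(video_seconds=3600)
print(f"wall-clock hours: {cost['wall_hours']:.2f}")
print(f"GPU-hours:        {cost['gpu_hours']:.1f}")
print(f"slowdown vs real time: {cost['slowdown']:.2f}x")
```

Under those assumptions, one hour of 30 fps video takes roughly 5.5 hours of wall-clock time and about 22 A100 GPU-hours, which is workable for batch dataset annotation but far from interactive use. If the source frame rate differs, or if 5.5 fps turns out to be a per-GPU figure, the numbers change proportionally.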

If the result holds up, the more interesting implication is strategic rather than numerical. The bottleneck in articulated motion capture from monocular video may be less about building better hand-specific models and more about figuring out which foundation video models to repurpose and how to adapt them cheaply. The authors explicitly frame their work as "a promising route to scalable in-the-wild data collection for embodied AI," which is the downstream application they have in mind.