HOPformer beats hand-object pose SOTA on ARCTIC by 6.2 points
TL;DR
- EPIC-Contact provides 2.3K clips and 62.3K frames of in-the-wild egocentric footage with dense, bijective 3D hand-object contact correspondences and posed meshes.
- HOPformer is an end-to-end transformer that jointly predicts bi-manual hand and object pose in a single forward pass using a cross-attention decoder.
- The model reaches 82.4% success rate on ARCTIC, 6.2 points above prior state of the art, and nearly doubles success rate on EPIC-Contact while cutting contact deviation by 75%.
Hand-object pose estimation from head-mounted cameras has been stuck on a familiar gap: models trained in a lab often do not survive contact with real kitchens, workshops, or hands holding things at awkward angles. A new paper on arXiv from Bansal, Zhu, Tripathi, Zhao, Black and Damen proposes both a dataset and an architecture aimed squarely at that gap.
The dataset, EPIC-Contact, is described as an in-the-wild egocentric collection of 2.3K clips totaling 62.3K frames, with dense, bijective 3D hand-object contact correspondences and posed meshes. The model, HOPformer, is an end-to-end transformer that jointly predicts bi-manual hand and object pose in a single forward pass, with a cross-attention decoder that conditions object features on hand priors. On the in-lab ARCTIC benchmark, the authors report an 82.4% success rate, a 6.2 point gain over current state of the art. On their own EPIC-Contact set, they say HOPformer nearly doubles the success rate and cuts contact deviation by 75%.
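To make "a cross-attention decoder that conditions object features on hand priors" concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration of the general mechanism, not the paper's implementation: the abstract does not specify layer sizes, projections, or head counts, so the assumption here is simply that object features act as queries and hand tokens act as keys and values. All names (`cross_attention`, `object_queries`, `hand_tokens`) and the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(object_queries, hand_tokens):
    """Single-head scaled dot-product cross-attention.

    Assumption for illustration: object features are the queries and
    hand tokens serve as both keys and values. Learned projection
    matrices are omitted to keep the sketch short.
    """
    dim = len(hand_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in object_queries:
        # Similarity of this object query to every hand token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in hand_tokens]
        weights = softmax(scores)
        # Weighted mix of hand tokens: the "hand-conditioned" object feature.
        mixed = [sum(w * v[d] for w, v in zip(weights, hand_tokens)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data, not from the paper: two hand tokens and one object query.
hand_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
object_queries = [[4.0, 0.0, 0.0, 0.0]]
attended, weights = cross_attention(object_queries, hand_tokens)
print(weights[0])
```

In this toy case the object query aligns with the first hand token, so most of the attention weight lands there. A real decoder would add learned query, key, and value projections, multiple heads, and residual connections.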
Why this matters: reliable egocentric hand-object tracking is the perception layer sitting underneath things people actually want. Robot policy learning from human demonstration, AR overlays that annotate what your hands are doing during a repair or a recipe, and hand-driven controls on headsets all need to know where the fingers are and where they touch the thing being held, from a first-person camera, without a lab rig. A dataset with dense contact labels rather than sparse keypoints is the more useful shape for that downstream work.
The honest caveat is that these are the authors' reported numbers on the benchmark they themselves introduced, so "nearly doubles" is a claim, not yet a reproduced result. The abstract does not give latency, model size, or how HOPformer degrades on out-of-distribution objects or when hands leave frame. Success rate is a thresholded metric, and the paper's own release is currently the only source.
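Why "thresholded" matters can be shown with a short sketch. The abstract does not define the error measure or the threshold behind the reported success rate, so everything below is an assumption: `success_rate` is a hypothetical helper, the errors are made-up values in arbitrary units, and the thresholds are chosen only to show how the same predictions can score very differently.

```python
def success_rate(errors, threshold):
    """Fraction of samples whose error is at or below the threshold."""
    if not errors:
        raise ValueError("errors must be non-empty")
    hits = sum(1 for e in errors if e <= threshold)
    return hits / len(errors)


# Hypothetical per-frame errors in arbitrary units; not from the paper.
errors = [0.02, 0.04, 0.05, 0.06, 0.09, 0.12, 0.20, 0.35]

for t in (0.05, 0.10, 0.15):
    print(f"threshold={t:.2f} success_rate={success_rate(errors, t):.3f}")
```

The same eight predictions score 0.375, 0.625, or 0.750 depending on the cutoff, which is why a success rate is only comparable across papers when the error measure and threshold match.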
The forward-looking bit is that the code, checkpoints, and dataset are being released, which is what turns a benchmark win into something robotics and AR teams can actually try.
Shared on Bluesky by 2 AI experts
- *NEW* Our #ECCV2026 @eccv.bsky.social paper Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation Now on ArXiv w Dataset, Code&model sid2697.github.io/epic-contact/ arxiv.org/abs/2606.30598 Two contributions: 1. …
Originally reported by arxiv.org
Original headline: Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation