Relative Wrist Translation Triples Bimanual Robot Success Rate
TL;DR
- Dropping noisy 6DoF hand-pose estimates in favor of relative wrist translation tripled task success, from 12.50% to 38.33%.
- Pre-training used roughly 600 hours of human video, including ~500 hours of outsourced free-form household manipulation.
- Performance scales with human data volume, pointing to cheap video as the expansion lever rather than more robot teleoperation.
For years the promise of training robots directly on human video has run into a specific, mundane obstacle: reconstructing precise 6DoF hand poses from video is noisy, and human fingers bear little kinematic resemblance to parallel robotic grippers. A new preprint on arXiv from Sijin Chen and colleagues proposes a disarmingly simple fix -- stop trying to transfer rotation at all.
The core idea is to define a shared action space using only relative wrist translation within an initial head-camera coordinate frame. Both a human wrist and a robot end-effector move through space; the translation part of that motion can be meaningfully shared without either side having to agree on orientation. The researchers call this the bridging action, and the performance difference is concrete: a model trained with this bridging objective reached a 38.33% overall success rate across 15 bimanual manipulation tasks, compared to 12.50% for an otherwise equivalent model trained without it.
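As a rough illustration of the idea, the sketch below expresses a wrist (or end-effector) trajectory as displacements from its starting point, measured along the axes of the head camera at the first frame. It is not the paper's implementation: the function name, the world-frame inputs, and the choice to measure displacement from the first timestep are all assumptions made for this example.

```python
import numpy as np

def relative_wrist_translation(wrist_world, R_cam0):
    """Hypothetical helper: wrist displacement in the initial head-camera frame.

    wrist_world: (T, 3) wrist or end-effector positions in some world frame.
    R_cam0: (3, 3) rotation whose columns are the first-frame camera axes
        expressed in world coordinates (camera-to-world rotation).
    Returns (T, 3) displacements from the first wrist position, expressed
    along the initial camera axes. Wrist orientation is never used.
    """
    wrist_world = np.asarray(wrist_world, dtype=float)
    R_cam0 = np.asarray(R_cam0, dtype=float)
    # Displacement from the starting position; any constant offset
    # (including the camera's own position) cancels in this difference.
    delta_world = wrist_world - wrist_world[0]
    # Row-vector form of R^T @ delta: rotate world deltas into camera axes.
    return delta_world @ R_cam0

# Assumed example: camera x-axis points along world +y.
R_cam0 = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
human_wrist = np.array([[0.5, 0.2, 1.0],
                        [0.5, 0.3, 1.0]])   # moves 0.1 along world +y
robot_gripper = human_wrist + np.array([2.0, -1.0, 0.3])  # same motion, elsewhere

human_action = relative_wrist_translation(human_wrist, R_cam0)
robot_action = relative_wrist_translation(robot_gripper, R_cam0)
```

Under these assumptions the human and robot trajectories map to the same action sequence even though they start at different positions, which is the sense in which translation can be shared without agreeing on orientation.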
The pre-training dataset assembled for this work runs to roughly 600 hours of human action video -- around 70 hours drawn from EgoDex, around 500 hours of outsourced free-form household manipulation, and around 45 hours of in-lab recordings. Robot co-training adds roughly 72 hours of pick-and-place data across 100 objects. The model itself is approximately 4 billion parameters, built on a Mixture-of-Transformer architecture using flow matching for action generation. The task suite covers microwave operations, drawer handling, mug hanging, cup stacking, straw insertion, and charger unplugging, evaluated across two scenes per task.
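A quick tally of the reported figures shows how the data mix breaks down. The hour counts are the article's approximations ("around" and "roughly"), so the results are approximate too; the variable names are illustrative only.

```python
# Approximate hours as reported in the article.
human_hours = {
    "EgoDex": 70,
    "outsourced household": 500,
    "in-lab": 45,
}
robot_hours = 72  # pick-and-place co-training data

total_human = sum(human_hours.values())  # 615, consistent with "roughly 600"
human_share = total_human / (total_human + robot_hours)
outsourced_share = human_hours["outsourced household"] / total_human

print(f"human video: ~{total_human} h")
print(f"human share of all training hours: {human_share:.1%}")
print(f"outsourced share of human video: {outsourced_share:.1%}")
```

On these numbers, human video makes up close to nine tenths of the total training hours, with outsourced household footage accounting for most of it.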
The honest caveat is that tripling the baseline still leaves a majority of attempts failing: a 38.33% success rate means roughly 62% of attempts do not succeed. The paper also does not address tasks where wrist rotation is the actual operative variable -- screwing a cap, inserting an angled connector -- and it is an open question whether translation-only bridging is sufficient for that class of manipulation. Nor does it provide a direct comparison against a model trained on equivalent hours of full robot teleoperation, so the gap between the two approaches remains unquantified.
What the paper does establish is a scaling relationship: performance improves as more human data is added. For teams building bimanual systems on constrained budgets, that is the most actionable finding. The expansion path becomes adding cheap household video rather than proportionally more expensive robot demonstration time.
Originally reported by paper
Original headline: Shared Wrist-Translation Coordinate Frame Bridges Human Video to Bimanual Robot Policies