TAP Pretraining Hits 25% Camera Robustness Where VLAs Collapse to 0%

TL;DR

  • A new ICML 2026 paper introduces TAP, splitting motor-skill pretraining from language grounding so VLAs can learn from cheap unlabeled robot data.
  • On real WidowX robots, TAP reports 25% task success under camera perturbations while internet-scale baselines collapse to 0%.
  • The authors claim TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data.

There is a quiet argument buried in a new ICML 2026 paper out of Fudan, "Learning to Move Before Learning to Do," and it is more interesting than the headline number. Junhao Shi and coauthors claim that Vision-Language-Action models are so hungry for expert teleoperation data because we are asking them to learn two very different things at once: how to physically move, and how to map language onto a task. Their proposal, Task-Agnostic Pretraining (TAP), splits those apart.

The recipe, as described, has two stages. The first is an unsupervised motor-learning phase on cheap unlabeled data (off-task trajectories and autonomous robot play), trained with a self-supervised inverse dynamics objective, which in its usual form means predicting the action that connects two consecutive observations. The second is a much smaller language-grounding phase on top, using minimal expert demonstrations. The authors' pitch is that most of the expensive labeled data current VLAs consume is being spent teaching motor skills that did not actually need supervision.
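To make the first stage concrete, here is a minimal sketch of an inverse dynamics objective on simulated "robot play" data. It is not the authors' implementation: the 2-D state, the additive toy dynamics, the linear model, and every function name (make_play_data, train_inverse_dynamics, and so on) are assumptions chosen for illustration. The point it shows is that the training target is the action the robot itself executed, so no task labels, language, or expert demonstrations are involved.

```python
import random


def make_play_data(n, seed=0):
    """Simulate unlabeled play: random actions, no task or language labels.

    Toy assumption: 2-D state, and the next state is simply state + action.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        s_next = [s[0] + a[0], s[1] + a[1]]
        data.append((s, s_next, a))
    return data


def predict(w, s, s_next):
    """Linear inverse dynamics model: a_hat = W @ concat(s, s_next)."""
    x = s + s_next  # list concatenation -> 4 features
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def train_inverse_dynamics(data, epochs=30, lr=0.1):
    """Fit W with per-sample SGD on squared action-prediction error."""
    w = [[0.0] * 4 for _ in range(2)]
    for _ in range(epochs):
        for s, s_next, a in data:
            x = s + s_next
            a_hat = predict(w, s, s_next)
            for i in range(2):
                err = a_hat[i] - a[i]  # target is the logged action itself
                for j in range(4):
                    w[i][j] -= lr * err * x[j]
    return w


def mse(w, data):
    """Mean squared error between predicted and executed actions."""
    total = 0.0
    for s, s_next, a in data:
        a_hat = predict(w, s, s_next)
        total += sum((p - t) ** 2 for p, t in zip(a_hat, a))
    return total / (2 * len(data))


if __name__ == "__main__":
    play = make_play_data(200, seed=1)
    held_out = make_play_data(50, seed=2)
    w = train_inverse_dynamics(play)
    print("held-out action MSE:", mse(w, held_out))
```

In this toy setting the model only has to recover "action equals next state minus state". A real VLA would replace the linear map with a vision backbone over camera frames, and the second, language-grounding stage is not shown here because the summary above gives no details on how it is implemented.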

The results they lead with are eye-catching. On the SIMPLER benchmark they report a 10% absolute gain over standard behavior cloning. On real WidowX arms under camera perturbations, TAP reportedly holds 25% success while internet-scale baselines collapse to 0%. And they claim to match models trained on over 1M expert trajectories while using, in their words, orders of magnitude less labeled data. Treat these as one group's numbers on its own setup, not as settled results.

What the arXiv abstract does not give you is which internet-scale VLAs actually hit that 0% number, at what perturbation magnitudes, or how many labeled demonstrations TAP itself needed to reach parity. Nor does it say anything about transfer to arms other than the WidowX, or to bimanual and mobile platforms where the motor-skill assumption may not be as clean.

Still, the decomposition is the part worth watching. If learning to move really is separable from learning to follow instructions, the smaller labs and startups that cannot afford million-trajectory datasets suddenly have a credible pretraining path, and the cost curve for building competent household or lab robots gets meaningfully friendlier.