
Microsoft Research Lifts VLA Robot Success 42% to 76% Zero-Shot

TL;DR

  • An object-pose residual RL policy trained only in simulation lifted a frozen VLA's real-robot success rate from 42% to 76% across five tasks.
  • Per-task gains varied: cube lift went from 7/20 to 17/20 and drawer closing from 14/20 to a perfect 20/20, but cup-standing only moved from 5/20 to 8/20.
  • Successful residual-corrected rollouts can retrain the base VLA itself, enabling a self-improvement loop without collecting more teleoperation data.

A new robotics paper hosted on Microsoft Research is worth pausing on, because the failure that usually breaks sim-to-real transfer largely did not occur. A residual reinforcement learning policy trained entirely in simulation, with no real-world fine-tuning, reportedly lifted the success rate of a Vision-Language-Action (VLA) robot policy from 42% to 76% across five manipulation tasks on a real Franka Research 3 arm.

The setup is worth describing because it is the trick. The team, Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, and Yasuyuki Matsushita, froze a base VLA and trained a small corrective policy on top of it. Instead of feeding that residual raw images, which is where sim-to-real usually falls apart on the visual domain gap, they conditioned it on object poses, proprioception, and the base VLA's own action. To align the two worlds they replayed the same teleoperation demonstrations inside the simulator to train a sim counterpart of the real VLA, then trained the residual only in simulation with pose noise injection and dropout so the policy would not overfit to a clean simulator.
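To make the idea concrete, here is a minimal sketch of how a residual of this kind might be wired around a frozen base policy. It assumes the common additive form of residual RL (corrected action = base action + scaled residual), which the reporting does not confirm. The function names, the noise level, the dropout convention (zeroing the whole pose), and the `scale` value are all hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence

Vector = List[float]


def perturb_pose(pose: Sequence[float], noise_std: float,
                 dropout_prob: float, rng: random.Random) -> Vector:
    """Mimic imperfect pose estimation during simulated training.

    Assumed convention: with probability dropout_prob the whole pose is
    zeroed out; otherwise Gaussian noise is added to every component.
    """
    if rng.random() < dropout_prob:
        return [0.0] * len(pose)
    return [p + rng.gauss(0.0, noise_std) for p in pose]


def residual_input(pose: Sequence[float], proprio: Sequence[float],
                   base_action: Sequence[float]) -> Vector:
    """Build the residual policy's observation: no images, only object
    pose, proprioception, and the frozen base VLA's proposed action."""
    return list(pose) + list(proprio) + list(base_action)


def corrected_action(base_action: Sequence[float],
                     residual: Sequence[float],
                     scale: float = 0.1) -> Vector:
    """Additive correction (an assumption): base action plus a small,
    scaled residual. The base policy itself is never modified."""
    if len(base_action) != len(residual):
        raise ValueError("base_action and residual must have equal length")
    return [a + scale * r for a, r in zip(base_action, residual)]
```

During simulated training, the pose would pass through `perturb_pose` before `residual_input`. At deployment, the same input would be built from the real pose estimate, so the residual never sees pixels from either domain.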

The per-task breakdown on the project page shows the gain is real but uneven. Cube lifting jumped from 7/20 to 17/20, drawer closing went from 14/20 to a perfect 20/20, pick-and-place from 9/20 to 16/20, stacking from 7/20 to 15/20, and cup-standing only crawled from 5/20 to 8/20. The authors also report that the successful rollouts can be reused to retrain the base VLA without any new human demonstrations, a self-improvement loop that, if it holds, changes the economics of collecting robot data.
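The per-task counts above can be checked against the headline figures with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the task labels and the `aggregate_rate` helper are illustrative names.

```python
# (successes before, successes after), each out of 20 real-robot trials
TRIALS_PER_TASK = 20
results = {
    "cube lift": (7, 17),
    "drawer close": (14, 20),
    "pick-and-place": (9, 16),
    "stacking": (7, 15),
    "cup standing": (5, 8),
}


def aggregate_rate(successes, trials_per_task=TRIALS_PER_TASK):
    """Pooled success rate over all tasks, each with the same trial count."""
    return sum(successes) / (trials_per_task * len(successes))


before = aggregate_rate([b for b, _ in results.values()])
after = aggregate_rate([a for _, a in results.values()])
print(f"{before:.0%} -> {after:.0%}")  # prints "42% -> 76%"
```

The pooled totals are 42/100 and 76/100, so the per-task table and the headline numbers agree. The relative gain is roughly 1.8x.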

The honest caveat is that all five tasks are short-horizon benchtop primitives, and the method assumes you can get a clean object-pose estimate at deployment, which is itself a non-trivial perception problem the paper does not solve. What the reporting does not give you is how the baseline compares to other VLAs, or how the residual behaves on novel object instances the sim never saw.

For anyone running a VLA in the field, the takeaway is that a cheap sim-only post-training pass that raises task success from 42% to 76%, a roughly 1.8x gain, without collecting one more real demonstration, is the kind of recipe robotics teams will try to copy quickly.