
SimFoundry Predicts Real Robot Policy Success From Sim at 0.911 Correlation

TL;DR

  • SimFoundry constructs simulation environments zero-shot from a single video, requiring no manual scene authoring.
  • Across 7 manipulation tasks and 5 policy architectures, sim evaluations correlated with real-world performance at mean Pearson 0.911.
  • Policies trained on task-variation digital cousins improved average success rate by 40% in zero-shot real-world transfer.

Benchmarking robot policies in the physical world is slow, expensive, and hard to scale. Each evaluation requires robot time, careful environment resets, and human supervision. A new paper introducing SimFoundry, available as a preprint on arXiv, attacks that bottleneck by building simulation environments automatically from video and then asking a pointed question: can those simulations replace physical evaluation for ranking which policies are actually worth deploying?

The headline result is a mean Pearson correlation of 0.911 between simulation evaluation scores and real-world performance, measured across 7 manipulation tasks and 5 policy architectures. A mean maximum ranking violation of 0.018 suggests the system rarely scrambles the relative order of policies. Within the tested settings, that is a strong enough signal to treat simulation as a reliable proxy for physical testing, not just a useful training environment.
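To make the two metrics concrete, here is a minimal sketch of how they could be computed for one task. It assumes a common definition of mean maximum ranking violation from prior sim-to-real evaluation work (for each policy, the largest real-world performance gap to any policy whose order the simulator gets wrong, averaged over policies); whether SimFoundry uses exactly this formula is an assumption. The success rates and the function names are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mean_max_rank_violation(sim, real):
    """Assumed definition: for each policy i, take the largest real-world gap
    |real[i] - real[j]| over policies j whose relative order is flipped in sim,
    then average over all policies."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            flipped = (sim[i] < sim[j]) != (real[i] < real[j])
            if flipped:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Hypothetical success rates for five policies on one task (illustrative only).
sim_success = [0.20, 0.45, 0.50, 0.70, 0.90]
real_success = [0.15, 0.50, 0.40, 0.65, 0.85]

print("Pearson r:", round(pearson(sim_success, real_success), 3))
print("MMRV:", round(mean_max_rank_violation(sim_success, real_success), 3))
```

In this toy data only the second and third policies swap order, and they differ by 0.10 in real success, so the violation is 2 × 0.10 / 5 = 0.04. A low value therefore means that any ordering mistakes involve policies that perform almost the same in the real world.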

The pipeline works by taking a video of a real scene and constructing a digital twin automatically with no manual authoring. From that base twin, the system generates what the authors call "digital cousins" -- affordance-preserving variations of the original objects, scene layout, and tasks. Policies trained on those variations showed average success rate improvements of 17% for object cousins, 21% for scene cousins, and 40% for task cousins in zero-shot real-world transfer.

The main caveat is that 7 tasks is a small sample, and a mean correlation can hide individual tasks where the sim-real gap is much larger. The evaluation centers on manipulation; whether the reconstruction quality and correlation numbers hold for more varied robotic settings is a question the paper does not address. The paper also does not report a failure rate for the automated reconstruction pipeline itself -- that is, how often a single video yields a sim environment good enough to train on reliably.

If the correlation holds broadly, the practical implication is clear: teams could run most policy ranking experiments in simulation and reserve physical robot time for final validation of top candidates rather than full sweeps. For research groups with limited hardware access, that shift in where evaluation happens could be as consequential as the policies themselves.
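The workflow described above can be sketched as a simple shortlist step: rank every candidate by its simulated success rate and send only the top few to the physical robot. The function name, policy names, and scores below are hypothetical placeholders, not part of SimFoundry.

```python
def shortlist_for_real_eval(sim_scores, k):
    """Return the names of the k policies with the highest sim success rate.

    sim_scores: dict mapping a policy name to its simulated success rate.
    """
    ranked = sorted(sim_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical sim results for a sweep of candidate policies (illustrative only).
sim_scores = {
    "policy_a": 0.42,
    "policy_b": 0.81,
    "policy_c": 0.67,
    "policy_d": 0.79,
    "policy_e": 0.30,
}

# Only these candidates would consume physical robot time.
print(shortlist_for_real_eval(sim_scores, k=2))
```

How large to make k depends on how far the sim ranking can be trusted: the closer two policies score in simulation, the more reason there is to include both in the real-world check.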