huggingface.co web signal

Foresight Detects Robot Failures Using World-Model Predictions

robotics computer vision robotics-research world-models failure-detection

TL;DR

  • Foresight detects robot task failures using only trajectory-level success/failure labels, requiring no frame-by-frame temporal annotations.
  • On BEHAVIOR-1K, where tasks average 8,557 steps, Foresight outperforms the best baseline by 0.14 in balanced accuracy.
  • The policy-agnostic framework was validated across six policies: ACT, OpenVLA, π₀-FAST, π₀.5, SmolVLA, and GR00T N1.5.

When a robot arm works through a long, multi-step manipulation task spanning hundreds or thousands of timesteps, knowing precisely when something has gone wrong is harder than it sounds. Failures tend to emerge gradually, and annotating the exact frame at which each failure begins is expensive. Researchers from the University of Michigan, Princeton University, and the University of Virginia address this with Foresight, a failure detection framework that needs only final trajectory-level success or failure labels for training.

The method uses V-JEPA 2-AC as a backbone world model. Given a robot's current observations and planned actions, the world model predicts what the latent state should look like in the near future. A causal Transformer synthesizes these predicted latents across time to produce a per-timestep failure score, and functional conformal prediction calibrates adaptive thresholds that account for how failure probability shifts across a trajectory's timeline.
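To make the idea of a time-varying calibrated threshold concrete, here is a minimal sketch of one common way to build such a band from per-timestep failure scores. It is not the paper's exact procedure: the function names `calibrate_band` and `first_alarm`, the toy score curves, and the mean-plus-scaled-spread band shape are all assumptions chosen for illustration. For brevity the sketch also estimates the band shape and the conformal quantile from the same successful rollouts, whereas a careful implementation would use separate splits.

```python
import math

def calibrate_band(success_curves, alpha=0.1):
    """Build a per-timestep threshold from score curves of successful rollouts.

    success_curves: list of equal-length lists of failure scores (hypothetical data).
    alpha: target rate of false alarms on successful rollouts.
    """
    n = len(success_curves)
    horizon = len(success_curves[0])
    # Per-timestep mean and spread give the band its time-varying shape.
    mean = [sum(c[t] for c in success_curves) / n for t in range(horizon)]
    spread = [
        max(1e-6, math.sqrt(sum((c[t] - mean[t]) ** 2 for c in success_curves) / n))
        for t in range(horizon)
    ]
    # One nonconformity value per rollout: its worst normalized exceedance over time.
    nonconformity = sorted(
        max((c[t] - mean[t]) / spread[t] for t in range(horizon))
        for c in success_curves
    )
    # Conformal quantile with the usual (n + 1) finite-sample correction.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = nonconformity[rank - 1]
    return [mean[t] + q * spread[t] for t in range(horizon)]

def first_alarm(scores, band):
    """Return the first timestep whose score exceeds the band, or None."""
    for t, (score, limit) in enumerate(zip(scores, band)):
        if score > limit:
            return t
    return None

# Toy calibration set: five successful rollouts whose scores drift upward over time.
calibration = [[0.1 * k + 0.05 * t for t in range(4)] for k in range(5)]
band = calibrate_band(calibration, alpha=0.2)
print(first_alarm([0.1, 0.2, 0.9, 0.9], band))
```

Because the threshold follows the calibration curves over time, a score that would be alarming early in a rollout can be tolerated later, and vice versa, which is the behavior a single fixed threshold cannot provide.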

Across three simulation benchmarks, the gains are most pronounced on the hardest setting. On BEHAVIOR-1K, where tasks average 8,557 steps, Foresight-Transformer outperforms the best baseline by 0.14 in balanced accuracy. Real-robot experiments on a ReactorX-200 arm and a Franka arm, across four tasks and policies including ACT, π₀.5, SmolVLA, and GR00T N1.5, produce ROC-AUC scores between 0.79 and 0.93.
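For readers unfamiliar with the two metrics quoted above, the following sketch computes them from their standard definitions. Balanced accuracy averages the detection rate on failed rollouts and the pass rate on successful ones, so it is not dominated by whichever outcome is more common. ROC-AUC is the probability that a randomly chosen failed rollout receives a higher score than a randomly chosen successful one. The labels and scores below are made-up toy values, not results from the paper.

```python
def balanced_accuracy(labels, preds):
    """Mean of true-positive rate and true-negative rate (1 = failure, 0 = success)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return 0.5 * (tp / positives + tn / negatives)

def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) form of ROC-AUC; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example: two failed rollouts and four successful ones.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_preds = [1, 0, 0, 0, 0, 1]
print(balanced_accuracy(toy_labels, toy_preds))
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```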

The paper is candid about the cost. Pretrained world models carry significant computational weight and latency, making on-device deployment challenging and potentially excluding tasks that require fast closed-loop control. The conformal calibration guarantees also depend on the held-out calibration distribution matching what the robot encounters during deployment, which will not always hold in practice. What the paper does not address is how much data or compute would be needed to recalibrate thresholds as deployment conditions shift over time.

The design choice that matters most for practitioners is that Foresight treats the policy as a black box, requiring no access to policy internals. That makes it, in principle, a drop-in safety monitoring layer that could sit alongside existing deployed robot systems without modification.
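The black-box arrangement can be pictured with a short sketch. Everything here is hypothetical: `BlackBoxMonitor`, `run_episode`, and the toy `score_fn` are illustrative names, and the scoring function stands in for the world-model and Transformer pipeline described earlier. The point is only that the monitor consumes observations and actions and never reads the policy's weights or internal activations.

```python
class BlackBoxMonitor:
    """Watches (observation, action) pairs; knows nothing about the policy itself."""

    def __init__(self, score_fn, thresholds):
        self.score_fn = score_fn          # stand-in for a learned failure scorer
        self.thresholds = list(thresholds)  # per-timestep limits, e.g. from calibration
        self.t = 0
        self.alarm_step = None

    def observe(self, observation, action):
        """Score the current step and return True once an alarm has been raised."""
        score = self.score_fn(observation, action)
        # Reuse the last threshold if the rollout outlives the calibrated horizon.
        limit = self.thresholds[min(self.t, len(self.thresholds) - 1)]
        if self.alarm_step is None and score > limit:
            self.alarm_step = self.t
        self.t += 1
        return self.alarm_step is not None

def run_episode(policy, env_step, first_obs, monitor, max_steps):
    """Run any callable policy and stop as soon as the monitor raises an alarm."""
    obs = first_obs
    for _ in range(max_steps):
        action = policy(obs)
        if monitor.observe(obs, action):
            break
        obs = env_step(obs, action)
    return monitor.alarm_step

# Toy setup: the "state" is a number that drifts upward; the score grows with it.
toy_policy = lambda obs: 1
toy_env_step = lambda obs, action: obs + action
toy_score_fn = lambda obs, action: 0.1 * obs

monitor = BlackBoxMonitor(toy_score_fn, thresholds=[0.35] * 10)
print(run_episode(toy_policy, toy_env_step, 0, monitor, max_steps=10))
```

Swapping `toy_policy` for a different controller requires no change to the monitor, which is what makes this style of layer attractive for systems that are already deployed.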