World Model Hallucinations Linked to Data Coverage Gaps
TL;DR
- Three unsupervised hallucination detectors achieve Spearman correlation of roughly 0.80 with rollout error, needing no labels.
- Coverage-aware training, reweighting data by task rather than by frame count, yields a +0.88 dB rollout PSNR improvement.
- A 350M-parameter pretrained world model adapts to entirely unseen environments with only 50 real trajectories per task, collected through curiosity-driven exploration.
When a world model produces video that looks convincing but contradicts the actual physics of the environment, the standard assumption is that something is wrong with the architecture. A paper published on Hugging Face argues the problem is more tractable than that: hallucination concentrates in low-coverage regions of the state-action space, which means it can be predicted at runtime and fixed through better data rather than architectural redesign.
The researchers identify three distinct failure modes. Perceptual hallucination lives in the tokenizer, which projects out-of-distribution scenes onto familiar training exemplars; an unseen maze layout might be reconstructed with correct agent and goal positions but with the walls of an entirely different layout seen during training. Action marginalization means the dynamics model ignores the input action and collapses onto a plausible-looking but action-agnostic future, making the model behave more like a video generator than a controllable world model. Scene divergence is the compounding error in autoregressive rollouts, producing physically implausible events like a ball teleporting back into play in Pong. The paper introduces three unsupervised runtime predictors that detect each failure mode without labels or additional training, achieving a Spearman correlation of roughly 0.80 against rollout error across all three.
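The reported figure is a rank correlation: it measures whether the runs a detector scores as most suspicious are also the runs with the largest rollout error, regardless of scale. A minimal sketch of that evaluation follows, assuming one detector score and one rollout error per episode. The numbers and function names are illustrative and do not come from the paper.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-episode values, for illustration only.
detector_scores = [0.10, 0.35, 0.20, 0.80, 0.55]
rollout_errors = [1.2, 2.9, 3.1, 9.4, 4.0]

rho = spearman(detector_scores, rollout_errors)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks matter, a detector does not need to be calibrated to the error scale to score well. It only has to order episodes correctly.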
To support the work, the researchers release MMBench2, a dataset of 65,600 trajectories spanning 427 hours of video across 210 continuous-control tasks in 10 domains, with ground-truth action and reward labels and live simulators for every task. It is the kind of broad benchmark that surfaces coverage gaps which smaller evaluations tend to hide.
The two practical fixes follow directly from the diagnosis. Coverage-aware training reweights the existing dataset to sample uniformly across tasks rather than frames; applied to both tokenizer and dynamics model, it yields a rollout PSNR gain of +0.88 dB at no extra compute. For entirely unseen environments, curiosity-driven data collection uses the hallucination predictors as exploration rewards, scores candidate trajectories by predicted hallucination, and executes the highest-ranked one in a live simulator. With just 50 real trajectories per task, this approach adapts a 350-million-parameter pretrained model to new environments, reaching task performance close to what human-collected data achieves.
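Both fixes can be sketched in a few lines. The example below is one plausible reading of the description above, not the authors' implementation. It assumes each trajectory is a `(task_id, num_frames)` pair, gives every task equal total sampling mass, and splits that mass within a task in proportion to frame count. The dataset, the function names, and the hallucination scores are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy dataset: one (task_id, num_frames) pair per trajectory.
trajectories = [("pong", 900), ("pong", 900), ("maze", 100), ("maze", 100)]


def frame_proportional_weights(trajs):
    """Baseline: sampling frames uniformly favors tasks with more frames."""
    total = sum(n for _, n in trajs)
    return [n / total for _, n in trajs]


def task_uniform_weights(trajs):
    """Coverage-aware: each task receives the same total sampling mass."""
    frames_per_task = Counter()
    for task, n in trajs:
        frames_per_task[task] += n
    num_tasks = len(frames_per_task)
    return [n / (frames_per_task[task] * num_tasks) for task, n in trajs]


def task_mass(trajs, weights):
    """Total sampling probability assigned to each task."""
    mass = Counter()
    for (task, _), w in zip(trajs, weights):
        mass[task] += w
    return dict(mass)


baseline = task_mass(trajectories, frame_proportional_weights(trajectories))
reweighted = task_mass(trajectories, task_uniform_weights(trajectories))
print(baseline)    # pong dominates with about 0.9 of the mass
print(reweighted)  # each task gets about 0.5

# Draw a training batch under the coverage-aware weights.
rng = random.Random(0)
batch = rng.choices(trajectories, weights=task_uniform_weights(trajectories), k=8)


def pick_most_hallucinated(candidates, predicted_hallucination):
    """Curiosity-style selection: choose the candidate trajectory with the
    highest predicted hallucination score (the scorer is a stand-in)."""
    return max(candidates, key=predicted_hallucination)


# Made-up scores for three candidate plans in an unseen environment.
scores = {"plan_a": 0.12, "plan_b": 0.71, "plan_c": 0.40}
chosen = pick_most_hallucinated(list(scores), scores.get)
print(chosen)
```

The reweighting changes only how often existing trajectories are drawn, which is consistent with the claim that it adds no compute. In the selection step, the chosen candidate would then be executed in the simulator to gather real data where the model is least reliable.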
The authors are upfront about the limits: all of this was validated at 350 million parameters across simulated control tasks, and whether the findings translate to billion-parameter models or real robot hardware with sensor noise and partial observability is an open empirical question. The reporting likewise offers no evidence from physical robots or from tasks where clean action labels are unavailable. For teams training world models on simulation data today, though, the coverage-aware reweighting recipe and label-free runtime detectors are immediately applicable.
Originally reported by huggingface.co
Original headline: Hallucination in World Models Is Predictable and Preventable — Three Unsupervised Detectors Achieve ρ ≈ 0.80 With Rollout Error, New 65K-Trajectory MMBench2 Released