NeuWorld walks through scenes via fixed-length implicit state
TL;DR
- NeuWorld replaces growing video-latent rollouts with a fixed-length 1024-token Neural Implicit Scene sampled by a diffusion transformer and rendered by a frozen decoder.
- On the Re10K cycle protocol NeuWorld runs a forward-and-return trajectory in 3.24 minutes, about 14× faster than VMem and Gen3C at 47.62 minutes each.
- NIS-VAE and NIS-DiT were trained from scratch on Re10K and DL3DV-10K using 16 A100 GPUs for roughly one week, with no pretrained video backbone.
Camera-controlled world models that roll out video latents keep hitting the same wall: predicting future frames entangles state transition with high-frequency appearance synthesis, and long-horizon consistency degrades as the trajectory grows. A new paper that surfaced on Hugging Face's papers feed, Walking in the Implicit, proposes a different rollout variable. Its authors, from Zhejiang University, Westlake University and Afari Intelligent Drive, build a system called NeuWorld around it.
The idea is to make the rollout state a fixed-length, renderable token set they call a Neural Implicit Scene, or NIS. An NIS-VAE encodes sparse posed views into NIS tokens and decodes target views; an NIS-DiT, a set-based diffusion transformer, samples the next local NIS state under camera and history conditions. Each interaction step is factorized into a generative transition in NIS space and pose-conditioned rendering from the sampled state. The main configuration uses 1024 tokens at 64 channels, with images at 256×256, and both pieces are trained from scratch on the public Re10K and DL3DV-10K posed-view datasets on 16 A100 GPUs for roughly one week. No pretrained video backbone, no auxiliary 3D reconstructor.
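The two-stage step described above can be sketched as a rollout loop. This is a minimal illustration, not the authors' implementation: the function names (`encode_views`, `sample_next_state`, `render`, `rollout`), the pose format, and the way history is passed are all hypothetical, and random numbers stand in for the NIS-VAE and NIS-DiT networks. Only the 1024-token, 64-channel, 256×256 shapes come from the article.

```python
import random

NUM_TOKENS = 1024  # from the article: main configuration
CHANNELS = 64      # from the article
IMAGE_SIZE = 256   # from the article


def encode_views(views, rng):
    """Hypothetical stand-in for the NIS-VAE encoder.

    Maps any number of posed views to a fixed-length token set.
    Here the tokens are random; a real encoder would use the views.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(CHANNELS)]
            for _ in range(NUM_TOKENS)]


def sample_next_state(history, camera_pose, rng):
    """Hypothetical stand-in for the NIS-DiT transition.

    Produces the next state from the latest state plus noise, nudged by
    the camera pose. The output has the same fixed shape as the input.
    """
    previous = history[-1]
    shift = 0.01 * sum(camera_pose)
    return [[0.9 * value + shift + 0.1 * rng.gauss(0.0, 1.0) for value in token]
            for token in previous]


def render(state, camera_pose):
    """Hypothetical stand-in for the decoder: returns only the image shape."""
    return (IMAGE_SIZE, IMAGE_SIZE, 3)


def rollout(initial_views, poses, seed=0):
    """Alternate a transition in state space with pose-conditioned rendering."""
    rng = random.Random(seed)
    history = [encode_views(initial_views, rng)]
    frames = []
    for pose in poses:
        state = sample_next_state(history, pose, rng)  # generative transition
        frames.append(render(state, pose))             # rendering from the state
        history.append(state)
    return history, frames


if __name__ == "__main__":
    history, frames = rollout(["view_a", "view_b"], [(0.0, 0.0, 1.0)] * 3)
    print(len(history), len(history[0]), len(history[0][0]), frames[0])
```

The point of the sketch is the shape invariant: every entry in `history` stays at 1024 × 64 regardless of how many steps have run, whereas a video-latent rollout adds latents for every new frame.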
The number that makes this more than a representation note is the cycle-revisitation cost, which tests whether the model can return to previously visited regions. According to the arXiv version of the paper, NeuWorld runs a forward-and-return Re10K trajectory in 3.24 minutes, about 14× faster than VMem and Gen3C at 47.62 minutes each under the same evaluation runner. On DL3DV the same protocol takes 1.14 minutes, second only to Matrix-Game 2.0, which the authors note is a distilled few-step diffusion model. On image and pose metrics NeuWorld reports the lowest pose errors at Re10K's 200th frame and the best long-horizon translation consistency on DL3DV at the 80th frame, against baselines including VMem, SEVA, Gen3C, ViewCrafter and Matrix-Game 2.0.
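As a quick check on the headline ratio, the arithmetic from the two reported Re10K timings can be reproduced directly. The variable names are illustrative; the two numbers are the ones quoted above.

```python
# Reported Re10K cycle-protocol timings, in minutes (from the article).
neuworld_minutes = 3.24
baseline_minutes = 47.62  # VMem and Gen3C, each

speedup = baseline_minutes / neuworld_minutes
minutes_saved = baseline_minutes - neuworld_minutes

print(f"speedup: {speedup:.1f}x, saved: {minutes_saved:.2f} min")
```

The ratio works out to roughly 14.7, so "about 14×" is a slightly conservative rounding.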
There are several caveats, and the paper flags them. The evaluation is on static scenes only, which the authors describe as an isolation choice for the representation question, not a claim about dynamic worlds. Pose errors come from an external estimator and are labeled pose-consistency proxies rather than direct camera-control measurements. What the reporting does not yet cover is behavior above the 256×256 training crop, or how the memory bank's cost scales over sessions much longer than the 200-frame and 80-frame horizons. If NIS rollout holds up, the interesting bet for interactive world models is that the right unit is neither the next video latent nor a metric reconstruction, but something compact and renderable that sits between them.
Originally reported by huggingface.co
Original headline: BAAI Releases 'Walking in the Implicit': Interactive World Exploration via Neural Scene Representation Enables Real-Time Navigation Through Generated Environments Without Explicit Geometry