huggingface.co web signal

MemLearner teaches video world models to query their own memory

TL;DR

  • A learnable-query mechanism replaces rule-based context frame retrieval so video world models can stay consistent under occlusion and dynamic objects.
  • The main experiments use an internal 1B-parameter text-to-video DiT with 28 layers, of which 5 serve as Query Layers; the method also ports to Wan2.1 (T2V-1.3B).
  • In a 27-user study across 13 scenes, MemLearner was preferred 69.51% of the time on quality and 72.93% on consistency versus baselines including CaM, VMem, FramePack and DFoT.

A team from the University of Hong Kong, Fudan, Zhejiang, and Kuaishou's Kling group has posted a paper on Hugging Face arguing that the memory problem in video world models is best solved by teaching the model to fetch its own context, rather than by handing it a rule for which past frames to look at.

The setup will be familiar if you have followed interactive video generation. Video world models predict future frames from history plus user actions, and they lose scene consistency once the camera pans back to something it saw earlier. Existing fixes rely on rule-based retrieval, picking context frames by field-of-view overlap or by matching point clouds. The authors point out those rules break in the messy cases: an occluding wall between two camera views defeats FOV heuristics, and point-cloud matching cannot cleanly reconstruct a moving object.
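To make the rule-based baseline concrete, here is a minimal sketch of field-of-view retrieval in 2D. It is an illustration only, not any cited system's code: the camera format `(x, y, heading_deg)`, the 90-degree cone, the range limit, and the function names `in_fov`, `fov_overlap` and `retrieve` are all assumptions made for this example.

```python
import math

def in_fov(cam, point, fov_deg=90.0, max_range=10.0):
    """True if a 2D point lies inside the view cone of cam = (x, y, heading_deg)."""
    dx, dy = point[0] - cam[0], point[1] - cam[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_range:
        return False
    angle = math.degrees(math.atan2(dy, dx))
    # Signed angular difference wrapped into [-180, 180).
    diff = (angle - cam[2] + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2

def fov_overlap(cam_a, cam_b, extent=15.0, step=0.5):
    """Intersection-over-union of two view cones, estimated on a grid of sample points."""
    n = int(2 * extent / step) + 1
    inter = union = 0
    for ix in range(n):
        for iy in range(n):
            p = (-extent + ix * step, -extent + iy * step)
            a, b = in_fov(cam_a, p), in_fov(cam_b, p)
            inter += a and b
            union += a or b
    return inter / union if union else 0.0

def retrieve(current_cam, history_cams, k=2):
    """Indices of the k history frames whose cones overlap most with the current one."""
    order = sorted(
        range(len(history_cams)),
        key=lambda i: fov_overlap(current_cam, history_cams[i]),
        reverse=True,
    )
    return order[:k]

current = (0.0, 0.0, 0.0)
history = [(0.0, 0.0, 0.0), (0.0, 0.0, 180.0), (0.0, 0.0, 20.0)]
print(retrieve(current, history, k=2))
```

Note what the score never looks at: scene geometry. Two cameras on opposite sides of a wall that point at the same region get a high overlap even though neither can see what the other saw, which is the occlusion failure the authors describe.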

Their proposal, MemLearner, introduces learnable query tokens (Q) that sit between context tokens (C) and predicted tokens (P) inside the diffusion transformer. Q tokens attend to C tokens to extract what matters for the current frame, and P tokens attend to Q tokens as their generation condition. The choice that seems to do the work is refusing to train a separate query module. When the team tried that alternative, the from-scratch module produced near-zero attention and the DiT collapsed back into a text-to-video model. Reusing the pretrained video DiT itself as the querying network is what lets the mechanism learn end-to-end from the diffusion loss alone.
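The C, Q, P routing described above can be pictured as an attention mask. The sketch below is one reading of that description, not the paper's implementation: the token ordering `[C | Q | P]`, the choice to let each group also attend to itself, the rule that P tokens cannot see C tokens directly, and the names `build_mask` and `masked_attention_weights` are all assumptions. It uses plain dot-product attention with no learned projections.

```python
import math
import random

def build_mask(n_c, n_q, n_p):
    """mask[i][j] is True if token i may attend to token j. Token order: [C | Q | P]."""
    n = n_c + n_q + n_p

    def group(i):
        if i < n_c:
            return "C"
        if i < n_c + n_q:
            return "Q"
        return "P"

    # Assumed routing: Q reads from C, P reads from Q, every group sees itself.
    allowed = {("C", "C"), ("Q", "C"), ("Q", "Q"), ("P", "Q"), ("P", "P")}
    return [[(group(i), group(j)) in allowed for j in range(n)] for i in range(n)]

def masked_attention_weights(tokens, mask):
    """Row-wise softmax of scaled dot-product scores, with masked pairs set to zero weight."""
    d = len(tokens[0])
    weights = []
    for i, q_vec in enumerate(tokens):
        scores = [
            sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j, k_vec in enumerate(tokens)
        ]
        top = max(scores)  # finite, since every token can attend to itself
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

random.seed(0)
n_c, n_q, n_p, dim = 6, 2, 3, 4
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_c + n_q + n_p)]
mask = build_mask(n_c, n_q, n_p)
weights = masked_attention_weights(tokens, mask)

# Under this mask, predicted tokens put no attention mass on context tokens directly.
p_rows = weights[n_c + n_q:]
print(sum(sum(row[:n_c]) for row in p_rows))
```

Under these assumptions the Q tokens act as a bottleneck: whatever the predicted frame learns about history has to pass through them, so the diffusion loss alone decides what gets retrieved.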

The main experiments run on an internal 1B-parameter DiT with 28 layers, five of which serve as Query Layers, at 640 by 352 resolution and 77 frames. Training uses over 20,000 iterations at batch size 8, on a mix of a custom Unreal Engine dataset with staged occlusions and dynamic objects, plus real footage from Sekai and SpatialVID. The paper also ports the method to the open-source Wan2.1 T2V-1.3B model. In a small user study (27 raters, 13 scenes) MemLearner was preferred over baselines including CaM, VMem, FramePack and DFoT, with quality preference at 69.51% and consistency preference at 72.93%.

The honest caveat is that the user study is small, and the internal 1B backbone is not something readers can reproduce. On the open Wan2.1 model the absolute numbers drop noticeably. The paper also does not compare directly against 3D-reconstruction-based memory approaches on the same benchmark, and it reports no wall-clock or memory-footprint numbers for the query-token overhead at long contexts. The bet worth watching is whether the let-the-DiT-be-its-own-memory-module pattern gets picked up by open video stacks; the Wan2.1 result suggests it will port.