BAAI's Orca-4B world model edges Qwen3.5-4B on video tests

TL;DR

  • Orca pretrains on 125K hours of video and 160M event annotations using a single Next-State-Prediction objective on a frozen Qwen3.5 backbone.
  • Orca-4B averages 51.8 across MVBench, TemporalBench, 3DSRBench, and SWITCH, ahead of Qwen3.5-4B's 46.7 at the same size.
  • On a real-robot out-of-distribution test, the paper reports 36.6% for Orca versus 27.6% for π₀.5, despite Orca using no action labels in pre-training.

A new paper out of the Beijing Academy of Artificial Intelligence (BAAI) quietly makes a bigger architectural claim than the usual leaderboard release. Their model, Orca, is trained on a single objective the authors call Next-State-Prediction modeling: one shared latent for what comes next, whether the next thing is a token, a video frame, or a robot action.

The setup is two complementary kinds of learning sitting on a frozen Qwen3.5 backbone. 'Unconscious learning' captures dense natural state transitions from continuous videos, while 'conscious learning' models sparser, language-described events plus VQA supervision. Pre-training data is reported at 125K hours of video and 160M event annotations, with models trained at 0.8B and 4B parameters.
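To make the idea concrete, here is a toy sketch of next-state prediction in a frozen latent space. It is not Orca's architecture or training code: the 2x2 `FROZEN_ENCODER`, the `step` dynamics, and the linear predictor are all invented for illustration. The only thing borrowed from the paper's description is the shape of the problem, where an encoder stays fixed and a predictor learns to map the current latent state to the next one.

```python
import random

random.seed(0)

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Hypothetical frozen encoder: a fixed matrix that is never updated.
FROZEN_ENCODER = [[1.0, 0.5], [-0.5, 1.0]]

def encode(x):
    return matvec(FROZEN_ENCODER, x)

def step(x):
    """Hypothetical toy dynamics: how the raw state evolves in one step."""
    return [0.9 * x[0] - 0.2 * x[1], 0.2 * x[0] + 0.9 * x[1]]

def make_pairs(n):
    """Sample (latent_now, latent_next) pairs from random starting states."""
    pairs = []
    for _ in range(n):
        x = [random.uniform(-1, 1), random.uniform(-1, 1)]
        pairs.append((encode(x), encode(step(x))))
    return pairs

def mse(w, pairs):
    """Mean squared error of the predictor w on next-latent prediction."""
    total = 0.0
    for z, z_next in pairs:
        pred = matvec(w, z)
        total += sum((p - t) ** 2 for p, t in zip(pred, z_next))
    return total / len(pairs)

def train(pairs, lr=0.1, epochs=200):
    """Fit a linear predictor by full-batch gradient descent; the encoder is untouched."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    n = len(pairs)
    for _ in range(epochs):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for z, z_next in pairs:
            pred = matvec(w, z)
            for i in range(2):
                err = pred[i] - z_next[i]
                for j in range(2):
                    grad[i][j] += 2 * err * z[j] / n
        for i in range(2):
            for j in range(2):
                w[i][j] -= lr * grad[i][j]
    return w

pairs = make_pairs(200)
untrained = [[0.0, 0.0], [0.0, 0.0]]
print("loss before training:", mse(untrained, pairs))
print("loss after training: ", mse(train(pairs), pairs))
```

Because the toy dynamics are linear, a linear predictor can fit them almost exactly, so the loss drops to near zero. Real video, text, and action streams are nothing like this; the sketch only shows where the frozen part ends and the learned part begins.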

The numbers, according to the paper page on Hugging Face, are the eyebrow raisers. Orca-4B averages 51.8 across MVBench, TemporalBench, 3DSRBench, and SWITCH, against 46.7 for Qwen3.5-4B at the same size. On the PRICE-V0.1 image-prediction benchmark it lands at 59.8 versus FLUX.2 klein at 56.1. On a real-robot out-of-distribution test it reports 36.6% against π₀.5's 27.6%, despite using no action labels in pre-training.
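For a sense of scale, the reported gaps can be worked out directly. The snippet below is only arithmetic on the figures quoted above; the `REPORTED` table and the `gap` helper are names made up for this example, and the scores themselves remain the authors' own.

```python
# Scores as quoted in the article; all are author-reported, not independently verified.
REPORTED = {
    "4-benchmark average": ("Orca-4B", 51.8, "Qwen3.5-4B", 46.7),
    "PRICE-V0.1": ("Orca", 59.8, "FLUX.2 klein", 56.1),
    "Real-robot OOD (%)": ("Orca", 36.6, "π₀.5", 27.6),
}

def gap(score, baseline):
    """Return (absolute gap in points, relative gap as a percent of the baseline)."""
    absolute = round(score - baseline, 1)
    relative = round(100 * (score - baseline) / baseline, 1)
    return absolute, relative

for name, (model, score, base_model, base_score) in REPORTED.items():
    absolute, relative = gap(score, base_score)
    print(f"{name}: {model} {score} vs {base_model} {base_score} "
          f"-> +{absolute} points ({relative}% relative)")
```

The gaps come out to 5.1 points on the four-benchmark average, 3.7 points on PRICE-V0.1, and 9.0 percentage points on the robot test, or roughly 10.9%, 6.6%, and 32.6% relative to each baseline. The robot result is the largest in relative terms, which is also where independent reproduction would matter most.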

The honest caveat is that almost all of this is single-sourced from the authors. PRICE-V0.1 and SWITCH are unfamiliar enough that the headline gaps should be read as reported, not settled, until someone outside BAAI reproduces them. The paper itself concedes that the model used only about a tenth of the collected data, that supervision sits in a frozen ViT space rather than coming from native multi-source learning, and that scale is capped at 4B for resource reasons. None of that is fatal, but it does mean the scaling story is still early.

What I would actually watch is whether the weights ship, and whether one objective for video, image, and action is something a small lab can pick up and finetune. If it is, the case for stitching three specialised stacks together for video understanding, image prediction, and robot control starts to weaken.