
Video-MME-Logical: 500K Samples Fail to Close MLLM Reasoning Gap

TL;DR

  • Video-MME-Logical organizes 25 task categories around five operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition.
  • Supervised fine-tuning on up to 500K generated samples improved scores but did not close the human-model gap on video temporal-logical reasoning.
  • The benchmark supports intermediate-state diagnostics, checking whether a model recovers the required logical reasoning trace before producing its final answer.

A new arXiv preprint asks whether video multimodal LLMs are actually reasoning across frames or just doing competent single-frame recognition with a temporal coat of paint. The answer, according to the paper, is that current models still trail humans by a substantial margin on what the authors call video temporal-logical reasoning, and throwing more supervised fine-tuning at the problem does not close the gap.

The benchmark, Video-MME-Logical, is built around five operations the authors argue a model needs to reason over evolving visual evidence: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. These expand into 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. The point of that controlled construction is to separate genuine cross-frame reasoning from the static recognition and uncontrolled temporal variation that, as the abstract puts it, existing video benchmarks tend to conflate with reasoning.
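To make "controlled construction" concrete, here is a minimal sketch of what a generated state-tracking item could look like. It is not the authors' generator: the `Sample` class, the `make_state_tracking_sample` function, the open/closed box setup, and the question wording are all assumptions made for illustration. The point is only that when every transition is scripted, the ground-truth state after each step is known exactly.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    events: list   # one (object_name, new_state) transition per step
    trace: list    # ground-truth state of every object after each step
    question: str
    answer: str


def make_state_tracking_sample(num_objects=3, num_steps=5, seed=0):
    """Hypothetical generator: boxes toggle between 'closed' and 'open'."""
    rng = random.Random(seed)  # seeded so the same item can be regenerated
    objects = [f"box_{i}" for i in range(num_objects)]
    current = {name: "closed" for name in objects}
    events, trace = [], []
    for _ in range(num_steps):
        obj = rng.choice(objects)
        new_state = "open" if current[obj] == "closed" else "closed"
        current[obj] = new_state
        events.append((obj, new_state))
        trace.append(dict(current))  # snapshot, not a live reference
    target = rng.choice(objects)
    question = f"What is the final state of {target}?"
    return Sample(events, trace, question, current[target])


sample = make_state_tracking_sample(seed=1)
print(sample.question, "->", sample.answer)
```

A single final frame cannot answer the question for a box that is out of view at the end, which is the kind of separation from static recognition the benchmark is described as aiming for.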

The headline result is the one worth sitting with. Supervised fine-tuning on up to 500K generated samples improves performance but, in the authors' words, remains insufficient to close the reasoning gap, and the gap widens as temporal-logical complexity increases. If half a million tailored examples don't close it, the natural read is that the bottleneck isn't data volume but something about how these models maintain, update, and compose evidence as visual states evolve across frames. The honest caveat is that the abstract does not name which MLLMs were tested, does not report numerical scores, and does not give a human baseline, so the phrase 'state-of-the-art' is doing a lot of work here and the picture could look different across model families.

What makes the contribution useful even with that caveat is the diagnostic angle. The benchmark supports intermediate-state evaluation: it verifies whether a model recovers the required logical reasoning trace before producing the final answer, not just whether the answer is right. If other groups pick this up and run it against named frontier systems, it could become a cleaner way to tell real architectural progress on video reasoning apart from gains bought with yet more synthetic data.
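The difference between answer-only and trace-aware scoring is easy to show in code. The sketch below is an assumption about how such a check might be written, not the benchmark's actual metric: `score_prediction` and its field names are hypothetical, and traces are modeled as lists of per-step state dictionaries.

```python
def score_prediction(pred_trace, pred_answer, gold_trace, gold_answer):
    """Hypothetical scorer: grade the final answer and the intermediate states."""
    answer_correct = pred_answer == gold_answer
    # Count steps where the predicted state matches the ground truth.
    matched = sum(p == g for p, g in zip(pred_trace, gold_trace))
    trace_accuracy = matched / len(gold_trace) if gold_trace else 1.0
    trace_correct = (
        len(pred_trace) == len(gold_trace) and matched == len(gold_trace)
    )
    return {
        "answer_correct": answer_correct,
        "trace_accuracy": trace_accuracy,
        "trace_correct": trace_correct,
        # Right final answer reached through a wrong intermediate state.
        "right_for_wrong_reason": answer_correct and not trace_correct,
    }


gold_trace = [{"box": "open"}, {"box": "closed"}, {"box": "open"}]
pred_trace = [{"box": "open"}, {"box": "open"}, {"box": "open"}]
result = score_prediction(pred_trace, "open", gold_trace, "open")
print(result["answer_correct"], result["right_for_wrong_reason"])  # True True
```

In this toy case the model lands on the correct final state while missing the middle transition. An answer-only metric would count that as a success, while a trace-aware one flags it.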