huggingface.co web signal

AMVL from SJTU and Ant reports a +10.83 average gain on BLINK

TL;DR

  • Shanghai Jiao Tong and Ant Group's AMVL swaps discrete chain-of-thought for eight continuous latent slots between the prompt and target answer.
  • On Qwen2.5-VL-7B-Instruct, the method reports a +10.83 average gain on BLINK and gains of up to +32.00 on the topology-heavy Jigsaw task.
  • A new reverse KL term regularizes the training-time posterior against 'answer leakage', which the paper formalizes as prior contamination.

There is a quiet argument in multimodal AI about whether the 'think step by step' trick that works so well for text is actually the right tool for visual reasoning. A new paper from Shanghai Jiao Tong University and Ant Group, posted to Hugging Face, makes a specific case for the answer being no, and it is worth reading if you follow how MLLMs are being trained.

The method they propose, Asymmetric Mutual Variational Learning (AMVL), keeps the reasoning out of language entirely. Rather than emitting discrete chain-of-thought tokens between the image and the answer, the model reasons in a small block of continuous latent slots (they use eight) sitting between the prompt and the target. With AMVL layered on top of Qwen2.5-VL-7B-Instruct, the authors report a +10.83 improvement in average score on the BLINK reasoning benchmark, with individual gains of up to +32.00 on the topology-heavy Jigsaw task. They also evaluate on V*, HRBench4K and HRBench8K, and use VisualPuzzles as their out-of-distribution check.
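To make the layout concrete, here is a minimal sketch of where the eight slots would sit in a single training sequence. Only the slot count and the prompt / latent / answer ordering come from the description above. The function names and the idea that a token-level loss applies to the answer positions only are assumptions made for illustration, not details confirmed by the paper.

```python
NUM_LATENT_SLOTS = 8  # slot count stated in the article


def layout(prompt_len, answer_len, num_slots=NUM_LATENT_SLOTS):
    """Return index ranges for prompt, latent slots, and answer (hypothetical helper)."""
    prompt = range(0, prompt_len)
    latent = range(prompt_len, prompt_len + num_slots)
    answer = range(prompt_len + num_slots, prompt_len + num_slots + answer_len)
    return prompt, latent, answer


def token_loss_mask(prompt_len, answer_len, num_slots=NUM_LATENT_SLOTS):
    """1 where a token-level loss would apply, else 0.

    Assumption: latent slots are continuous vectors with no token targets,
    so only the answer positions are marked.
    """
    return [0] * (prompt_len + num_slots) + [1] * answer_len


prompt, latent, answer = layout(prompt_len=5, answer_len=3)
print(list(latent))           # the eight positions between prompt and answer
print(token_loss_mask(5, 3))  # zeros over prompt and slots, ones over the answer
```

However long the prompt is, the reasoning block stays at a fixed eight positions, which is the property the latency argument later in this piece relies on.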

The paper spends real theoretical effort on a problem it calls answer leakage. Standard variational training for latent reasoners lets the training-time posterior peek at the ground-truth answer, so the prior it learns to imitate is contaminated by information it will never have at inference time. AMVL adds a reverse KL term on top of the usual forward KL, so the prior and the posterior calibrate each other rather than the prior chasing an oracle it cannot reproduce at test time. The authors formalize this as prior contamination and argue their dual-KL objective provably reduces it.
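As a rough illustration of what a dual-KL term looks like, the sketch below computes both directions of KL divergence between two diagonal Gaussians. Several things here are assumptions rather than details from the paper: that the latents are modeled as diagonal Gaussians, that "forward" means KL(posterior || prior) as in a standard ELBO, that "reverse" means KL(prior || posterior), and that the two terms are combined with the hypothetical weights `forward_weight` and `reverse_weight`. The paper's actual objective may differ.

```python
import math


def kl_diag_gaussian(mu_a, var_a, mu_b, var_b):
    """KL(N(mu_a, var_a) || N(mu_b, var_b)) for diagonal Gaussians, summed over dims."""
    total = 0.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        total += 0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    return total


def dual_kl(post_mu, post_var, prior_mu, prior_var,
            forward_weight=1.0, reverse_weight=1.0):
    """Weighted sum of both KL directions (hypothetical weighting, not the paper's)."""
    # Assumed "forward" direction: posterior (sees the answer) against the prior.
    forward = kl_diag_gaussian(post_mu, post_var, prior_mu, prior_var)
    # Assumed "reverse" direction: prior (no answer) against the posterior.
    reverse = kl_diag_gaussian(prior_mu, prior_var, post_mu, post_var)
    return forward_weight * forward + reverse_weight * reverse, forward, reverse


# Toy example: a sharp posterior against a broader prior, one latent dimension.
total, forward, reverse = dual_kl([0.0], [1.0], [0.0], [4.0])
print(forward, reverse, total)
```

The two directions are not equal in general, so adding the second term changes the objective and does not simply double the first. In this toy case the assumed reverse term is larger than the forward one, since it penalizes the broad distribution for putting mass where the sharp one has little.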

The honest caveat is the usual one for a single-lab result. Everything is at 7B on a fixed base model, the strongest headline number comes from a single subtask, and the paper is upfront that variance-level leakage and scaling beyond 7B are left to future work. Take the specifics as reported, not settled. The reporting also gives no measurement of wall-clock inference cost against discrete-reasoning baselines like Vision-R1 or DeepEyes, and does not say whether AMVL degrades plain vision tasks that never needed a reasoning head in the first place.

If the result holds up in a second lab, the interesting downstream effect is that a small continuous latent block should be a much cheaper reasoning surface than a long chain of emitted tokens, which is exactly the cost anyone shipping vision models under a latency budget is trying to cut.