V-Zero Trains Visual Reasoning Without Labels, 10x Faster Than RL
TL;DR
- V-Zero improves a Qwen3.5-4B model by 3.1 points on average across fine-grained visual benchmarks without any annotated answer labels.
- Training runs more than 5x faster than supervised fine-tuning and more than 10x faster than reinforcement learning baselines.
- The method uses paired positive and negative image crops only during training; inference requires just the standard full image.
The expensive part of teaching a multimodal model to reason carefully about images has always been the labels -- either large sets of annotated reasoning traces for supervised fine-tuning, or the costly exploration and reward signals that reinforcement learning requires. A paper posted on Hugging Face proposes V-Zero, a training framework that skips both, using only pairs of image crops -- one focused on the question-relevant region, one drawn from an irrelevant area -- to decide which of the model's own reasoning attempts are worth learning from.
The mechanism is called contrastive evidence gating. During training, a teacher model replays each student-generated reasoning trajectory twice: once with a crop centered on the relevant image region, and once with an equal-size crop sampled from a downsampled irrelevant part of the same image. If a trajectory receives stronger teacher support under the relevant crop than under the irrelevant one, it earns a higher weight in the distillation loss: the visual evidence gate amplifies well-grounded rollouts and suppresses drifting ones. The student never sees these crops at test time; it still processes just the full image, so inference is unchanged.
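As a rough illustration of the gating idea, and not the paper's actual formula, the sketch below assumes the teacher returns per-token log-probabilities for the same student trajectory under each crop, and that the gate is a sigmoid of the mean log-probability gap. The name `evidence_gate`, the `temperature` parameter, and the sigmoid form are all hypothetical choices made for this example.

```python
import math

def evidence_gate(logp_relevant, logp_irrelevant, temperature=1.0):
    """Hypothetical gate in (0, 1): larger when the teacher supports the
    trajectory more under the relevant crop than under the irrelevant one."""
    if not logp_relevant or len(logp_relevant) != len(logp_irrelevant):
        raise ValueError("need two non-empty, equal-length log-prob lists")
    n = len(logp_relevant)
    # Mean per-token gap in teacher log-probability between the two views.
    gap = (sum(logp_relevant) - sum(logp_irrelevant)) / n
    # Assumed squashing function: a sigmoid turns the gap into a weight.
    return 1.0 / (1.0 + math.exp(-gap / temperature))

# Made-up teacher log-probs for two three-token trajectories.
grounded = evidence_gate([-0.2, -0.1, -0.3], [-1.5, -1.2, -1.8])
drifting = evidence_gate([-1.0, -1.1, -0.9], [-1.0, -1.0, -1.0])
print(grounded > drifting)
```

With these made-up numbers, the trajectory the teacher prefers under the relevant crop gets a weight well above 0.5, while the one the teacher scores equally under both crops sits at about 0.5.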
This framing addresses a known limitation of standard on-policy distillation: it provides dense token-level corrections but cannot assess whether a full reasoning trajectory is heading toward a correct answer. The contrastive gate adds that trajectory-level signal without requiring any textual answer labels or predefined reward rules.
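One way to picture how a trajectory-level weight could combine with token-level corrections is sketched below. It assumes a per-token KL divergence from student to teacher, averaged over the trajectory and scaled by a scalar gate. The divergence direction, the averaging, and the function names are assumptions for illustration; the source does not specify the exact loss.

```python
import math

def token_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position.
    Assumes teacher probabilities are strictly positive."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )

def gated_distillation_loss(student_dists, teacher_dists, gate):
    """Dense token-level signal (mean per-token KL) scaled by a
    trajectory-level gate weight (hypothetical combination)."""
    per_token = [token_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return gate * sum(per_token) / len(per_token)

# Toy three-word vocabulary, two token positions (made-up numbers).
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]

well_grounded = gated_distillation_loss(student, teacher, gate=0.8)
drifting = gated_distillation_loss(student, teacher, gate=0.2)
print(well_grounded > drifting)
```

The token-level term is identical in both calls; only the gate changes how strongly the trajectory contributes to the update.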
Experiments use Qwen3.5-4B as the student and Qwen3.5-27B as the teacher. Training uses 23K curated samples from a prior dataset, and evaluation covers benchmarks including VStar, HR-Bench at 4K and 8K resolution, ZoomBench, and MMStar. V-Zero improves the 4B backbone by an average of 3.1 points, with gains of +4.7 on VStar, +3.4 on HR-4K, +2.0 on HR-8K, and +5.5 on ZoomBench. According to the paper, this comes at more than 5x lower training cost than supervised fine-tuning methods and more than 10x lower than reinforcement learning baselines. The paper itself notes these comparisons are conservative, since the SFT and RL baselines it compares against were trained on H100 GPUs while V-Zero used RTX PRO 6000 hardware with weaker BF16 throughput.
What the paper does not fully account for is the upstream effort behind those 23K training examples: each contains a question-relevant regional crop that V-Zero uses as its positive view, and that annotation still has to come from somewhere. For teams that already have structured image-question datasets with region annotations, V-Zero offers a lighter path to fine-grained visual reasoning than RL. The code and dataset are due for public release at the project's GitHub repository.
Originally reported by huggingface.co
Original headline: V-Zero: Answer-Label-Free On-Policy Distillation With Contrastive Evidence Gating Improves Fine-Grained Visual Reasoning at 5× Lower Training Cost Than SFT and 10× Faster Than RL Baselines