PixelEyes targets VLM 'inattentional blindness' in visual search
TL;DR
- PixelEyes decouples a reasoner from a mask-guided perception tool, cutting the redundant search loops that bloat VLM multi-turn trajectories.
- On the V* benchmark, PixelEyes-8B reaches 94.24 percent versus 86.39 percent for the Qwen-3-VL-8B baseline it is built on.
- The authors introduce Pinpoint-Bench, a 433-sample zero-hint visual search benchmark that separates localization failures from recognition failures.
A new paper on arXiv makes a small but useful move for anyone building vision agents. It names a specific failure mode in multimodal LLMs, offers a benchmark designed to distinguish it from other failures, and then shows an architectural fix that adds 7.85 points on V* and 19.81 points on VisualProbe Hard on top of a strong open baseline.
The failure mode is what the authors call inattentional blindness: the agent successfully looks at the right region of an image but still fails to recognize or use what it saw. Their diagnostic is the gap between two numbers, a Localization Success Rate and a task accuracy. If the model is finding the target but not answering correctly, you have a perception bottleneck rather than a reasoning one. On their Pinpoint-Bench, a 433-sample zero-hint visual search benchmark on high-resolution images, that gap is measurable.
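To make the diagnostic concrete, here is a minimal sketch of how such a gap could be computed from per-sample logs. It assumes each sample is reduced to two booleans (did the agent localize the target, and was the final answer correct). The names `SearchResult` and `diagnose` are hypothetical, and the paper's exact definition of Localization Success Rate may differ from this simple ratio.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class SearchResult:
    """One benchmark sample, reduced to two booleans (an assumed format)."""
    localized: bool  # the agent's final view covered the target region
    correct: bool    # the final answer matched the ground truth


def diagnose(results: Sequence[SearchResult]) -> Dict[str, float]:
    """Compute localization rate, accuracy, and the gap between them."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    lsr = sum(r.localized for r in results) / n
    accuracy = sum(r.correct for r in results) / n
    # Samples where the target was found but the answer was still wrong.
    found_but_wrong = sum(r.localized and not r.correct for r in results) / n
    return {
        "lsr": lsr,
        "accuracy": accuracy,
        "gap": lsr - accuracy,
        "found_but_wrong": found_but_wrong,
    }
```

The `gap` and `found_but_wrong` values coincide only when no sample is answered correctly without localization. Reporting both helps separate lucky guesses from genuine found-but-missed cases.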
The fix, called PixelEyes, is architectural rather than prompt-level. Instead of asking one model to reason about the question and localize the evidence at the same time, the paper splits the two: a reasoner decides what to look for, and a mask-guided perception tool built on referring segmentation decides where. A breadth-first search over semantic regions keeps the agent from re-cropping the same wrong sub-image over and over. Applied on top of Qwen-3-VL, the reported results are large. PixelEyes-8B reaches 94.24 percent on V*, against an 86.39 percent baseline, a gain of 7.85 points, and PixelEyes-4B jumps 19.81 points on VisualProbe Hard. Even the 4B system is reported to outperform the larger 8B baseline.
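The search idea can be illustrated with a generic breadth-first traversal that never revisits a region. This is a sketch of the general technique, not the paper's implementation. `propose` is a hypothetical stand-in for the perception tool (it returns candidate sub-regions), `is_target` is a hypothetical stand-in for the reasoner's check, and `split_quadrants` is a toy proposer used only for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def bfs_region_search(
    root: Box,
    propose: Callable[[Box], Iterable[Box]],
    is_target: Callable[[Box], bool],
    max_steps: int = 50,
) -> Optional[Box]:
    """Breadth-first search over regions, skipping any region already seen."""
    queue = deque([root])
    visited = {root}
    steps = 0
    while queue and steps < max_steps:
        box = queue.popleft()
        steps += 1
        if is_target(box):
            return box
        for child in propose(box):
            # The visited set is what prevents re-cropping the same region.
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return None


def split_quadrants(box: Box) -> List[Box]:
    """Toy proposer: split a box into four quadrants (illustration only)."""
    x0, y0, x1, y1 = box
    if x1 - x0 < 2 or y1 - y0 < 2:
        return []
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
```

Calling `bfs_region_search((0, 0, 8, 8), split_quadrants, lambda b: b == (4, 4, 6, 6))` explores coarse regions before fine ones and stops once the target box is reached. In a real system the proposer would return segmentation-derived regions rather than fixed quadrants.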
Why a practitioner should care: the implicit story of the last year has been that bigger multimodal models close visual gaps automatically. A result like this argues the opposite, that for evidence-seeking tasks a smaller, well-plumbed system can beat a bigger monolith. If you are running a UI agent or a document reader in production, the shape of the architecture is copyable.
The honest caveats are that the numbers are self-reported and the diagnostic benchmark is only 433 samples, which is small for a claim about failure modes. What the reporting does not give you is the latency picture, whether a separate segmentation tool per turn is affordable at scale, or how the decoupled setup holds up on video or streaming input. Those are the parts to watch as other teams try to reproduce the recipe.
Original headline: PixelEyes (ECCV 2026): VLMs Suffer Inattentional Blindness in Visual Search — Decoupled Perception Tower Fixes It