paper web signal June 26th 2026

PerceptionRubrics Audit Pins Open-vs-Closed Vision Gap at 8%

TL;DR

PerceptionRubrics pairs 1,038 information-dense images with over 12,000 atomic rubrics split into 4,232 Must-Right and 7,772 Easy-Wrong criteria.
Across 25 evaluated models, a persistent 8% perception deficit separates open-source frontier systems like Qwen3.5 from proprietary leaders such as Seed-2.0.
A gated scoring mechanism applies sharp binary penalties when a model misses a mandatory visual fact, rather than averaging errors away.

Standard multimodal benchmarks have been telling a story where open-source vision-language models have basically caught up to the proprietary frontier on reasoning-heavy tasks. A new arXiv paper called PerceptionRubrics argues the story breaks down once you audit perception one atomic fact at a time, and pins the remaining gap at roughly 8%.

The setup is worth understanding, because it is what earns the number. Instead of scoring a caption as a whole against a reference, the authors pair 1,038 information-dense images with over 12,000 instance-specific rubrics: 4,232 "Must-Right" essential facts that a caption has to nail, and 7,772 "Easy-Wrong" fine-grained details. A gated scoring mechanism then applies a sharp binary penalty when a model misses a mandatory visual fact, rather than letting that miss average away against the things the model got right. The rubrics are distilled from golden captions built through what the paper calls a Circular Peer-Review consensus pipeline.
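To make the difference between gated and averaged scoring concrete, here is a minimal sketch. The exact formula the paper uses is not given in this summary, so everything below is an assumption made for illustration: the `RubricResult` type, the function names, and the choice to zero the score when any Must-Right rubric fails are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    # Hypothetical record for one atomic rubric judged against a caption.
    kind: str      # "must_right" or "easy_wrong"
    passed: bool


def averaged_score(results: List[RubricResult]) -> float:
    """Plain averaging: every rubric counts equally, so one miss is diluted."""
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: any failed Must-Right rubric zeroes the whole score.

    If the gate is passed, the score is the fraction of Easy-Wrong
    rubrics the caption satisfied (1.0 if there are none).
    """
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if any(not r.passed for r in must):
        return 0.0  # binary penalty: a mandatory fact was missed
    if not easy:
        return 1.0
    return sum(r.passed for r in easy) / len(easy)


# One caption: misses one of two mandatory facts, gets all three details right.
example = [
    RubricResult("must_right", True),
    RubricResult("must_right", False),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
]

print(averaged_score(example))  # 4 of 5 rubrics passed
print(gated_score(example))     # gate closed by the missed mandatory fact
```

Under these assumptions the same caption scores 0.8 when averaged and 0.0 when gated, which is the kind of miss the gated design is meant to surface.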

Across 25 evaluated models, a persistent 8% perception deficit separates the open-source frontier, with Qwen3.5 as the example the paper uses, from proprietary leaders such as Seed-2.0. The authors frame the pattern as a Reliability Gap: models often verify fragmented elements correctly in isolation but fail when a scene demands they get several fine visual facts right at once, which the paper says shows up sharply in information-dense domains like GUIs.
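Simple arithmetic shows why getting facts right in isolation and getting them all right at once can diverge. The sketch below assumes, purely for illustration, that each fine visual fact is verified independently with the same accuracy; neither the independence assumption nor the 95% figure comes from the paper, and `joint_accuracy` is a hypothetical helper.

```python
def joint_accuracy(per_fact_accuracy: float, num_facts: int) -> float:
    """Probability that all facts are right, assuming independent errors."""
    return per_fact_accuracy ** num_facts


# Illustrative only: 95% per-fact accuracy across scenes of growing density.
for n in (1, 5, 10, 20):
    print(n, round(joint_accuracy(0.95, n), 3))
```

With these made-up numbers, a model that is right 95% of the time on any single fact gets all ten facts in a dense scene right only about 60% of the time, so small per-fact differences compound quickly as scenes get denser.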

The honest caveat is that this is one paper from one group, and how well any rubric benchmark generalizes past its own image and prompt distribution is always the open question with new evaluation frameworks. The material available here also does not detail how the 25 models were prompted or whether the deficit holds evenly across image types. Still, if the read is right, it suggests that some of the "open catches closed" headline from the last year of leaderboards has been driven by reasoning improvements, while raw visual precision quietly stays a proprietary advantage. That is exactly the corner practitioners deploying open VLMs on dense visuals should probe themselves before they trust benchmark parity.

Originally reported by paper


Original headline: Atomic Rubric Audit Reveals Persistent 8% Open-vs-Closed Vision Gap That Reasoning Scores Miss