Paper: VLMs Overestimate Common Ground in Asymmetric Dialogue
TL;DR
- A SIGDIAL 2026 paper tests five VLMs on an interpretation-matching task built from 13,077 annotated reference expressions in HCRC MapTask dialogues.
- The models conflate what partners could share with what has actually been grounded. The bias is clearest in Qwen3-VL-8B-Instruct but appears, to varying degrees, in all five models, which span two architecture families.
- Providing authentic map images improves overall performance but shifts the models further toward over-predicting alignment between speakers.
A new arXiv paper from Nan Li, Albert Gatt and Massimo Poesio digs into a subtle failure mode in vision-language models (VLMs), and it matters more than the average VLM benchmark result. When two people are working from different maps and talking through a task, a good listener has to track what has actually been established between them, not just what is in principle visible on either side. The paper's claim is that current VLMs do not really do this. They treat what could be shared as if it had been shared.
The paper, accepted to SIGDIAL 2026, uses an interpretation-matching task built on 13,077 annotated reference expressions from the HCRC MapTask dialogues, a long-standing corpus of asymmetric map-navigation conversations. Five models are tested across two architecture families. The bias shows up most clearly in Qwen3-VL-8B-Instruct but appears, to varying degrees, in the four other models the authors evaluate.
One counter-intuitive result: giving the model the actual map images improves overall performance, but pushes it further toward over-predicting alignment. The bias holds whether the map content is delivered visually or textually, which suggests the models are treating map content as evidence of mutual understanding rather than tracking how grounding is built up turn by turn. The authors describe this as relying on static referential cues on the maps rather than following dialogue history.
The honest caveat is that this is one paper, on one task family, using five specific open models. The frontier hosted systems are not in the comparison, and the paper does not tell you whether targeted fine-tuning or a more grounding-aware prompt fixes the behavior. What the paper gives you is a diagnostic, not a fix.
For anyone building assistants that work alongside people, or multi-agent systems that need to stay coordinated, the useful thing here is exactly that diagnostic. If your VLM is confidently answering questions about a partner's map before the conversation has actually established that content, that is a specific, testable failure, and MapTask-style asymmetric dialogue is a reasonable place to start looking.
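One way to make that check concrete is to compare how often a model predicts that two speakers share an interpretation with how often annotators say they do. The sketch below is illustrative only and is not the paper's evaluation code: it assumes each test item can be reduced to a binary aligned / not-aligned label, and the names `Item` and `alignment_bias` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Assumption: each item reduces to a binary label. The paper's own
    # annotation scheme may be richer than this.
    gold_aligned: bool       # annotators: both speakers share the interpretation
    predicted_aligned: bool  # model: both speakers share the interpretation


def alignment_bias(items):
    """Summarize how far a model leans toward predicting alignment."""
    if not items:
        raise ValueError("need at least one item")
    n = len(items)
    gold_rate = sum(i.gold_aligned for i in items) / n
    pred_rate = sum(i.predicted_aligned for i in items) / n
    accuracy = sum(i.gold_aligned == i.predicted_aligned for i in items) / n
    # False-alarm rate: of the items that are NOT aligned, how many the
    # model nevertheless labels as aligned.
    negatives = [i for i in items if not i.gold_aligned]
    false_alarm = (
        sum(i.predicted_aligned for i in negatives) / len(negatives)
        if negatives
        else None
    )
    return {
        "accuracy": accuracy,
        "gold_aligned_rate": gold_rate,
        "predicted_aligned_rate": pred_rate,
        # Positive values mean the model predicts alignment more often
        # than the annotations support.
        "over_prediction": pred_rate - gold_rate,
        "false_alarm_rate": false_alarm,
    }


# Toy data, invented for illustration; these are not results from the paper.
toy = [
    Item(gold_aligned=True, predicted_aligned=True),
    Item(gold_aligned=False, predicted_aligned=True),
    Item(gold_aligned=False, predicted_aligned=False),
    Item(gold_aligned=True, predicted_aligned=True),
]
report = alignment_bias(toy)
```

Reporting accuracy alongside the over-prediction gap matters because the two can move in different directions: a change such as adding map images could raise accuracy while also widening the gap.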
Original headline: VLMs Systematically Overpredict Shared Understanding in Dialogue, Study Finds