ACL 2026: Chain-of-Thought Hurts Multimodal Visual Grounding

TL;DR

  • A study across 12 tasks, 14 non-reasoning models, and 8 reasoning models finds CoT actively degrades visual grounding and object counting performance.
  • CoT remains genuinely helpful for mathematical, scientific, and multi-image reasoning but is described as 'not a free lunch' for perception tasks.
  • Existing multimodal reasoning models show only marginal overall improvements, possibly because training overemphasizes mathematical reasoning.

Chain-of-thought reasoning has become nearly automatic in modern multimodal deployments, but a paper accepted at ACL 2026 argues that this default is wrong for a meaningful class of tasks. Across an evaluation spanning 12 tasks, 14 non-reasoning models, and 8 reasoning models, the researchers found that CoT creates "undesirable side effects, such as reduced performance in visual grounding and object counting." These are the pixel-level perception tasks that many real-world pipelines depend on.

The paper's title, "Look Light, Think Heavy," names the mechanism the authors identify. During reasoning, "visual reflection consistently diminishes" even as verbal reasoning ramps up. The model's extended chain of thought displaces, rather than supplements, close visual attention. For tasks like locating objects in an image or counting items in a scene, that is a functional regression, not an acceptable trade-off.

CoT does earn its place on tasks requiring mathematical, scientific, or multi-image reasoning, where the paper finds it genuinely helps. The problem the authors diagnose is not CoT itself but applying it uniformly without regard to whether a task is primarily perceptual or inferential.

Two caveats apply. First, the paper notes that existing multimodal reasoning models show only "marginal overall improvements, possibly due to an overemphasis on mathematical reasoning," which raises the question of whether training data choices are compounding the problem. Second, scope: what the paper does not give you is precise numbers for how large the performance drops are on specific benchmarks, or whether the same degradation applies to closed frontier models not included in the study.

For teams building multimodal pipelines, the practical direction is task-aware routing: CoT on demand for reasoning-heavy subtasks, bypassed for perceptual ones. That shift alone could recover accuracy on visual grounding and counting without requiring any model changes.
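
As a rough illustration of task-aware routing, the sketch below picks a prompt style from a task label. It is not taken from the paper: the task labels, the prompt suffixes, and the names `route_prompt` and `RoutedPrompt` are all hypothetical, and the perceptual/reasoning split simply mirrors the categories named above. A real pipeline would also need some way to classify incoming requests, which is assumed to exist here.

```python
from dataclasses import dataclass

# Hypothetical task labels, grouped the way the article describes:
# perceptual tasks skip CoT, reasoning-heavy tasks use it.
PERCEPTUAL_TASKS = {"visual_grounding", "object_counting"}
REASONING_TASKS = {"math", "science", "multi_image"}

# Illustrative prompt suffixes (assumptions, not prompts from the paper).
COT_SUFFIX = "Think step by step before giving the final answer."
DIRECT_SUFFIX = "Answer directly, without explaining your reasoning."


@dataclass(frozen=True)
class RoutedPrompt:
    task_type: str
    use_cot: bool
    prompt: str


def route_prompt(task_type: str, question: str,
                 default_use_cot: bool = False) -> RoutedPrompt:
    """Choose CoT or direct prompting based on the task label."""
    if task_type in PERCEPTUAL_TASKS:
        use_cot = False
    elif task_type in REASONING_TASKS:
        use_cot = True
    else:
        # Unknown task types fall back to a caller-chosen default.
        use_cot = default_use_cot
    suffix = COT_SUFFIX if use_cot else DIRECT_SUFFIX
    return RoutedPrompt(task_type, use_cot, f"{question}\n{suffix}")


if __name__ == "__main__":
    for task, q in [
        ("object_counting", "How many chairs are in the image?"),
        ("math", "What is the area of the shaded triangle?"),
    ]:
        routed = route_prompt(task, q)
        print(routed.task_type, "-> CoT" if routed.use_cot else "-> direct")
```

The routing decision is kept separate from the model call, so the same function could sit in front of any multimodal model. Whether bypassing CoT actually recovers accuracy in a given pipeline would still need to be measured on that pipeline's own data.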
