reddit.com via Reddit

Claude Opus 4.8 ranks last on EyeBench vision test

anthropic computer vision multimodal model-benchmarks computer-vision

Key insights

  • EyeBench-V3 strips language scaffolding from visual tasks, exposing spatial perception gaps that standard multimodal benchmarks consistently hide.
  • Claude Opus 4.7 scored 16% on EyeBench, Anthropic's best result to date, while GPT-5.5 and Gemini 3.5 Flash scored substantially higher.
  • Opus 4.8 continued the same pattern, suggesting Anthropic's visual perception gap is structural across multiple model generations.

Why this matters

Multimodal benchmarks have consistently favored captioned tasks where language ability masks weak visual reasoning, so EyeBench-V3 provides the first systematic public evidence that Anthropic's vision gap is both real and durable across model generations. Practitioners building vision-heavy applications on Claude, including robotics perception pipelines, medical imaging assistants, and spatial reasoning tools, now have data indicating they should benchmark workloads on non-captioned inputs before committing to the platform. If the pattern holds through another full model cycle, Anthropic will face increasing pressure to explain whether the gap reflects an architectural constraint or a training data problem, with direct consequences for enterprise customers running multi-model evaluations.
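
One way to act on that recommendation is to score the same workload twice, once with captions and once with them stripped, and compare accuracy. The sketch below is a minimal illustration, not EyeBench's method: `TaskResult`, its fields, and `caption_gap` are hypothetical names, and it assumes each model answer has already been graded as correct or incorrect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One graded model answer. All field names are hypothetical."""
    task_id: str
    captioned: bool  # True if the prompt included a caption or text label
    correct: bool    # True if the answer matched the reference


def accuracy(results: list[TaskResult]) -> float:
    """Fraction of correct answers; 0.0 when the list is empty."""
    if not results:
        return 0.0
    return sum(r.correct for r in results) / len(results)


def caption_gap(results: list[TaskResult]) -> dict[str, float]:
    """Compare accuracy on captioned inputs against caption-stripped inputs."""
    captioned = [r for r in results if r.captioned]
    stripped = [r for r in results if not r.captioned]
    acc_captioned = accuracy(captioned)
    acc_stripped = accuracy(stripped)
    return {
        "captioned": acc_captioned,
        "stripped": acc_stripped,
        "gap": acc_captioned - acc_stripped,
    }
```

A large positive gap would suggest the model is leaning on text cues rather than on the image itself, which is the failure mode the thread describes.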

Summary

A community benchmark for raw visual perception placed Claude Opus 4.8 last among frontier models, extending a gap that has persisted across multiple Anthropic generations. EyeBench-V3, run by researcher @adonis_singh, strips the language scaffolding that lets models simulate visual understanding via caption matching. Claude Opus 4.7 scored 16% on the prior version, Anthropic's highest result to date, while GPT-5.5 and Gemini 3.5 Flash scored substantially higher. Opus 4.8 maintained that deficit. Essentially, Anthropic, OpenAI, and Google are diverging on spatial perception as a distinct capability axis.

  • EyeBench-V3 surfaces gaps hidden by standard leaderboards that over-index on captioned inputs.
  • Anthropic's ceiling remains 16% across benchmark versions, unchanged despite model iteration.
  • Opus 4.8 shows no meaningful improvement on perceptual tasks that strip away text anchors.

Repeated model upgrades that fail to close the gap point to a structural issue in Anthropic's vision training pipeline, not a one-cycle miss.

Potential risks and opportunities

Risks

  • Enterprise customers using Claude for vision-heavy workloads including document parsing pipelines and medical imaging assistants may migrate to GPT-5.5 or Gemini 3.5 if Anthropic fails to close the gap within the next model cycle
  • Anthropic's positioning in robotics and physical AI partnerships weakens if spatial perception benchmarks become standard criteria in procurement evaluations, where rivals already demonstrate substantially higher scores
  • Developer trust in Anthropic's multimodal roadmap erodes if Opus 4.9 or Claude 5 repeats the same result, converting a community benchmark thread into a sustained reputational signal across multiple model cycles

Opportunities

  • OpenAI and Google can cite EyeBench-V3 results in enterprise sales cycles targeting computer vision and spatial AI applications where Claude is the incumbent model
  • Multimodal benchmark tooling developers and community researchers like @adonis_singh gain credibility and citation leverage as non-captioned evaluation suites attract more adoption from practitioners
  • Vision-specialized fine-tuning services and model routing providers could position themselves as gap-filling layers for Anthropic enterprise customers with spatial reasoning requirements

What we don't know yet

  • Whether @adonis_singh's raw per-task scores for Opus 4.8 have been published or peer-reviewed beyond the original Reddit thread
  • What specific architectural or training data differences explain GPT-5.5 and Gemini 3.5 Flash's substantially higher scores on the same spatial perception tasks
  • Whether Anthropic has internally acknowledged the EyeBench pattern and whether any dedicated vision-focused training runs are scoped for the next model generation