arxiv.org web signal July 3rd 2026

EduArt benchmark exposes LLM weakness beyond multiple-choice

TL;DR

EduArt tests LLMs with 871 human-authored art history questions drawn from Italian secondary-school and US AP exams.
The benchmark spans two languages and seven question formats and evaluates twelve models from six provider families.
Top multiple-choice scores neared 94%, while individual models fell as low as 23.9% on open completion and 6.2% on error identification.

An art history benchmark posted to arXiv this month is a small but pointed reminder that how you ask a model matters as much as what you ask. On multiple-choice items, some large language models scored near a 94 percent ceiling. Change the format to error identification and one model dropped to 6.2 percent. Change it to open completion and another fell to 23.9 percent. That is the same underlying knowledge, tested differently, with radically different results.

The paper, posted to arXiv by Gianmarco Spinaci, Lukas Klic, and Giovanni Colavizza, introduces EduArt, a benchmark of 871 human-authored questions drawn from Italian secondary-school exercises and US Advanced Placement Art History exams. It spans two languages and seven question formats, and the authors ran twelve models from six provider families through it under two conditions: an answer-only pass and one that required written justification.

Their headline finding, in the paper's own phrasing, is that single-format benchmarks overestimate what models can reliably do. The mean discrimination index across items was 0.514, with 82.3 percent classified as effective discriminators, which is another way of saying the tasks actually separate stronger from weaker performers rather than being noise. The interesting failure is not that models get art history wrong in general. It is that the same model can look near-expert on one question type and near-useless on another.
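For readers unfamiliar with the statistic, here is a minimal sketch of one common way to compute an item discrimination index: the classical upper-lower method, which compares how often the strongest and weakest test takers get an item right. The summary does not say which formula the authors used, so the 27 percent group size, the 0.3 cutoff for an "effective" item, and the function names below are all assumptions for illustration, not the paper's method.

```python
from statistics import mean


def discrimination_index(item_correct, total_scores, frac=0.27):
    """Upper-lower discrimination index for a single item.

    item_correct: list of 0/1 flags, one per test taker (here, per model).
    total_scores: overall benchmark scores, in the same order.
    frac: share of takers in each extreme group (0.27 is a common
          textbook convention, assumed here).
    """
    n = len(total_scores)
    k = max(1, round(n * frac))
    # Rank takers from weakest to strongest by overall score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = mean(item_correct[i] for i in upper)
    p_lower = mean(item_correct[i] for i in lower)
    return p_upper - p_lower


def share_effective(indices, threshold=0.3):
    """Fraction of items at or above the cutoff (threshold is an assumption)."""
    return sum(d >= threshold for d in indices) / len(indices)


# Toy data: twelve takers, the six strongest answer the item correctly.
scores = list(range(12))
item = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
d = discrimination_index(item, scores)
```

An index near 1 means only strong performers get the item right, near 0 means the item does not separate them, and a negative value means weaker performers do better on it than stronger ones.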

If you are shipping an AI tutor or a grading assistant into a classroom, that gap is worth stress-testing before rollout. Multiple-choice accuracy is the easiest number to advertise and the least representative of what students actually produce in a humanities class. The honest caveat is that this is one benchmark, in one domain, with a specific mix of Italian and English items. The abstract also does not say which of the twelve models hit the 6.2 percent floor or which one sat at the 94 percent ceiling. But the direction of the finding, that question format drives more of a score than most leaderboard numbers admit, is the part practitioners should carry into their own evaluations.
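A per-format breakdown is the simplest way to run that kind of stress test on your own evaluation logs. The sketch below assumes each graded response has already been reduced to a (format, correct) pair. The format labels, function names, and toy results are hypothetical and are not drawn from EduArt.

```python
from collections import defaultdict


def accuracy_by_format(results):
    """Return accuracy per question format.

    results: iterable of (question_format, is_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # format -> [correct, seen]
    for fmt, ok in results:
        totals[fmt][0] += int(ok)
        totals[fmt][1] += 1
    return {fmt: correct / seen for fmt, (correct, seen) in totals.items()}


def format_gap(acc):
    """Return (best format, worst format, accuracy gap between them)."""
    best = max(acc, key=acc.get)
    worst = min(acc, key=acc.get)
    return best, worst, acc[best] - acc[worst]


# Toy results for one model; the numbers are invented for illustration.
results = (
    [("multiple_choice", True)] * 3
    + [("multiple_choice", False)]
    + [("error_identification", True)]
    + [("error_identification", False)] * 3
)
acc = accuracy_by_format(results)
best, worst, gap = format_gap(acc)
```

Reporting the gap alongside the headline accuracy makes it harder for a strong multiple-choice score to hide a weak format.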

Originally reported by arxiv.org

Original headline: EduArt: An educational-level benchmark for evaluating art history knowledge in large language models