arxiv.org web signal July 4th 2026

Only 33 of 650 LLM-as-a-Judge papers cover low-resource or multilingual settings

TL;DR

A new arXiv preprint surveyed 650 ACL Anthology papers that mention LLM-as-a-Judge evaluation and found only 33 focused on low-resource or multilingual settings.
The authors warn that LLMs have limited proficiency in low-resource languages and that adequate human validation is often absent from these evaluations.
The audit flags inconsistent evaluation outcomes across studies and widespread reliance on a single judge model per study.

For anyone using LLM-as-a-Judge to grade model outputs, a new arXiv preprint from A. Seza Doğruöz, Xixian Liao, Verena Blaschke, Jakob Prange, Senyu Li and David Ifeoluwa Adelani is the kind of methodology check worth pausing on. The authors surveyed 650 papers from the ACL Anthology that mention LLM-as-a-Judge evaluation and found that only 33 of them focus on low-resource or multilingual settings. That is about five percent (33 / 650 ≈ 5.1%) of the work on a paradigm the abstract calls dominant for natural language generation tasks.

The concern the authors keep circling back to is that LLM-as-a-Judge has, in their words, "high correlations with human judgment, albeit mostly in English", and that "LLMs have limited proficiency in low-resource languages, and there is often no adequate human validation in these settings." Put those two sentences next to each other and you get the shape of the problem. The tooling that has quietly become the default way to say a model is good is being applied across languages where nobody has really checked whether the judge is fit for the job.
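One way to make "has anyone checked the judge in this language?" concrete is to compute judge-versus-human agreement separately per language instead of pooling everything. The sketch below is not from the paper: the record layout, the function names and the language labels are all hypothetical, and Spearman rank correlation is just one reasonable agreement measure.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def agreement_by_language(records):
    """records: iterable of (language, judge_score, human_score) tuples.

    Hypothetical layout; returns {language: Spearman correlation or None}.
    """
    grouped = defaultdict(lambda: ([], []))
    for lang, judge, human in records:
        grouped[lang][0].append(judge)
        grouped[lang][1].append(human)
    return {lang: spearman(j, h) for lang, (j, h) in grouped.items()}
```

A language with no human-scored records simply never appears in the result, which is itself the signal the authors are worried about: no validation, no evidence the judge is fit for that language.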

The audit surfaces three practical issues. Evaluation outcomes are inconsistent across studies. There is excessive trust in the LLM's own judgment when the target language is one it barely speaks. And there is "widespread reliance on a single judge model per study", which means when a particular judge has a blind spot, that blind spot gets baked into whatever claim the paper is making.
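The single-judge issue can be illustrated with a small sketch that scores each item with several judges and flags items where they disagree. This is an assumption-laden illustration, not a method from the paper: the judge names, the score scale and the `max_spread` threshold are all hypothetical.

```python
from statistics import median


def aggregate_judges(scores_by_judge, max_spread=1.0):
    """Combine per-item scores from several judges.

    scores_by_judge: {judge_name: [score per item]}, all lists the same length.
    max_spread: hypothetical threshold; a larger gap between the highest and
    lowest judge score marks the item for human review.
    Returns a list of (median_score, needs_human_review), one entry per item.
    """
    judges = list(scores_by_judge)
    if len(judges) < 2:
        raise ValueError("need at least two judges to measure disagreement")
    n_items = len(scores_by_judge[judges[0]])
    if any(len(scores_by_judge[j]) != n_items for j in judges):
        raise ValueError("every judge must score the same items")

    results = []
    for i in range(n_items):
        item_scores = [scores_by_judge[j][i] for j in judges]
        spread = max(item_scores) - min(item_scores)
        results.append((median(item_scores), spread > max_spread))
    return results
```

Taking the median keeps one judge's blind spot from setting the final score on its own, and the disagreement flag routes contested items to a human instead of trusting any single model.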

The honest caveat is that this is a survey and recommendations paper still under review, submitted on 2 July 2026, and the abstract does not name the specific languages, datasets or judge models the authors examined. So take the 33-of-650 figure as a scoping claim about the shape of the literature, not a benchmark result. The abstract also does not offer a concrete acceptance rule that teams could adopt tomorrow.

Still, the useful thing here is the direction of travel. If you evaluate multilingual products the same way you evaluate English ones, the field's own review of itself says you probably should not.

Originally reported by arxiv.org

Original headline: Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages