aclanthology.org web signal

Wikipedia audit finds bots dominate low-resource editions

TL;DR

  • MinHash deduplication removes 28.33% of all non-English Wikipedia articles, mostly from editions known to be dominated by bot-generated content.
  • Cited bot-content shares reach 99% for Cebuano, 90% for Waray and 68% for Swedish Wikipedia, per an Alshahrani et al. 2023 estimate.
  • Language models trained on the filtered Wikipedia largely match or outperform those trained on the raw dumps, with the biggest gains on lower-quality editions.

A group of computational linguists has applied the kind of aggressive filtering pipeline usually reserved for scraped web crawls to a resource that much of multilingual NLP quietly leans on, and the amount of material those filters strip out is substantial. Their ACL 2026 paper, from Kushal Tatariya, Artur Kulmizev and colleagues, audits the entirety of non-English Wikipedia and reports concrete figures for individual editions.

MinHash deduplication alone, they write, affects 28.33% of all non-English Wikipedia articles, though only 8.18% of characters. That gap tells you the removed items are short, on average under a third the length of a typical article (8.18 / 28.33 ≈ 0.29), and largely placeholders and templated boilerplate from editions already known to be dominated by automated contributors. The paper cites an earlier Alshahrani et al. 2023 estimate putting the bot-generated share at 99% for Cebuano, 90% for Waray and 68% for Swedish. Yorùbá Wikipedia loses close to 60% of its articles under exact-match deduplication, many of them entries containing only the single token Ìtọ́kasí (Reference). About 6% of Assamese Wikipedia, they note, consists of foreign characters, with numerous articles written almost entirely in English.
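For readers unfamiliar with the technique, here is a minimal sketch of MinHash near-duplicate detection in plain Python. It is an illustration under stated assumptions, not the paper's pipeline: the shingle size `k`, the number of hash functions `num_perm`, the `threshold` and every function name are hypothetical choices, and the pairwise comparison is only practical for small collections (real pipelines typically add locality-sensitive hashing).

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of character k-grams of lowercased, whitespace-collapsed text."""
    norm = re.sub(r"\s+", " ", text.strip().lower())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed stands in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_duplicates(docs, threshold=0.8, num_perm=64):
    """Return indices of docs that near-duplicate an earlier kept doc (O(n^2))."""
    sigs = [minhash_signature(d, num_perm) for d in docs]
    kept, dropped = [], []
    for i, sig in enumerate(sigs):
        if any(estimated_jaccard(sig, sigs[j]) >= threshold for j in kept):
            dropped.append(i)
        else:
            kept.append(i)
    return dropped
```

Templated stubs that differ only in a name or a number share most of their shingles, so their signatures agree in most slots and they fall above the threshold. That is consistent with the removed articles being short and boilerplate-heavy.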

The authors consolidate these findings into a four-level quality ranking of Wikipedia and then run three language-modelling scenarios on top of it. Their headline downstream result is that models trained on the filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains on the lower-quality editions. For a practitioner that is the useful part. Filtering is not just hygiene for the corpora most people worry about; here it yields a smaller training set that performs as well or better.
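The two simpler checks mentioned earlier, exact-match deduplication and foreign-character share, can be sketched in a few lines. This is a hedged illustration only: the function names are hypothetical, the whitespace normalisation is an assumption, and matching on Unicode character names is a rough stand-in for whatever script detection the authors actually use.

```python
import hashlib
import unicodedata


def exact_dedup(articles):
    """Keep the first article for each distinct whitespace-normalised body."""
    seen, kept = set(), []
    for text in articles:
        normalised = " ".join(text.split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def foreign_char_share(text, expected_script):
    """Fraction of alphabetic characters whose Unicode name lacks expected_script.

    expected_script is an uppercase name fragment such as "BENGALI" (the
    script used for Assamese) or "LATIN". Combining marks and digits are
    not alphabetic and are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(
        1 for ch in letters if expected_script not in unicodedata.name(ch, "")
    )
    return foreign / len(letters)
```

Under these assumptions, an edition where many pages consist of a single heading token collapses sharply under `exact_dedup`, and a page written mostly in English inside a Bengali-script edition scores close to 1.0 on `foreign_char_share(text, "BENGALI")`.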

The honest caveat is that the bot-content percentages the paper leans on for Cebuano, Waray and Swedish come from the earlier 2023 estimate rather than a fresh count in this study, and the abstract does not publish per-language rankings or absolute benchmark scores. Take the specific figures as reported for the editions named, not as settled facts for every language. What the reporting also does not give you is a head-to-head against other multilingual corpora such as Common Crawl or HPLT for the same low-resource languages.

The part worth watching is who benefits. Multilingual model teams get a concrete filtering recipe that reduces training volume largely without a performance hit, and community editors on Cebuano, Waray, Yorùbá and Assamese now have a citable, specific list of failure modes to work against.
