HONEST benchmark finds 4.3% of LM completions are hurtful
TL;DR
- The HONEST benchmark found language models produced a hurtful word in roughly 4.3% of sentence completions across the study.
- When prompts targeted women, 9% of completions referenced sexual promiscuity; when they targeted men, 4% referenced homosexuality.
- The template- and lexicon-based methodology was applied across six languages: English, Italian, French, Portuguese, Romanian, and Spanish.
A 2021 NAACL paper out of Bocconi's MilaNLP lab argued that the simplest way to find out what a language model thinks of you is to give it a prompt about you and watch how it finishes the sentence. The result Debora Nozza, Federico Bianchi, and Dirk Hovy landed on, in their paper "HONEST: Measuring Hurtful Sentence Completion in Language Models," is uncomfortable but specific. Across the study, language models produced a hurtful word in roughly 4.3% of sentence completions.
The headline percentage is not the most interesting number in the paper. When the target of the prompt was female, 9% of completions referenced sexual promiscuity. When the target was male, 4% referenced homosexuality. Those are not random hits. The authors describe them as language and gender specific patterns that reflect societal stereotypes about gender roles.
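A per-target breakdown like the one above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the lexicon entries, category labels, and sample records are all hypothetical placeholders, and the function name `category_shares` is invented for this example.

```python
from collections import Counter, defaultdict

# Toy lexicon (assumption): placeholder words mapped to hypothetical category labels.
LEXICON = {"placeholder_p": "promiscuity", "placeholder_h": "homosexuality"}


def category_shares(records, lexicon):
    """Return {group: {category: share of that group's completions}}."""
    totals = Counter()
    hits = defaultdict(Counter)
    for group, completion in records:
        totals[group] += 1
        # Count each category at most once per completion.
        cats = {lexicon[w] for w in completion.lower().split() if w in lexicon}
        for cat in cats:
            hits[group][cat] += 1
    return {g: {c: n / totals[g] for c, n in hits[g].items()} for g in totals}


# Made-up (target group, completion) pairs, for illustration only.
records = [
    ("female", "placeholder_p"),
    ("female", "teacher"),
    ("male", "placeholder_h"),
    ("male", "doctor"),
    ("male", "engineer"),
    ("male", "driver"),
]
shares = category_shares(records, LEXICON)
```

Dividing by each group's own completion count, rather than by the overall total, is what makes a statement like "9% of completions for female targets" comparable with "4% for male targets."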
What gives HONEST teeth as a benchmark is that the same template- and lexicon-based methodology runs across six languages (English, Italian, French, Portuguese, Romanian, and Spanish), so the same probe can be pointed at models trained on very different corpora. For teams shipping multilingual systems, that is the kind of measurement that lets you talk about harm in concrete numbers rather than in vibes.
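To make "template- and lexicon-based" concrete, here is a simplified Python sketch of the general idea: fill identity terms into templates, collect a model's completions, and count how many contain a lexicon word. It is an assumption-laden illustration rather than the paper's implementation. The templates, identity terms, lexicon entries, and the `fake_complete` stand-in for a real model are all hypothetical, and the score here is a plain fraction of flagged completions.

```python
# Hypothetical templates; "[T]" marks the slot for an identity term.
TEMPLATES = ["[T] should work as a", "[T] is known as a"]
# Hypothetical identity terms, grouped by target.
TARGETS = {
    "female": ["the woman", "the daughter"],
    "male": ["the man", "the son"],
}
# Toy lexicon (assumption): a set of placeholder "hurtful" words.
LEXICON = {"slur_a", "slur_b"}


def build_prompts(templates, targets):
    """Expand every template with every identity term."""
    return [
        (group, template.replace("[T]", term))
        for group, terms in targets.items()
        for term in terms
        for template in templates
    ]


def is_hurtful(completion, lexicon):
    """True if any whitespace-separated token is in the lexicon."""
    tokens = (w.strip(".,!?;:\"'").lower() for w in completion.split())
    return any(tok in lexicon for tok in tokens)


def hurtful_rate(prompts, complete, lexicon, k=1):
    """Fraction of the top-k completions that contain a lexicon word."""
    total = hits = 0
    for _, prompt in prompts:
        for completion in complete(prompt, k):
            total += 1
            hits += is_hurtful(completion, lexicon)
    return hits / total if total else 0.0


def fake_complete(prompt, k):
    """Stand-in for a real model: deterministic, for illustration only."""
    word = "slur_a" if "woman" in prompt else "teacher"
    return [word] * k


prompts = build_prompts(TEMPLATES, TARGETS)
rate = hurtful_rate(prompts, fake_complete, LEXICON, k=1)
```

Swapping `fake_complete` for a call into a real model, and the toy lists for per-language templates and a per-language lexicon, is what would let the same probe run unchanged across languages.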
The honest caveat is that this is a measurement paper, not a fix. It tells you what a model coughs up on a constrained set of prompts; it does not promise that a model scoring well on HONEST will behave well downstream, and it does not provide a remediation recipe. What the reporting does not give you is how the picture has shifted since 2021 as instruction-tuning and alignment work have become the default.
Five years on, the value of work like this is less the original numbers and more the methodology and the public scoring package. The Python implementation on GitHub means any team fine-tuning a model in one of the supported languages can run the same probe and get a comparable number, which is the part that keeps it useful.
Originally reported by aclanthology.org
Original headline: HONEST: Measuring Hurtful Sentence Completion in Language Models