Study: 18.6% of RLHF Harm Labels Flip When Inconsistent Raters Are Filtered
TL;DR
- A new arXiv preprint argues that RLHF annotator responses may not be genuine preferences at all, but responses constructed on the spot.
- Filtering high-inconsistency annotators in two RLHF datasets flipped majority harm classifications for 18.6% of prompts.
- The same filtering shifted mean ratings by more than 13 points on a 100-point scale, suggesting systematic rather than random noise.
An arXiv preprint from Bijean Ghafouri, Eun Cheol Choi, Priyanka Dey and Emilio Ferrara puts a pointed question to the whole preference-learning stack: are the human labels that drive RLHF actually preferences at all, or responses invented on the spot? The paper is titled "RLHF May Not Reflect Genuine Preferences," and the argument is that much of what looks like human values in annotation data is really an elicitation artifact.
Their headline result from two RLHF datasets: when they filtered out the annotators who were most inconsistent with themselves, majority harm classifications flipped for 18.6% of prompts, and mean ratings shifted by over 13 points on a 100-point scale. That is not the size of movement you would expect if disagreements were random noise averaged out by scale. The authors call the pattern "systematic and directionally biased," and describe current practice as one that "models noise as signal and elicitation artifacts as human values."
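To make the mechanism concrete, here is a minimal Python sketch of how dropping self-inconsistent raters can flip a majority harm label. Everything in it is an assumption for illustration, not the authors' method: the toy ratings, the spread-based inconsistency score, the `MAX_SPREAD` threshold, and the `HARM_CUTOFF` of 50 are all hypothetical, since the abstract does not specify how inconsistency was measured.

```python
from collections import defaultdict

# Hypothetical toy data: (annotator, prompt, harm rating on a 0-100 scale).
# Each annotator rates each prompt twice, standing in for a repeat-question check.
RATINGS = [
    ("a1", "p1", 80), ("a1", "p1", 76), ("a1", "p2", 20), ("a1", "p2", 24),
    ("a2", "p1", 70), ("a2", "p1", 74), ("a2", "p2", 30), ("a2", "p2", 26),
    ("a3", "p1", 5),  ("a3", "p1", 65), ("a3", "p2", 40), ("a3", "p2", 0),
    ("a4", "p1", 10), ("a4", "p1", 70), ("a4", "p2", 15), ("a4", "p2", 45),
    ("a5", "p1", 0),  ("a5", "p1", 60), ("a5", "p2", 35), ("a5", "p2", 5),
]
HARM_CUTOFF = 50  # assumed: a mean rating >= 50 counts as a "harmful" vote
MAX_SPREAD = 20   # assumed: annotators above this average spread are filtered


def annotator_inconsistency(ratings):
    """Average max-min spread over each annotator's repeated ratings."""
    by_pair = defaultdict(list)
    for annotator, prompt, score in ratings:
        by_pair[(annotator, prompt)].append(score)
    spreads = defaultdict(list)
    for (annotator, _prompt), scores in by_pair.items():
        if len(scores) > 1:  # only repeated items say anything about consistency
            spreads[annotator].append(max(scores) - min(scores))
    return {a: sum(s) / len(s) for a, s in spreads.items()}


def majority_harm(ratings, keep=None):
    """Per-prompt majority label; each annotator votes with their mean rating.

    Ties resolve to "not harmful" because a strict majority is required.
    """
    by_pair = defaultdict(list)
    for annotator, prompt, score in ratings:
        if keep is None or annotator in keep:
            by_pair[(annotator, prompt)].append(score)
    votes = defaultdict(list)
    for (_annotator, prompt), scores in by_pair.items():
        votes[prompt].append(sum(scores) / len(scores) >= HARM_CUTOFF)
    return {p: sum(v) > len(v) / 2 for p, v in votes.items()}


inconsistency = annotator_inconsistency(RATINGS)
kept = {a for a, spread in inconsistency.items() if spread <= MAX_SPREAD}

before = majority_harm(RATINGS)
after = majority_harm(RATINGS, keep=kept)
flipped = sorted(p for p in before if p in after and before[p] != after[p])
flip_rate = len(flipped) / len(before)

print(flipped, flip_rate)  # on this toy data: ['p1'] 0.5
```

In this toy set, three erratic raters outvote two steady ones on `p1`, so removing them flips the label from "not harmful" to "harmful", while `p2` is unaffected. The real flip rate depends entirely on the chosen inconsistency measure and threshold, which is why the unreported threshold matters.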
The behavioural claim underneath will feel familiar to anyone who has run surveys. In the authors' words, "people produce responses without holding genuine opinions, construct preferences on the spot from contextual cues, and interpret identical questions differently." What the paper proposes in response is a set of diagnostics that locate individual responses on a spectrum from non-attitudes to genuine preferences before those responses become a reward signal. Their framing is that "measurement validity is logically prior to preference aggregation. Before asking how to combine annotations, the field must ask whether the responses being combined are preferences at all."
The honest caveat is that this is a preprint. The abstract does not name the two datasets, the number of annotators filtered, or the threshold used to define high inconsistency, so the effect size is precise but the sample behind it is not, at least at the level published. It also does not say whether the same annotator inconsistency shows up in the proprietary preference data behind frontier models, which is where the practical impact would land.
If the finding generalises, the interesting move for labs is upstream of RLHF: rater screening, repeat-question consistency checks, and reward models that weight annotators rather than treating each vote equally. That is a cheaper intervention than replacing RLHF, and the teams that ship it first would be lifting alignment quality without touching their training recipe.
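As a rough illustration of weighting annotators rather than counting each vote equally, the Python sketch below down-weights raters by a self-inconsistency score. The inverse weighting rule `1 / (1 + spread)`, the function name `weighted_harm_score`, and all numbers are hypothetical choices made for this example; neither the paper's abstract nor any lab's pipeline is known to use them.

```python
def weighted_harm_score(votes, inconsistency):
    """Consistency-weighted mean harm rating for one prompt.

    votes:         {annotator: rating on a 0-100 scale}
    inconsistency: {annotator: average spread between repeated ratings}
    The 1 / (1 + spread) weight is an assumed rule, chosen only so that
    steadier annotators count for more.
    """
    weights = {a: 1.0 / (1.0 + inconsistency[a]) for a in votes}
    total = sum(weights.values())
    return sum(weights[a] * votes[a] for a in votes) / total


# Hypothetical ratings for a single prompt, with per-annotator spread scores
# that would come from repeat-question checks.
votes = {"a1": 78, "a2": 72, "a3": 35, "a4": 40, "a5": 30}
inconsistency = {"a1": 4, "a2": 4, "a3": 50, "a4": 45, "a5": 45}

unweighted = sum(votes.values()) / len(votes)
weighted = weighted_harm_score(votes, inconsistency)

print(round(unweighted, 1), round(weighted, 1))
```

With these made-up numbers the equal-vote mean sits near the middle of the scale, while the weighted score moves toward the two steady raters. Unlike hard filtering, soft weighting keeps every annotator in the pool, at the cost of a second tunable rule that would itself need validating.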
Shared on Bluesky by 2 AI experts
For today's reading group, @esradonmez.bsky.social presented "RLHF May Not Reflect Genuine Preferences" by Ghafouri et al. (2026). Interesting thoughts on whether annotations are actually real preferences! Paper: arxiv.…
Originally reported by arxiv.org
Original headline: RLHF May Not Reflect Genuine Preferences