LAIT study: readers still prefer humans for literary translation
TL;DR
- Readers picked human translations in 19 of 30 excerpt comparisons and 522 of 772 chunk-level pairs across French, Polish, and Japanese novels.
- When asked to identify which version was human, 15 readers got it right only 17 of 30 times, close to chance.
- The authors report that automatic metrics, including LLM-as-a-judge setups, fail to recover reader preferences and favor machine translations.
The interesting bit in a new arXiv preprint isn't that AI can translate literary fiction adequately. It's that adequate isn't what readers actually want, and the automatic scorekeepers the field relies on don't seem to know that.
The setup is small but pointed. Fifteen avid readers compared human and machine translations of 15 recent novels from French, Polish, and Japanese into English, in two modes: full excerpts and finer-grained aligned chunks. Machine output was rated 'fine'. Readers still preferred the human version in 19 of 30 excerpt comparisons and in 522 of 772 chunk-level pairs. They said they valued human translations for 'ease, clarity, and immersive nature'.
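To put the two headline counts on the same scale, here is a minimal sketch that converts them to shares. Only the four counts come from the study as reported; the function name `share` is my own.

```python
def share(wins: int, total: int) -> float:
    """Fraction of comparisons in which the human translation was preferred."""
    if total <= 0:
        raise ValueError("total must be positive")
    return wins / total

# Counts as reported: 19 of 30 excerpt comparisons, 522 of 772 chunk-level pairs.
excerpt_share = share(19, 30)
chunk_share = share(522, 772)

print(f"excerpt level: {excerpt_share:.1%}")  # 63.3%
print(f"chunk level:   {chunk_share:.1%}")    # 67.6%
```

Both granularities land at roughly two in three, so the preference is a clear lean rather than a landslide.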
The blind-guessing result is the one that stuck with me. Readers correctly identified which version was human only 17 of 30 times, roughly a coin flip, and the authors report that readers 'tend to prefer the version they believe to be human'. Preference is partly a story about attribution, not only about the prose on the page. That is a useful caution for anyone reading vendor demos of AI translation quality.
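For a rough sense of how close to a coin flip 17 of 30 is, the sketch below runs an exact two-sided binomial test against a 50% guessing rate. Treating the 30 identifications as independent trials is my simplifying assumption, not something the paper states (the same 15 readers made them), so read the number as a back-of-the-envelope check rather than the authors' analysis.

```python
from math import comb

def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided p-value against a fair coin (p = 0.5).

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    is the total probability of outcomes at least as far from trials / 2
    as the observed count.
    """
    distance = abs(successes - trials / 2)
    extreme = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= distance
    )
    return extreme / 2 ** trials

# 17 correct identifications out of 30, as reported.
p_value = two_sided_binomial_p(17, 30)
print(f"p = {p_value:.3f}")  # p = 0.585
```

A p-value near 0.58 means a result at least this lopsided turns up more often than not under pure guessing, which fits the "roughly a coin flip" reading.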
Why this matters if you build or buy translation tools: the paper argues that 'automatic metrics, including LLM-as-a-judge approaches, fail to recover reader preferences and favor MT'. That is the uncomfortable part. The scoreboard your system optimizes against may be quietly telling you a different story than the humans it is meant to serve. Alongside the paper the authors are releasing LAIT (Literary AI Translation), a collection of roughly 1,000 reader comments, 2,000 judgments, and 7,200 span-level annotations, meant to give the field a reader-grounded benchmark to correct against.
The honest caveat is scale. Fifteen readers and fifteen novels across three source languages is a signal, not a settled verdict, and the study is about literary prose specifically. What the reporting here doesn't give you is a system-by-system comparison, a breakdown of which literary features the machine drops, or evidence that a smarter automated metric could close the gap. For publishers and translation vendors experimenting with post-editing pipelines, the takeaway is still worth sitting with: the metric that says 'the model is fine' and the reader who says 'I preferred the other one' can both be right at the same time.
Original headline: AI Literary Translation Is Fine—But Readers Still Prefer Human Translators, Even When They Can't Tell Them Apart