arxiv.org web signal

ComplexityMT finds machine translation shifts CEFR levels

TL;DR

  • ComplexityMT benchmarks machine translation across Arabic, Dutch, English, French, Hindi and Russian using CEFR as the measure of text complexity.
  • Higher CEFR levels make texts more difficult to translate, and MT systems shift the CEFR level of the target versus the source for most languages.
  • The study evaluates three open-weight models, one closed model and one commercial machine translation system on two CEFR-grounded tasks.

Translation isn't just about getting the meaning across; the difficulty level of the source matters too, and a new benchmark on arXiv suggests machine translation systems quietly shift that difficulty in the target. The paper, ComplexityMT, uses the Common European Framework of Reference for Languages (CEFR), whose A1 to C2 levels are familiar to anyone who has taken a language class, as its measure of text complexity. It applies that measure across Arabic, Dutch, English, French, Hindi and Russian.

The authors test three open-weight models, one closed model, and one commercial machine translation system on two tasks: whether CEFR correlates with how hard a text is to translate, and whether translations preserve the source's CEFR level. On both counts, complexity turns out to matter. Higher CEFR levels make texts more difficult to translate, and machine translation shifts the CEFR level of the target compared to the original source for most of the languages tested.

That shift is the part worth dwelling on if you build anything that uses MT to feed language learners or to localize material aimed at a specific reading level. A piece of text written at B1 that comes out at A2 or B2 on the other side isn't a failed translation in the usual fluency sense; it just isn't the same difficulty band, which is a problem if a curriculum or accessibility setting expected the original level to hold.
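To make the idea of a level shift concrete, here is a minimal Python sketch of how a team might track it in their own pipeline. It is not the paper's method, which the abstract does not describe. The function names `cefr_shift` and `summarize_shifts` are hypothetical, and the sketch assumes you already have a CEFR label for each source text and its translation, for example from human raters or a classifier of your choosing.

```python
# CEFR bands in ascending order of difficulty.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
CEFR_RANK = {level: rank for rank, level in enumerate(CEFR_LEVELS)}


def cefr_shift(source_level: str, target_level: str) -> int:
    """Signed number of bands between source and target (hypothetical helper).

    Negative means the translation reads easier than the source,
    positive means harder, and zero means the band was preserved.
    """
    return CEFR_RANK[target_level.upper()] - CEFR_RANK[source_level.upper()]


def summarize_shifts(pairs):
    """Summarize an iterable of (source_level, target_level) label pairs.

    Returns the share of pairs that kept their band and the mean signed shift.
    """
    shifts = [cefr_shift(src, tgt) for src, tgt in pairs]
    if not shifts:
        raise ValueError("need at least one (source, target) pair")
    preserved_rate = sum(1 for s in shifts if s == 0) / len(shifts)
    mean_shift = sum(shifts) / len(shifts)
    return {"preserved_rate": preserved_rate, "mean_shift": mean_shift}
```

Under these assumptions, `cefr_shift("B1", "A2")` returns -1 and `cefr_shift("B1", "B2")` returns 1, matching the two cases above. Treating the six bands as evenly spaced integers is a simplification; a mean shift near zero can also hide large moves in both directions, so the preserved rate is worth reporting alongside it.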

The honest caveat is that the abstract keeps the specifics tight. It does not name which open-weight or closed models were used, does not say in which direction the CEFR drift tends to go, and does not break down which language pairs were worst affected. Take it as a framing benchmark rather than a verdict on any one system.

The thread to pull on is what happens once CEFR-shift becomes a routine metric alongside the usual translation quality scores. EdTech platforms, accessibility tools, and translation vendors all have a reason to care, because complexity-preserving output might end up being the next axis of differentiation in a market that has mostly competed on fluency.
