
RedVox: Only 8% of Speech AI Releases Test Multilingual Safety

TL;DR

  • Only 8% of the 38 recent speech model releases surveyed document any multilingual safety analysis, according to the RedVox paper.
  • All eight tested systems, including Qwen2-Audio, Voxtral, Gemma 4, Gemini 3.1 variants, and GPT-realtime-2, showed persistent safety issues under non-adversarial conditions.
  • Unsafe responses roughly doubled in non-English settings, and spoken input amplified harmful outputs versus the same text request.

A new benchmark paper puts a number on something a lot of people building voice products have suspected. Of 38 recent speech model releases the authors surveyed, only 8% document any multilingual safety analysis. That is the through-line of RedVox, a paper from Beatrice Savoldi, Sara Papi, Wafa Aissa, Matteo Negri and Luisa Bentivogli that tests eight state-of-the-art systems across five languages: English, French, Italian, Spanish and German.

The eight systems comprise five openly available models (Qwen2-Audio, Phi4-Multimodal, Voxtral, Qwen3-Omni and Gemma 4) and three proprietary ones (Gemini 3.1 variants and GPT-realtime-2). The authors report that all eight show persistent safety issues even under non-adversarial conditions, and that unsafe responses roughly double in non-English languages. Spoken input, they argue, acts as a stressor: the same request delivered by voice tends to produce worse outputs than the text-only equivalent, and stereotypical prompts draw responses the paper calls highly controversial.

Why that matters if you build with these models: the safety story you get from a vendor's release notes is almost certainly an English safety story. If your users speak French, Italian, Spanish or German, the authors' data suggests you are inheriting a materially different risk profile than the one the model card describes, and the gap widens further as soon as the input is voice rather than text. That is a different problem from the usual translation-quality complaint about multilingual models.

The honest caveat is that RedVox covers five European languages and eight named systems, so it is not a claim about every language or every model, and "unsafe" is measured against the authors' own rubric of unfair and stereotypical requests. The paper also does not give you a per-model breakdown at the level a procurement team would want, or a mitigation recipe, since it is a measurement paper rather than a fix. The upside is that a public benchmark makes it harder for the next round of speech model releases to ship without at least addressing the question.
