
Open recipe ships per-language BPE tokenizers for Whisper

TL;DR

  • A new open-source repo publishes byte-level BPE tokenizers for languages including Kamba, Arabic, Mandarin, Cantonese, Japanese, Spanish, and Swahili.
  • The recipe keeps a 51,865-token vocabulary matching Whisper-large-v3 and lifts the max token length from 16 to 32 bytes for multi-byte scripts.
  • An audit across 102 FLEURS languages reports cross-word merges dropping from 2,656,091 to zero, with 100% round-trip integrity.

This small open-source release is worth a look if you fine-tune Whisper on languages the base model handles poorly. A new GitHub repo publishes per-language byte-level BPE tokenizers, plus a reproducible recipe to train more, keeping the 51,865-token vocabulary size that Whisper-large-v3 expects. Pre-built tokenizers ship for Kamba, Arabic, Mandarin, Cantonese, Japanese, Spanish, and Swahili, and per the README the trainer supports 100+ languages from the FLEURS dataset.

The interesting part is what the v4 recipe actually fixes. Per the repository, it resolves a regex-wrapping bug, bumps the max token length from 16 to 32 bytes so multi-byte scripts have room to form real merges, and guarantees all 256 initial bytes survive training. The author reports an audit across 102 languages where cross-word merges drop from 2,656,091 to zero, with 100% round-trip integrity.
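To make those three properties concrete, here is a minimal Python sketch of what such an audit could check on a byte-level vocabulary. It is not the repository's code: the helper names (`audit_vocab`, `is_cross_word`, `chars_that_fit`), the toy vocabulary, and the definition of a cross-word merge (a space byte sitting between two non-space bytes) are all assumptions made for illustration. It also works on raw bytes rather than on the byte-to-character mapping that GPT-2-style tokenizers use.

```python
from typing import Dict, Iterable

MAX_TOKEN_BYTES = 32  # cap reported for the v4 recipe (previously 16)


def is_cross_word(token: bytes) -> bool:
    """Assumed definition: a space byte sits between two non-space bytes."""
    return b" " in token.strip(b" ")


def audit_vocab(tokens: Iterable[bytes]) -> Dict[str, object]:
    """Run three structural checks on a byte-level vocabulary."""
    vocab = set(tokens)
    return {
        # Every single byte value 0..255 must exist as its own token.
        "all_256_bytes_present": all(bytes([b]) in vocab for b in range(256)),
        # Count tokens longer than the byte cap.
        "tokens_over_cap": sum(len(t) > MAX_TOKEN_BYTES for t in vocab),
        # Count tokens that span a word boundary (assumed definition above).
        "cross_word_tokens": sum(is_cross_word(t) for t in vocab),
    }


def chars_that_fit(sample_char: str, cap_bytes: int) -> int:
    """How many copies of one character fit under a byte cap in UTF-8."""
    return cap_bytes // len(sample_char.encode("utf-8"))


# Toy vocabulary: the 256 base bytes plus a few invented merges.
toy_vocab = {bytes([b]) for b in range(256)}
toy_vocab |= {b" the", b"of the", "東京".encode("utf-8")}

report = audit_vocab(toy_vocab)
print(report)

# A 3-byte CJK character: 5 fit under a 16-byte cap, 10 under 32.
print(chars_that_fit("語", 16), chars_that_fit("語", 32))
```

In this toy vocabulary only `b"of the"` counts as a cross-word token, since a leading space as in `b" the"` is the normal word-start convention. The last line shows why the cap matters for multi-byte scripts: at three UTF-8 bytes per character, a 16-byte limit stops a merged token at five characters.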

Why should you care if you are not training tokenizers yourself? A per-language BPE that stays byte-compatible with Whisper is the kind of quiet plumbing that can meaningfully help fine-tuning on low-resource languages, especially ones with non-Latin scripts. The alternative is either living with the default multilingual tokenizer or rolling your own, and neither is great if you are, say, a researcher trying to get Whisper to behave on Kamba or Swahili.

The honest caveat is that the repo publishes tokenizer quality metrics, not downstream ASR quality metrics. Zero cross-word merges and clean round-trips are the tokenizer's own health check, not a promise that your Whisper fine-tune will reach a lower word error rate than it would with the default tokenizer. The README itself flags a diagnostic example it calls the 'Kamba rep-trap', which illustrates memorization risks on small corpora. This is also a solo project, so anyone depending on it should mirror the code, which the MIT license permits, rather than assume it will keep pace with future Whisper releases. Take the specifics as reported, not settled.
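A round-trip check is easy to state in code, which also shows how little it says about recognition quality. The Python sketch below is an assumption-laden illustration, not the repository's audit: `round_trip_count` is a hypothetical helper, the sample strings are made up, and a plain UTF-8 byte codec stands in for a real tokenizer.

```python
from typing import Callable, Iterable, List, Tuple


def round_trip_count(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> Tuple[int, int]:
    """Return (passed, total): how many texts survive encode then decode."""
    texts = list(texts)
    passed = sum(decode(encode(t)) == t for t in texts)
    return passed, len(texts)


def byte_encode(text: str) -> List[int]:
    """Stand-in tokenizer: one id per UTF-8 byte, so nothing is lost."""
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


def lossy_encode(text: str) -> List[int]:
    """A deliberately broken variant that lowercases before encoding."""
    return byte_encode(text.lower())


# Made-up sample sentences in three scripts.
samples = ["Habari ya asubuhi", "こんにちは", "مرحبا"]

print(round_trip_count(byte_encode, byte_decode, samples))
print(round_trip_count(lossy_encode, byte_decode, samples))
```

The lossless codec passes all three samples, while the lowercasing variant fails on the one string that contains an uppercase letter. Nothing in this check touches audio or a model, which is why a perfect score here cannot stand in for a word error rate measured on a fine-tuned Whisper.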

Still, low-resource ASR remains one of the areas where open recipes move faster than any single lab, and a compact, reproducible piece of tokenizer plumbing that anyone can extend to more of the FLEURS language set is a small good thing to have in the ecosystem.
