CUHK-ByteDance FlexiSLM makes a 7B speech LM's frame rate tunable
TL;DR
- FlexiSLM is presented as the first spoken language model with dynamic and controllable frame rates on both speech input and output.
- At 6.25 Hz, the 7B model roughly halves inference time versus 12.5 Hz while retaining strong speech-to-speech quality, per the authors.
- The authors report FlexiSLM outperforms fixed-rate Qwen2.5-Omni and Kimi-Audio at high-quality operating points, and can be steered down to 4.0 Hz.
A new paper from The Chinese University of Hong Kong, Shenzhen and ByteDance turns a small but underexplored dial on spoken language models: instead of committing to a fixed audio token rate, the model can decide, on the fly, how coarsely to represent speech. That single change, according to the paper on Hugging Face, lets one deployed model span a range of quality-speed operating points that today usually takes several separate models to cover.
Existing spoken language models fix their frame rate in advance: Qwen2.5-Omni runs at 25 Hz and Kimi-Audio at 12.5 Hz. FlexiSLM instead uses a frame-merging module that groups redundant adjacent frames by cosine similarity, and adds a conditioning signal so a user can directly specify the desired average output frame rate at inference. The 7B model, initialized from Qwen2.5-7B-Instruct, is described as the first SLM that supports dynamic and controllable frame rates on both speech input and output.
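To make the merging idea concrete, here is a minimal sketch of similarity-based frame merging. It is an illustration only, not the paper's module: the greedy grouping rule, the 0.9 threshold, the mean-pooling step, and the function names `merge_adjacent_frames` and `effective_rate` are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_adjacent_frames(frames, threshold=0.9):
    """Group consecutive frames and average each group.

    A frame joins the current group when its cosine similarity to the
    immediately preceding frame is at least `threshold` (a hypothetical
    value chosen for illustration); otherwise it starts a new group.
    """
    if not frames:
        return []
    groups = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) >= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    merged = []
    for group in groups:
        dim = len(group[0])
        # Mean-pool the group into a single representative frame.
        merged.append([sum(f[i] for f in group) / len(group) for i in range(dim)])
    return merged


def effective_rate(input_rate_hz, n_in, n_out):
    """Average frame rate after merging n_in frames down to n_out."""
    return input_rate_hz * n_out / n_in


# Toy 2-dimensional "frames": two near-duplicates, two more near-duplicates, one outlier.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 1.0]]
merged = merge_adjacent_frames(frames)
print(len(frames), "->", len(merged), "frames")
print("effective rate:", effective_rate(12.5, len(frames), len(merged)), "Hz")
```

In this toy setup, raising the threshold merges fewer frames and keeps the rate high, while lowering it merges more aggressively. How FlexiSLM maps a user-specified target rate onto its merging behavior is described in the paper, not here.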
The reported numbers are the interesting part. The authors say FlexiSLM outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at its high-quality operating points, can be accurately steered down to 4.0 Hz, and at 6.25 Hz roughly halves inference time relative to 12.5 Hz while retaining strong speech-to-speech quality. Below that, the paper describes graceful degradation at 5.0 Hz and 4.0 Hz rather than parity, which is a more honest framing than the headline claim suggests.
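A quick back-of-the-envelope calculation shows how the speech sequence length scales with frame rate. Frame count is only a rough proxy for inference time; the clip length and the helper name `frames_for` are assumptions for illustration, and none of the printed figures are measurements from the paper.

```python
def frames_for(rate_hz, seconds):
    """Number of speech frames needed to cover `seconds` of audio at `rate_hz`."""
    return rate_hz * seconds


CLIP_SECONDS = 60  # hypothetical one-minute utterance
BASELINE_HZ = 12.5

for rate in (25.0, 12.5, 6.25, 5.0, 4.0):
    n = frames_for(rate, CLIP_SECONDS)
    ratio = rate / BASELINE_HZ
    print(f"{rate:>5} Hz: {n:6.0f} frames ({ratio:.2f}x the 12.5 Hz sequence length)")
```

Halving the rate from 12.5 Hz to 6.25 Hz halves the number of speech frames the model has to process, which is consistent in direction with the reported speedup, though real latency also depends on factors this arithmetic ignores.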
Why this matters for anyone deploying voice models: the fixed-rate assumption has been a hidden constraint on where speech agents can run and how they respond under load. If one model can drop to 5 or 4 Hz on a phone or a saturated GPU without a redeploy, product teams get a real speed lever instead of maintaining a fast small model alongside a slow good one. What the paper does not give you is a head-to-head against patching-based low-rate systems like Fun-Audio-Chat or Mimo-Audio, absolute latency on real hardware, or behavior of the rate condition on multilingual or noisy audio the training mix does not cover well. Take the specifics as reported, not settled.
If it holds up under outside evaluation, the downstream story is deployment. Code is slated for release, and the dynamic-rate strategy is called out as complementary to the patching tricks other labs already ship, so the more likely outcome is dynamic merging stacking onto existing SLM backbones rather than replacing them.
Originally reported by huggingface.co
Original headline: HF Paper 'FlexiSLM': First Dynamic and Controllable Frame Rate Spoken Language Model Adjusts Audio Token Rate on the Fly, Beats Fixed-Rate SLMs on Latency-Quality Tradeoff