Universal Speech Enhancement Model Adds Runtime Latency Control

TL;DR

  • A single speech enhancement model exposes both algorithmic and computational latency as inference-time controls instead of requiring separately trained per-budget models.
  • Algorithmic latency is set via configurable look-ahead frames with parallel convolutional layers; computational latency is set via an early-exit mechanism that stops inference at different network depths.
  • A two-stage training strategy with a shared-to-multiple decoder transition is used to narrow the gap between the flexible model and per-latency specialists.

Real-time audio cleanup is fragmented in an annoying way. Different applications have different latency budgets, and the usual answer has been to train a separate enhancement model for each budget. A new paper posted to arXiv, titled "One Model, Many Latencies," proposes collapsing that into one model where you dial in latency at inference time.

The proposal splits latency control into two knobs. Algorithmic latency, the look-ahead the model is allowed, is exposed as a configurable number of look-ahead frames. Parallel convolutional layers are tied to the different look-ahead settings, so no single branch has to learn under several padding configurations. Computational latency, how much compute you spend per frame, is exposed through an early-exit mechanism that lets you stop inference at different network depths.
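To make the two knobs concrete, here is a toy sketch in plain Python. It is not the paper's code. The function and class names, the one-scalar-per-frame representation, and the window and hop sizes are all assumptions for illustration, and the latency formula is the common STFT-style estimate rather than anything stated in the abstract.

```python
from typing import Callable, Dict, List, Sequence

Frames = List[float]  # toy stand-in: one scalar feature per time frame


def algorithmic_latency_ms(lookahead_frames: int, hop_ms: float, window_ms: float) -> float:
    """Common STFT-style estimate (assumed): one analysis window plus look-ahead hops."""
    return window_ms + lookahead_frames * hop_ms


def conv1d_lookahead(frames: Sequence[float], kernel: Sequence[float], lookahead: int) -> Frames:
    """Temporal convolution where output t sees frames t-left .. t+lookahead."""
    k = len(kernel)
    if not 0 <= lookahead < k:
        raise ValueError("lookahead must be in [0, kernel size)")
    left = k - 1 - lookahead  # past context fills whatever look-ahead does not use
    padded = [0.0] * left + list(frames) + [0.0] * lookahead
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(frames))]


class ParallelLookaheadConv:
    """Hypothetical layer: one kernel per supported look-ahead setting."""

    def __init__(self, kernels: Dict[int, Sequence[float]]):
        self.kernels = kernels  # maps look-ahead frames -> that branch's weights

    def __call__(self, frames: Sequence[float], lookahead: int) -> Frames:
        if lookahead not in self.kernels:
            raise ValueError(f"no branch for lookahead={lookahead}")
        return conv1d_lookahead(frames, self.kernels[lookahead], lookahead)


def run_with_early_exit(frames: Sequence[float],
                        blocks: Sequence[Callable[[Frames], Frames]],
                        exit_depth: int) -> Frames:
    """Run only the first `exit_depth` blocks; a smaller depth means less compute per frame."""
    if not 1 <= exit_depth <= len(blocks):
        raise ValueError("exit_depth must be between 1 and len(blocks)")
    out = list(frames)
    for block in blocks[:exit_depth]:
        out = block(out)
    return out


if __name__ == "__main__":
    noisy = [1.0, 2.0, 3.0, 4.0]
    layer = ParallelLookaheadConv({0: [1.0, 1.0, 1.0], 1: [1.0, 1.0, 1.0]})
    print(layer(noisy, lookahead=0))  # fully causal branch
    print(layer(noisy, lookahead=1))  # branch that peeks one frame ahead
    # Assumed 20 ms window and 10 ms hop, purely for illustration.
    print(algorithmic_latency_ms(1, hop_ms=10.0, window_ms=20.0))
    halve = lambda v: [s * 0.5 for s in v]  # toy "denoising" block
    print(run_with_early_exit(noisy, [halve, halve, halve], exit_depth=2))
```

The point of the sketch is that both settings are plain arguments at call time: the look-ahead picks which branch runs and how the input is padded, and the exit depth picks how many blocks execute.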

The gap to close is the obvious one. A unified model tends to underperform a specialist trained for one latency point. The authors say they narrow that gap with a two-stage training strategy that transitions from a shared decoder to multiple decoders, though the abstract does not pin down by how much.
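The abstract does not say how the shared-to-multiple transition is carried out. One plausible reading, sketched below as an assumption only, is that stage one trains a single decoder attached to every exit point, and stage two gives each exit its own decoder initialized from the shared weights. The `Decoder` class and both helper functions are hypothetical names, and the "weights" are a single float to keep the example self-contained.

```python
import copy
from typing import Dict, Sequence


class Decoder:
    """Toy decoder: a single scalar weight standing in for real parameters."""

    def __init__(self, weight: float = 1.0):
        self.weight = weight

    def __call__(self, hidden: float) -> float:
        return self.weight * hidden


def stage_one_decoders(exit_depths: Sequence[int], shared: Decoder) -> Dict[int, Decoder]:
    # Stage 1 (assumed): every exit depth points at the same decoder object.
    return {depth: shared for depth in exit_depths}


def stage_two_decoders(stage_one: Dict[int, Decoder]) -> Dict[int, Decoder]:
    # Stage 2 (assumed): each exit gets an independent copy, so further
    # training can specialize it without disturbing the other exits.
    return {depth: copy.deepcopy(dec) for depth, dec in stage_one.items()}


if __name__ == "__main__":
    shared = Decoder(weight=0.8)
    s1 = stage_one_decoders([2, 4, 6], shared)
    s2 = stage_two_decoders(s1)
    s2[2].weight = 0.5  # pretend fine-tuning changed only the shallowest exit
    print(s1[2] is s1[6])                # True: one shared decoder in stage 1
    print(s2[2].weight, s2[6].weight)    # 0.5 0.8: exits diverge in stage 2
```

Starting the per-exit decoders from the shared one is a design guess here, not a claim about the authors' method; the paper itself is the place to check how the transition is actually done.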

The honest caveat is that the abstract does not give specific latency ranges in milliseconds, does not name benchmark datasets, and does not publish numbers for the specialist-versus-flexible quality gap. What it does give is a downloadable model: the released weights are hosted on Hugging Face under an NVIDIA-named org, which is the strongest signal in the paper that this is meant to be picked up and used. For anyone shipping a real-time audio product, that is the part to watch. A single weights file that drops into different deployment targets without retraining is cheaper to maintain than a fleet of specialists, and the open question is whether the quality gap to those specialists is small enough to make the consolidation worth it.