Speech LMs Latently Transcribe Audio to Text, Paper Finds

TL;DR

  • A new paper uses logit lens analysis to show interleaved speech-text models internally produce a readable text transcription in intermediate layers.
  • The text token of the spoken word appears as a top candidate in as much as 77% of the data, despite no speech recognition training.
  • The authors describe three phases: implicit transcription, next-word prediction in text space, then transformation back to the speech domain.

A new speech-LM paper contains a quiet result that complicates the design story behind one of the more popular architectures. In the arXiv paper, Talia Sternberg, Gallil Maimon and Yossi Adi argue that a model trained on interleaved speech and text tokens latently routes its computation through text: the spoken word becomes decodable as a text token in the intermediate layers before the model predicts what comes next.

The evidence comes from logit lens analysis across different model families and sizes. The authors decode what each intermediate layer would predict on its own, and report that the text token of the spoken word shows up as one of the top candidate words in as much as 77% of the data, despite the model never being trained for speech recognition. They describe a three-phase pattern: an implicit transcription stage, a next-word prediction stage that happens in text space, and then a transformation back to the speech domain on the way out.
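For readers unfamiliar with the probe, here is a minimal sketch of the logit lens idea in plain Python. It is not the authors' code: the helper names, the parameter-free layer norm, the identity unembedding matrix, and the toy hidden states are all assumptions made for illustration. Each intermediate hidden state is projected through the output (unembedding) matrix, and the check is whether the target text token lands in the top-k candidates at that layer.

```python
import math

def layer_norm(h, eps=1e-5):
    # Simplified normalization without learned scale/shift (an assumption;
    # real models apply their own final norm before the unembedding).
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def lens_logits(h, unembed):
    # Project one hidden vector (length d_model) through
    # the unembedding matrix (d_model x vocab).
    normed = layer_norm(h)
    vocab_size = len(unembed[0])
    return [
        sum(normed[i] * unembed[i][v] for i in range(len(normed)))
        for v in range(vocab_size)
    ]

def topk_hit_rate(hidden_states, unembed, target_ids, k=5):
    # hidden_states[layer][position] is a hidden vector.
    # Returns, per layer, the fraction of positions whose target token
    # is among the k highest-scoring tokens decoded from that layer.
    rates = []
    for layer in hidden_states:
        hits = 0
        for h, target in zip(layer, target_ids):
            logits = lens_logits(h, unembed)
            ranked = sorted(range(len(logits)), key=lambda v: logits[v], reverse=True)
            hits += target in ranked[:k]
        rates.append(hits / len(target_ids))
    return rates

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Toy setup (hypothetical): vocabulary of 4 tokens, identity unembedding,
# two audio positions whose spoken words map to text tokens 0 and 1.
VOCAB = 4
unembed = [one_hot(i, VOCAB) for i in range(VOCAB)]
target_ids = [0, 1]
hidden_states = [
    [one_hot(2, VOCAB), one_hot(3, VOCAB)],  # early layer: targets not decodable
    [one_hot(0, VOCAB), one_hot(1, VOCAB)],  # middle layer: both targets decodable
    [one_hot(0, VOCAB), one_hot(3, VOCAB)],  # late layer: only one still decodable
]

rates = topk_hit_rate(hidden_states, unembed, target_ids, k=1)
print(rates)
```

In a real analysis the hidden states would come from a forward pass over speech input and the unembedding matrix from the model itself; a per-layer hit rate of this kind is the sort of quantity a figure like the 77% would summarize.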

The reason this matters for anyone building or evaluating speech LMs is that the design rationale and the actual mechanism may not match. The interleaved approach is sold as a way to boost speech-only capabilities by mixing in text; this analysis suggests text is closer to the load-bearing path than the framing implies. The authors say they also looked at the role of interleaving data and initialization from text LMs in eliciting the behavior, which points at the knobs that matter: data recipe and text-LM initialization, not necessarily exotic speech-native architectures.

The honest caveat is that 'as much as 77%' is a ceiling figure, not an average across all inputs, and logit lens decoding is a useful but blunt probe; a token being decodable at a layer is not the same as the model committing to it. The abstract also does not name the specific model families and sizes tested, nor does it say whether the pattern holds for accented or non-English speech, or for audio whose meaning depends on prosody the transcript cannot carry.

What to watch next is whether follow-up work confirms that the implicit transcription stage is load-bearing. If it does, the next generation of speech LMs could get cheaper to develop, because builders could deliberately optimize a stage that is already there rather than chase a speech-native mechanism that today's systems may not actually implement.