Probing Wav2Vec2 and HuBERT shows when linguistic structure emerges
TL;DR
- Researchers probed six Wav2Vec2 and HuBERT models trained on spoken Dutch across layers and intermediate training checkpoints.
- Different levels of linguistic structure showed distinct layerwise patterns and learning trajectories inside the same speech models.
- Higher-order prediction tasks using iteratively refined pseudo-labels induced greater parallelism across the models' internal layers.
A paper posted to arXiv in early April looks at a question that quietly matters if you use self-supervised speech models in production: not whether these models encode linguistic structure, which is by now well established, but *when* during training that structure actually shows up, and *where* in the layer stack it ends up sitting.
The authors, Marianne de Heer Kloots and colleagues, study six Wav2Vec2 and HuBERT models trained on spoken Dutch, probing them across layers and across intermediate training checkpoints. Their headline finding is that different levels of linguistic structure show, in their words, "notably distinct layerwise patterns as well as learning trajectories." Some things settle in early, some late, and they do not all live in the same layer. They argue this can be partly explained by how far a given structure sits from the raw acoustic signal and over what timescale the input has to be integrated.
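For readers unfamiliar with layerwise probing, the general idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the activations are synthetic stand-ins generated by a hypothetical `synthetic_hidden_states` helper, the probe is a simple ridge-regression classifier, and all function names are invented for this example. In a real study the same loop would be repeated for each saved training checkpoint, with activations taken from the actual model.

```python
import numpy as np

def probe_accuracy(train_x, train_y, test_x, test_y, n_classes, l2=1e-2):
    """Fit a ridge-regression probe on one-hot targets; return test accuracy."""
    def with_bias(x):
        # Append a constant column so the probe is affine rather than purely linear.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    a_train, a_test = with_bias(train_x), with_bias(test_x)
    targets = np.eye(n_classes)[train_y]
    # Closed-form ridge solution: W = (A^T A + l2 * I)^-1 A^T Y
    gram = a_train.T @ a_train + l2 * np.eye(a_train.shape[1])
    weights = np.linalg.solve(gram, a_train.T @ targets)
    predictions = (a_test @ weights).argmax(axis=1)
    return float((predictions == test_y).mean())

def layerwise_probe(hidden_states, labels, n_classes, train_fraction=0.8):
    """hidden_states has shape (n_layers, n_frames, dim); returns one accuracy per layer."""
    split = int(train_fraction * labels.shape[0])
    return [
        probe_accuracy(layer[:split], labels[:split], layer[split:], labels[split:], n_classes)
        for layer in hidden_states
    ]

def synthetic_hidden_states(n_layers=4, n_frames=600, dim=16, n_classes=3, seed=0):
    """Hypothetical toy activations: the label signal grows with layer index."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n_frames)
    class_means = rng.normal(size=(n_classes, dim))
    layers = []
    for layer_index in range(n_layers):
        signal_strength = layer_index / (n_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        noise = rng.normal(size=(n_frames, dim))
        layers.append(signal_strength * 3.0 * class_means[labels] + noise)
    return np.stack(layers), labels

hidden_states, labels = synthetic_hidden_states()
accuracies = layerwise_probe(hidden_states, labels, n_classes=3)
for layer_index, accuracy in enumerate(accuracies):
    print(f"layer {layer_index}: probe accuracy {accuracy:.2f}")
```

Plotting such per-layer accuracies for each checkpoint is one way to obtain the kind of layerwise patterns and learning trajectories the paper describes. The choice of probe and of linguistic labels is left open here because the abstract does not specify them.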
The more actionable claim for anyone designing training recipes is about the objective itself. The team reports that "the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures," with higher-order prediction tasks, meaning iteratively refined pseudo-labels of the sort HuBERT uses, producing more parallelism across the network. That is a lever, not just an observation.
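To make "iteratively refined pseudo-labels" concrete: as HuBERT is generally described, the first round of prediction targets comes from k-means clustering of acoustic features, and later rounds re-cluster representations from the model trained on the previous targets. The sketch below shows only that loop structure. It is an assumption-laden toy: `toy_train_and_encode` is a hypothetical stand-in that fits a least-squares readout instead of training a masked-prediction Transformer, the input frames are random numbers, and none of the names come from the paper or from any library.

```python
import numpy as np

def kmeans_labels(features, n_clusters, n_steps=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per frame."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=n_clusters, replace=False)]
    for _ in range(n_steps):
        distances = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=1)
        for k in range(n_clusters):
            members = features[assignments == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return assignments

def toy_train_and_encode(inputs, targets, n_clusters):
    """Hypothetical stand-in for 'train a model on the targets, then extract features'.

    Fits an affine least-squares map from inputs to one-hot targets and returns
    its outputs, so the returned representation is shaped by the pseudo-labels.
    """
    design = np.hstack([inputs, np.ones((len(inputs), 1))])
    one_hot = np.eye(n_clusters)[targets]
    weights, *_ = np.linalg.lstsq(design, one_hot, rcond=None)
    return design @ weights

def refine_pseudo_labels(inputs, n_clusters, n_iterations=2):
    """Return the pseudo-labels from the initial clustering and each refinement round."""
    targets = kmeans_labels(inputs, n_clusters)  # round 0: cluster the input features
    history = [targets]
    for iteration in range(n_iterations):
        representations = toy_train_and_encode(inputs, targets, n_clusters)
        # Re-cluster the learned representations to get the next round of targets.
        targets = kmeans_labels(representations, n_clusters, seed=iteration + 1)
        history.append(targets)
    return history

rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 12))  # stand-in for acoustic feature frames
label_history = refine_pseudo_labels(frames, n_clusters=5)
print([np.bincount(labels, minlength=5).tolist() for labels in label_history])
```

The point of the loop is that each round's targets are defined on learned representations rather than directly on the signal, which is what makes the objective "higher-order" in the sense the article uses.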
The honest caveat is what the abstract does not give you: it does not name which specific linguistic structures follow which trajectories, does not attach numbers to "early" or "late," and rests on six models in one language, spoken Dutch. Anyone tempted to redesign a probing pipeline or pick a fine-tuning layer off the summary alone should wait for the full paper.
Still, the direction is useful. If the choice of pre-training objective really does reshape where inside a speech model different kinds of linguistic knowledge live, then the interesting design work for the next round of self-supervised audio models is not only scale and data, but also what you ask the model to predict.
Originally reported by arxiv.org
Original headline: Tracking the emergence of linguistic structure in self-supervised models learning from speech