arxiv.org web signal June 29th 2026

Sparse autoencoder features reveal 'people-tuned' brain region

TL;DR

Lepori, Kay and Tuckute introduce 'Augmented Sparse Encoding Models,' swapping dense LM hidden states for hierarchically organized sparse autoencoder features plus surprisal.
Using 7T fMRI from eight participants hearing 200 sentences, they identify a previously uncharacterized voxel population tuned to people-related content.
Frontal regions of the fronto-temporal language network are relatively well explained by surprisal alone, even without LM-based features.

Most studies that line up brain activity with large language models end up comparing one opaque system to another. A new paper from Michael Lepori, Kendrick Kay and Greta Tuckute, posted to arXiv, tries to fix that by giving the LM side an interpretable substrate. The authors introduce 'Augmented Sparse Encoding Models,' a framework that replaces the dense hidden states of a language model with hierarchically organized sparse autoencoder features and adds surprisal (how unexpected each word is given its context) as an explicit predictor.
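To make the idea concrete, here is a minimal sketch of how a design matrix of that kind might be assembled. It is not the authors' pipeline: the single-layer ReLU autoencoder, the random weights, the array sizes and the names `sae_encode` and `build_design_matrix` are all assumptions for illustration, and the hierarchical organization of features described in the paper is not modeled.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc, b_dec):
    """Map dense hidden states to non-negative sparse-autoencoder activations.

    Assumes a plain one-layer ReLU encoder; the paper's actual autoencoder
    is not specified in the abstract.
    """
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)

def build_design_matrix(hidden_states, surprisal, W_enc, b_enc, b_dec):
    """Stack sparse features and surprisal into one predictor matrix."""
    feats = sae_encode(hidden_states, W_enc, b_enc, b_dec)
    return np.column_stack([feats, surprisal])

# Synthetic stand-ins: one row per sentence (200, as in the dataset).
rng = np.random.default_rng(0)
n_sentences, d_model, n_features = 200, 16, 64           # hypothetical sizes
hidden_states = rng.normal(size=(n_sentences, d_model))  # fake LM states
surprisal = rng.gamma(shape=2.0, scale=2.0, size=n_sentences)  # fake values

W_enc = rng.normal(size=(d_model, n_features))  # untrained, illustrative only
b_enc = np.full(n_features, -4.0)               # negative bias keeps most units off
b_dec = np.zeros(d_model)

X = build_design_matrix(hidden_states, surprisal, W_enc, b_enc, b_dec)
active_fraction = (X[:, :-1] > 0).mean()
print(X.shape, round(float(active_fraction), 3))
```

Each column of `X` except the last is one candidate interpretable feature, and the last column is surprisal, so a regression weight on any column can be read as a statement about that one predictor.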

The setup is a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences. As a sanity check, the framework recovers earlier findings (voxel populations tuned to processing difficulty and to meaning abstractness) and then goes a step further. The authors report a previously uncharacterized but reliable voxel population that, in their words, 'is tuned to people-related content.' They also find that the fronto-temporal language network is predicted by 'a common set of features across its constituent regions,' while frontal regions are relatively well explained by surprisal alone, even without the LM-based features.
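The comparison between surprisal alone and the full feature set can be sketched as a cross-validated encoding model. The example below uses closed-form ridge regression on synthetic data. Ridge is a common choice in encoding work, but the abstract does not say which regression the authors used, and the two simulated voxels, the noise levels and the helper names are assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights: (X^T X + alpha I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_correlation(X, y, alpha=10.0, n_folds=5):
    """Correlation between held-out predictions and measured responses."""
    n = X.shape[0]
    preds = np.empty(n)
    for test_idx in np.array_split(np.arange(n), n_folds):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # Center using training data only, so the test fold stays unseen.
        mu_x, mu_y = X[train_idx].mean(axis=0), y[train_idx].mean()
        w = ridge_fit(X[train_idx] - mu_x, y[train_idx] - mu_y, alpha)
        preds[test_idx] = (X[test_idx] - mu_x) @ w + mu_y
    return np.corrcoef(preds, y)[0, 1]

rng = np.random.default_rng(1)
n_sentences, n_features = 200, 20                  # hypothetical sizes
features = np.maximum(0.0, rng.normal(size=(n_sentences, n_features)))
surprisal = rng.normal(size=n_sentences)           # z-scored, synthetic

# Two simulated voxels: one driven by surprisal, one by a single feature.
frontal_like = surprisal + 0.5 * rng.normal(size=n_sentences)
content_like = 2.0 * features[:, 0] + 0.5 * rng.normal(size=n_sentences)

X_surp = surprisal[:, None]                        # surprisal-only model
X_full = np.column_stack([features, surprisal])    # features plus surprisal

r_frontal_surp = cv_correlation(X_surp, frontal_like)
r_frontal_full = cv_correlation(X_full, frontal_like)
r_content_surp = cv_correlation(X_surp, content_like)
r_content_full = cv_correlation(X_full, content_like)
print(r_frontal_surp, r_frontal_full, r_content_surp, r_content_full)
```

In this toy setting the surprisal-driven voxel is already well predicted by the one-column model, while the feature-driven voxel needs the sparse features. That is the shape of the contrast the article describes for frontal regions versus content-tuned voxel populations.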

The broader claim the paper wants you to take away is about alignment itself. Brain responses are not predictable from an arbitrary set of LM features. They are best explained by the features that capture 'the most general information' in the LM's representations, which the authors read as a nontrivial correspondence rather than coincidence.

The honest caveat is the scale. A sample of eight participants and 200 sentences is standard for high-field fMRI but small for sweeping claims about how brains and LMs converge. The abstract also does not name the specific language model or sparse autoencoder used, so sensitivity to the choice of model is an open question. What the work does buy you, if it holds up, is a more legible bridge between LM internals and cortical responses than dense encoding studies have offered, which is the part both interpretability and neuroscience groups will want to push on.

Originally reported by arxiv.org

Original headline: Interpreting Brain Responses to Language with Sparse Features from Language Models