Mei and Montanari pin down random features double descent

TL;DR

  • Song Mei and Andrea Montanari analyze ridge regression on N random features, equivalent to a two-layer neural network with random first-layer weights.
  • They compute precise asymptotics of test error in the limit where N, n, and d go to infinity with N/d and n/d held fixed.
  • The authors describe their setup as the first analytically tractable model that captures all features of the double descent curve without assuming ad hoc misspecification structures.

A 2019 paper by Song Mei and Andrea Montanari, posted to arXiv and last revised in December 2020, took the 'double descent' curve out of empirical folklore and gave it a precise mathematical description. The setup is deliberately stripped down. The authors perform ridge regression on N random features of the form σ(w_a^T x), which, as they note, can be equivalently described as a two-layer neural network with random first-layer weights.
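To make the setup concrete, here is a minimal NumPy sketch of ridge regression on random features σ(w_a^T x). It is an illustration under stated assumptions, not the authors' code: the ReLU activation, the noisy linear target, the sphere radii, the 1/n scaling of the ridge objective, and every function and variable name are choices made for this example.

```python
import numpy as np

def sample_sphere(rng, m, d, radius):
    """Draw m points uniformly on the sphere of the given radius in R^d."""
    g = rng.standard_normal((m, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def random_features(X, W):
    """Map each x to (sigma(w_a^T x))_a; sigma = ReLU is an assumption."""
    return np.maximum(X @ W.T, 0.0)

def fit_ridge(Z, y, lam):
    """Ridge regression on features Z: only the second layer is fitted."""
    n, N = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(N), Z.T @ y / n)

rng = np.random.default_rng(0)
d, n, N, lam = 10, 200, 50, 1e-3       # illustrative sizes only

X = sample_sphere(rng, n, d, np.sqrt(d))   # inputs on a sphere (radius assumed)
W = sample_sphere(rng, N, d, 1.0)          # random first layer, never trained
beta = sample_sphere(rng, 1, d, 1.0)[0]    # hypothetical linear target
y = X @ beta + 0.1 * rng.standard_normal(n)

Z = random_features(X, W)                  # n x N feature matrix
a = fit_ridge(Z, y, lam)                   # second-layer weights
train_mse = float(np.mean((Z @ a - y) ** 2))
print(f"training MSE: {train_mse:.4f}")
```

The first-layer weights W are drawn once and left untouched, which is what makes the model a two-layer network with a random first layer: only the coefficients a are fitted.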

The motivation is the puzzle that modern deep learning keeps shoving in everyone's face. Neural network architectures often contain more parameters than training samples, are rich enough to interpolate labels even when those labels are pure noise, and yet still achieve small generalization error on real data. Mei and Montanari's contribution is to compute the precise asymptotics of the test error in the joint limit where the number of features N, the number of samples n, and the input dimension d all go to infinity with the ratios N/d and n/d held fixed. In that limit the test error traces the now-familiar shape: a classical U at first, a peak around the interpolation threshold where training error vanishes, then a second descent into the overparametrized regime, where the global minimum of the test error often sits.
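The shape of the curve can be probed numerically at finite size. The sketch below holds n and d fixed, sweeps the number of features N across the interpolation threshold N = n, and estimates the test error by Monte Carlo with a very small ridge penalty. It is a rough simulation, not the paper's asymptotic formula; the sizes, the ReLU activation, the noisy linear target, the penalty value, and the helper name rf_test_error are all assumptions made for illustration.

```python
import numpy as np

def sphere(rng, m, d, radius):
    """Draw m points uniformly on the sphere of the given radius in R^d."""
    g = rng.standard_normal((m, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def rf_test_error(rng, N, n, d, lam, noise=0.5, n_test=2000):
    """One draw of random-features ridge regression; returns the mean
    squared error on fresh inputs, measured against the noiseless target."""
    W = sphere(rng, N, d, 1.0)                  # random first-layer weights
    beta = sphere(rng, 1, d, 1.0)[0]            # hypothetical linear target
    X = sphere(rng, n, d, np.sqrt(d))
    X_test = sphere(rng, n_test, d, np.sqrt(d))
    y = X @ beta + noise * rng.standard_normal(n)
    Z = np.maximum(X @ W.T, 0.0)                # ReLU features (assumed)
    Z_test = np.maximum(X_test @ W.T, 0.0)
    a = np.linalg.solve(Z.T @ Z / n + lam * np.eye(N), Z.T @ y / n)
    return float(np.mean((Z_test @ a - X_test @ beta) ** 2))

rng = np.random.default_rng(0)
d, n, lam = 20, 100, 1e-6          # n/d fixed at 5; near-zero ridge penalty
N_grid = [25, 50, 100, 200, 400]   # N = 100 is the interpolation threshold

# Average a few independent draws per N to smooth out the Monte Carlo noise.
errs = [float(np.mean([rf_test_error(rng, N, n, d, lam) for _ in range(10)]))
        for N in N_grid]

for N, e in zip(N_grid, errs):
    print(f"N/d = {N / d:5.2f}   n/d = {n / d:4.2f}   test error = {e:.3f}")
```

In a run like this, the estimate at N = n should stand well above those at either end of the grid. Increasing lam damps the peak, consistent with the role of the ridge penalty in the analysis.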

The paper's claim, in the authors' own framing, is that this is the 'first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.' That last clause is doing the work. The abstract is explicit that earlier simpler settings, including linear regression with random covariates, had already exhibited elements of this behavior. What is new here is that the full curve falls out of the geometry of random features regression on the d-dimensional sphere, with no awkward assumptions bolted on to make it appear.

The honest caveat is that this is a stylized setting. The first layer is frozen at random initialization, the inputs live on a sphere, and the analysis is asymptotic in a very specific scaling. None of that maps cleanly onto a trained deep network with learned features. What the abstract does not try to address is what happens once features are learned, or how the picture changes outside this regime.

Still, as a reference point for the theory of overparametrized learning, this remains one of the cleanest closed-form results in the literature, and the kind of paper practitioners can reach for when they need to explain why an absurdly overparametrized model that seemingly should not work nevertheless does.
