
New arXiv paper argues LLM test-time sampling saturates within a few dozen draws

TL;DR

  • The paper argues that majority voting saturates within a few dozen draws, a limit the authors call the modal ceiling.
  • The correlation ceiling hits even earlier for benchmark scoring, so extra samples add compute cost without lifting the score.
  • The authors' framing is that the bottleneck is recognizing a right answer, not generating one.

A new arXiv paper from Yong Yi Bay and Kathleen A. Yearick argues that the test-time scaling story labs are pouring compute into has a much lower ceiling than the standard 'more samples equals more capability' framing suggests. Coverage, which the authors define as the fraction of problems with at least one correct try, does keep climbing as you draw more samples. But a deployed system has to return a single answer, and that selection step is where the wall sits.

The paper calls the wall the identifiability gap: the distance between the answers a model can produce and the answers it can pick. Two ceilings follow from it. The modal ceiling is the point where majority voting stops improving, which the authors say is reached 'within a few dozen draws'. The correlation ceiling, which caps benchmark scoring, hits sooner still. Past those points, they argue, extra draws only cost compute and 'can even make the answer worse', because the votes concentrate on what the paper bluntly calls a confident mistake.
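
To make the difference between coverage and selection concrete, here is a minimal toy simulation. It is not taken from the paper: the two problems and their answer probabilities are made-up assumptions, as are the names `sample_answers`, `majority_vote` and `evaluate`. On one problem the right answer is the most likely one; on the other, a wrong answer is.

```python
import random
from collections import Counter

# Toy answer distributions (made-up numbers, not from the paper).
PROBLEMS = [
    {"correct": 0.6, "wrong_a": 0.3, "wrong_b": 0.1},  # right answer is the mode
    {"correct": 0.2, "wrong_a": 0.5, "wrong_b": 0.3},  # a wrong answer is the mode
]


def sample_answers(probs, n, rng):
    """Draw n answers from a categorical distribution over answer labels."""
    labels = list(probs)
    weights = [probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n)


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def evaluate(problems, n, trials, seed=0):
    """Estimate (coverage, majority-vote accuracy) at n draws per problem."""
    rng = random.Random(seed)
    covered = 0
    voted_right = 0
    for _ in range(trials):
        for probs in problems:
            answers = sample_answers(probs, n, rng)
            covered += "correct" in answers                    # at least one correct try
            voted_right += majority_vote(answers) == "correct"  # the answer actually served
    total = trials * len(problems)
    return covered / total, voted_right / total


for n in (1, 4, 16, 64, 256):
    coverage, vote_acc = evaluate(PROBLEMS, n, trials=2000)
    print(f"n={n:4d}  coverage={coverage:.3f}  majority_vote={vote_acc:.3f}")
```

In this toy setup coverage heads toward 1.0 as n grows, while majority-vote accuracy settles near 0.5: voting locks in the right answer on the first problem and the wrong one on the second. On the second problem alone, more draws make the served answer worse, which is the 'confident mistake' pattern in miniature.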

Why this matters if you are running or paying for a reasoning system: best-of-N and self-consistency have quietly become expensive default settings in a lot of production stacks, and the industry's assumption has been that pushing samples higher keeps buying accuracy. If the ceiling for selection really is a few dozen draws, then runs stretching into the hundreds are burning compute for a coverage number that never turns into a served answer. As the cutoff, the authors propose collapsing the decision into a single quantity, an 'effective number of samples' that any sampling run already reveals.
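
The summary above does not give the paper's formula for its 'effective number of samples', so the sketch below uses a stand-in: the textbook effective sample size for n equally correlated draws, n / (1 + (n - 1) * rho). Both the function name `effective_samples` and the correlation value `rho = 0.05` are assumptions for illustration, not the authors' definition or a measured figure. The point is only to show how correlation between draws can put a hard ceiling on what extra samples buy.

```python
def effective_samples(n, rho):
    """Textbook effective sample size for n draws with pairwise correlation rho.

    Stand-in quantity, NOT the paper's definition. For rho > 0 the value
    stays below 1 / rho no matter how large n gets.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


RHO = 0.05  # assumed pairwise correlation between samples (illustrative only)

for n in (1, 8, 32, 128, 1024):
    print(f"n={n:5d}  effective={effective_samples(n, RHO):6.2f}  ceiling={1 / RHO:.0f}")
```

With the assumed correlation of 0.05, the effective count can never exceed 20, however many raw samples are drawn. A cutoff rule in this spirit would stop sampling once the effective count has flattened out.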

The honest caveat is that this is a compact theoretical framing rather than a broad empirical sweep. The abstract does not pin down which models or benchmarks were used to fit the ceilings, so the 'few dozen' figure should be read as the paper's claim rather than as a settled constant. Nor does the abstract say how these ceilings move when a strong external verifier or process reward model is layered on top, which is where a lot of current test-time compute spending is actually going.

The direction worth watching, whether or not the exact ceilings hold, is the reframing the authors close on: the bottleneck is recognizing a right answer, not generating one. If that lands, the next round of test-time gains has to come from better selection and verification, not from turning the sampling dial higher.