Bertran, Roth, Wu: Compression Predicts ML Generalization
TL;DR
- ML strategies that generalize well can be described in very few tokens, Bertran, Roth, and Wu argue.
- A reproducer agent given only a brief prompt successfully replicated high-performance models found by a full exploration agent.
- The framework was tested across 8 datasets covering tabular, vision, language, diffusion, and reward modeling tasks.
One of the quiet puzzles in ML research is that benchmark leaderboards should, by now, be thoroughly gamed. Researchers adaptively probe the same held-out test sets over and over, and classical statistics suggests this should erode generalization. Mostly, it hasn't. A new paper on arXiv from Martin Andres Bertran, Aaron Roth, and Zhiwei Steven Wu proposes a reason: strategies that actually work are highly compressible, and compression, it turns out, implies generalization.
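The link between compression and generalization can be illustrated with the textbook finite-class (Occam) bound. This is a sketch of the standard argument, not necessarily the bound the paper proves; the symbols $k$ (description length in bits), $n$ (held-out sample size), $\delta$ (failure probability), $L$ (true loss), and $\widehat{L}_n$ (held-out loss) are assumptions introduced here for illustration.

```latex
% Assumptions: every strategy h is encoded by a k-bit string, so there are
% at most 2^k strategies; losses lie in [0,1]; the n held-out examples are i.i.d.
\begin{align*}
\Pr\big[\,|\widehat{L}_n(h) - L(h)| > \varepsilon\,\big]
  &\le 2e^{-2n\varepsilon^2}
  && \text{(Hoeffding, one fixed } h\text{)} \\
\Pr\big[\,\exists h:\ |\widehat{L}_n(h) - L(h)| > \varepsilon\,\big]
  &\le 2^{k}\cdot 2e^{-2n\varepsilon^2}
  && \text{(union bound over } 2^k \text{ strategies)} \\
\text{with probability} \ge 1-\delta:\quad
|\widehat{L}_n(h) - L(h)|
  &\le \sqrt{\frac{k\ln 2 + \ln(2/\delta)}{2n}}
  && \text{for all } h \text{ simultaneously.}
\end{align*}
```

Because the bound holds for every describable strategy at once, it still applies when the strategy was chosen adaptively after looking at held-out scores. A prompt of $t$ tokens over a vocabulary of size $V$ carries at most $t\log_2 V$ bits, so a short prompt means a small $k$. As a rough example, $k = 1000$, $n = 10{,}000$, and $\delta = 0.05$ give a gap of about $0.19$.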
The team tested this idea with LLM-driven research agents across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling. They used two information bottleneck approaches. In the first, an exploration agent searches for high-performance models using a validation set, while a separate reproducer agent tries to replicate those results using only a brief prompt and training data. In the second, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. In both cases the bottleneck was not a barrier: short prompts were enough to reproduce high-performance models, and one-bit feedback was enough to find them.
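The one-bit feedback setup can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `one_bit_feedback_search`, `propose_learning_rate`, and `hidden_validation_score` are hypothetical names, and the toy "model" is just a learning rate scored by a made-up function.

```python
import math
import random


def one_bit_feedback_search(propose, validate, rounds, seed=0):
    """Search where the explorer learns only one bit per submission.

    propose(rng, history) -> candidate   (hypothetical explorer policy)
    validate(candidate)   -> float       (hidden validation score, higher is better)
    """
    rng = random.Random(seed)
    best_candidate, best_score = None, float("-inf")
    history = []  # (candidate, improved) pairs; raw scores are never stored

    for _ in range(rounds):
        candidate = propose(rng, history)
        score = validate(candidate)      # visible only to the oracle side
        improved = score > best_score    # the single bit that is revealed
        if improved:
            best_candidate, best_score = candidate, score
        history.append((candidate, improved))

    return best_candidate, history


# Toy stand-ins, assumed purely for illustration.
def propose_learning_rate(rng, history):
    # Log-uniform sample in [1e-4, 1]; a real agent would condition on `history`.
    return 10 ** rng.uniform(-4, 0)


def hidden_validation_score(learning_rate):
    # Made-up score that peaks at a learning rate of 0.01.
    return -abs(math.log10(learning_rate) + 2)


best, history = one_bit_feedback_search(
    propose_learning_rate, hidden_validation_score, rounds=20, seed=1
)
print(f"best learning rate: {best:.4g}")
print(f"feedback revealed to the explorer: {len(history)} bits")
```

Since the explorer sees one bit per round, a run of T rounds reveals at most T bits about the validation set, however adaptive the proposals are.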
The hypothesis is falsifiable, and it passed a test it could have failed: when the researchers deliberately induced overfitting, the reproducer agent failed to replicate results from compressed prompts. The implication is that successful strategies occupy a low-complexity region of strategy space, which would explain why benchmark-driven ML development has proven more robust than classical statistics would suggest.
The honest caveat is that 8 datasets is a narrow base for a broad claim. The paper doesn't specify what token count constitutes "few" or how the effect scales to more complex domains, and it leaves open whether the same pattern holds for human researchers doing benchmark hunting rather than LLM agents. Benchmark designers get neither a practical threshold nor a ready-made recipe.
For AutoML teams and anyone building iterative ML pipelines, the bottleneck findings are the most actionable part: if a strategy requires a long description to communicate, that complexity may itself be a warning sign of overfitting.
Shared on Bluesky by 2 AI experts
- Modern LLMs are incredibly good compression algorithms, which can shed light on why autonomous data science agents don't overfit as much as you might think. arxiv.org/abs/2606.11045
Originally reported by arxiv.org
Original headline: What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents