State-Prediction Separation paper claims 2-3pp transformer gain
TL;DR
- The paper argues standard transformers waste capacity by using one forward stream to both predict the next token and store state for future tokens.
- A two-stream variant outperformed standard transformers by 2-3 percentage points on average on downstream tasks across pretraining scales.
- The authors report better data and compute efficiency, and say their empirical analysis rules out potential confounders and shows a fundamental difference in the gradients the design produces.
A short new preprint makes a clean argument about where standard transformers might be leaving performance on the table. The forward pass, the authors say, does two jobs at once: predicting the next token and storing useful state that future token predictions will need. They claim that disentangling those two roles into separate computation streams gives a measurable win.
The arXiv paper, from Giovanni Monea, Nathan Godey, Kianté Brantley, and Yoav Artzi, reports that their two-stream transformer variant outperformed standard transformers by 2 to 3 percentage points on average on downstream tasks, while also improving data and compute efficiency. The authors say they ran pretraining experiments across various scales, and that their empirical analysis rules out potential confounders and shows a fundamental difference in the gradients the design produces.
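To make the "two jobs" framing concrete, here is a toy sketch in plain Python. It is not the authors' architecture, which the abstract does not specify. It only illustrates one way the roles could be separated, under the assumption that a state stream is what later positions read, while a prediction stream reads from that state and feeds the logits without being read by anything downstream. Every name here (`causal_attend`, `single_stream`, `two_stream`, `W_state`, `W_pred`, `W_out`) is hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weight matrix; stands in for learned parameters."""
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def causal_attend(queries, memory):
    """Softmax attention in which position t only reads memory[0..t]."""
    out = []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, memory[j])) for j in range(t + 1)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[j] * memory[j][k] for j in range(t + 1)) / total
            for k in range(len(memory[0]))
        ])
    return out


def single_stream(xs, W_out):
    """Baseline: one vector per position serves both roles.

    `hidden` feeds the next-token logits here, and in a deeper stack it is
    also what later positions would attend to in the next layer.
    """
    hidden = causal_attend(xs, xs)
    logits = [matvec(W_out, h) for h in hidden]
    return hidden, logits


def two_stream(xs, W_state, W_pred, W_out):
    """Assumed split (illustrative only, not the paper's design).

    The state stream is the only thing positions attend to. The prediction
    stream queries that state and produces logits, and nothing reads it back.
    """
    state_in = [matvec(W_state, x) for x in xs]
    state = causal_attend(state_in, state_in)
    pred_queries = [matvec(W_pred, x) for x in xs]
    pred = causal_attend(pred_queries, state)
    logits = [matvec(W_out, p) for p in pred]
    return state, logits
```

Calling `two_stream` on a list of `T` input vectors of width `d`, with `W_state` and `W_pred` of shape `d x d` and `W_out` of shape `vocab x d`, returns `T` state vectors and `T` logit vectors. The point of the sketch is only the wiring: in `single_stream` the same `hidden` vector has to be good for both the logits and for later readers, whereas in `two_stream` those demands land on different vectors.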
Why this is interesting even if you never touch pretraining code: a claim of the form "same compute budget, different topology, a couple of points on evals" is exactly the kind of finding labs move fast to reproduce. If it survives, it could slot into existing training runs without a bigger cluster, and the groups that stand to gain the most are the smaller ones, for whom two or three points on a benchmark is the difference between a competitive model and an also-ran.
The honest caveat is that this is a preprint abstract, not a peer-reviewed benchmark sweep. From what's on the arXiv page, I can't pin down which downstream tasks improved, what the largest scale tested was, or what running two streams instead of one costs at inference time. Take the numbers as reported, not settled. But the underlying idea, that a single forward stream is being asked to do incompatible things and paying for it, is the kind of clean structural claim other labs will want to test quickly.
Original headline: Cornell's State-Prediction Split Gains 2-3pp Across Transformer Benchmarks at No Extra Compute