Cornell-Mila paper: taper MLP widths for free perplexity gains
TL;DR
- A cosine MLP width taper from 1.5x to 0.5x cut perplexity from 16.28 to 14.44 on a 440M transformer with identical parameters and FLOPs.
- Gains held across all four tested architectures (Transformer, Gated Attention, Hope-attention, Titans) at both 760M and 1.3B scales.
- Later MLP layers show increasing output alignment with the residual stream, mechanistically explaining why front-loading capacity helps.
The standard practice since the original transformer has been to allocate the same parameter budget to every layer of a model, a default that has remained largely unexamined as models scaled by orders of magnitude. A new paper from researchers at Cornell and Mila challenges that convention with a strikingly simple finding: smoothly shrinking MLP width from early to late layers, while keeping the total parameter count and compute budget identical, consistently improves perplexity and downstream benchmark accuracy.
The paper introduces what the authors call Tapered Language Models (TLMs). The key intervention is replacing the constant intermediate dimension of each MLP layer with a per-layer width that decreases monotonically with depth, following a cosine decay schedule. On a 440M Transformer, a cosine taper starting at 1.5x the baseline MLP width and ending at 0.5x lowered in-distribution validation perplexity from 16.28 to 14.44, a reduction of 1.84 points (about 11%) with no added parameters or FLOPs. The authors swept three schedules (linear, cosine, and sigmoid) and five width ratios; cosine at 1.5x/0.5x came out on top.
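To make the schedule concrete, here is a minimal Python sketch of a cosine width taper. The article does not give the paper's exact formula, so the half-cosine interpolation below is an assumption, as are the function name `cosine_taper_widths`, the `multiple_of` rounding, and the 24-layer, 4096-wide example model. One property worth noting: a half-cosine running from 1.5x to 0.5x averages exactly 1.0x across layers, so the summed MLP width (and with it MLP parameters and FLOPs, which scale linearly with width) matches the uniform baseline up to rounding.

```python
import math

def cosine_taper_widths(n_layers, base_width, start=1.5, end=0.5, multiple_of=1):
    """Per-layer MLP widths falling from start*base_width to end*base_width.

    Assumed form (not taken from the paper): a half-cosine in layer depth.
    """
    if n_layers < 2:
        raise ValueError("need at least two layers to taper")
    widths = []
    for layer in range(n_layers):
        t = layer / (n_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        scale = end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))
        # Round to a hardware-friendly multiple; this can shift the budget slightly.
        widths.append(round(scale * base_width / multiple_of) * multiple_of)
    return widths

# Hypothetical model: 24 layers, baseline MLP width 4096.
widths = cosine_taper_widths(n_layers=24, base_width=4096, multiple_of=64)
budget_ratio = sum(widths) / (24 * 4096)  # tapered vs. uniform total MLP width
print(widths[0], widths[-1], round(budget_ratio, 4))
```

Other start/end pairs keep the budget fixed only if they average to 1.0x, which is one reason to check `budget_ratio` after rounding.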
What makes this more than a transformer-specific trick is the breadth of the evaluation. The same configuration was applied to four architectures (standard Transformer, Gated Attention, Hope-attention, and Titans) at 760M and 1.3B parameter scales, trained on 50B and 100B tokens respectively. Tapering improved both average commonsense accuracy and LAMBADA perplexity in all eight architecture-scale combinations.
The mechanistic explanation the authors offer is direct: later MLP layers produce outputs increasingly aligned with the residual stream, reinforcing what is already there rather than computing new features. Measuring cosine similarity between MLP outputs and the incoming residual stream across the GPT-2 family, they find that this alignment rises with layer depth, with Pearson correlations ranging from r=0.49 to r=0.71. Tapering moves capacity from where it is least used to where it contributes most.
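The measurement described above can be sketched in a few lines of Python. This is an illustration of the two statistics involved (cosine similarity per layer, then Pearson correlation against depth), not the authors' analysis code. The helper names are hypothetical, and the vectors are toy values chosen to show the mechanics, not data from the paper. A real analysis would presumably average the similarity over many token positions before correlating.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def depth_alignment_correlation(residuals, mlp_outputs):
    """residuals[i] and mlp_outputs[i] are the vectors at layer i."""
    alignments = [cosine_similarity(r, m) for r, m in zip(residuals, mlp_outputs)]
    depths = list(range(len(alignments)))
    return alignments, pearson_r(depths, alignments)

# Toy 2-d vectors (illustrative only): the MLP output rotates toward the
# residual direction as depth increases.
residuals = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mlp_outputs = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
alignments, r = depth_alignment_correlation(residuals, mlp_outputs)
print(alignments, r)
```

A positive `r` here means alignment grows with depth, which is the pattern the article attributes to later MLP layers.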
The honest caveat is that the optimal schedule and width ratio were selected on a 440M Transformer and applied unchanged to all other experiments, so the authors themselves note the chosen configuration is best read as a robust default rather than a global optimum. What the paper does not address is whether the benefits hold at frontier scales well above 1.3B parameters, or whether tapering interacts with MoE routing. Still, for any team about to initialize a pretraining run on a standard architecture, this looks like a genuinely free improvement.
Originally reported by huggingface.co
Original headline: Cornell and Mila Introduce Tapered Language Models: Cosine MLP Width Tapering Improves Perplexity Across Four Architectures at Zero Additional Compute