arxiv.org via Hacker News

Arxiv Study: One Transformer Layer Rivals Full-Parameter RL

TL;DR

  • A new arxiv paper claims training a single transformer layer can recover most of the gains from full-parameter RL post-training.
  • The pattern held across seven models spanning Qwen3 and Qwen2.5, three RL algorithms (GRPO, GiGPO, Dr. GRPO), and math, coding and agentic tasks.
  • High-contribution layers cluster in the middle of the transformer stack, while layers near the input and output ends contribute substantially less.

A new arxiv paper argues that most of the work in RL post-training of large language models is being done by a very small slice of the network — often, the authors claim, a single transformer layer sitting somewhere in the middle of the stack. Training just that one layer, they report, recovers most of the gains of full-parameter RL, and in some runs even beats it.

This is interesting because of the assumption it challenges. As the abstract puts it, existing approaches update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. The authors (Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Hongzhou Lin, and Mingyi Hong) introduce a metric they call layer contribution, defined as the fraction of full RL improvement recovered by training a layer in isolation. Using that lens, they say the gains from RL are heavily concentrated, and the same shape shows up whether they run GRPO, GiGPO, or Dr. GRPO, across seven Qwen3 and Qwen2.5 models, on math, code, and agentic decision-making tasks. High-contribution layers cluster in the middle, and the layer rankings stay strongly correlated across datasets, tasks, model families, and RL algorithms.
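To make the metric concrete, here is a minimal sketch of how a "fraction of full RL improvement recovered" number could be computed from three benchmark scores. The exact formula is an assumption read off the one-line definition in the abstract, the function and variable names are hypothetical, and the scores are made up for illustration rather than taken from the paper.

```python
from typing import Dict, List


def layer_contribution(base: float, full_rl: float, single_layer: float) -> float:
    """Fraction of the full-RL improvement recovered by training one layer alone.

    Assumed form: (single_layer - base) / (full_rl - base).
    """
    full_gain = full_rl - base
    if full_gain == 0:
        raise ValueError("full RL run shows no improvement over the base model")
    return (single_layer - base) / full_gain


def rank_layers(base: float, full_rl: float, per_layer: Dict[int, float]) -> List[int]:
    """Layer indices sorted from highest to lowest contribution."""
    return sorted(
        per_layer,
        key=lambda i: layer_contribution(base, full_rl, per_layer[i]),
        reverse=True,
    )


# Made-up scores for illustration only; these are NOT results from the paper.
base_score = 40.0       # base model before RL
full_rl_score = 60.0    # after full-parameter RL
single_layer_scores = {0: 42.0, 6: 58.0, 11: 44.0}  # layer index -> score

contributions = {
    i: layer_contribution(base_score, full_rl_score, s)
    for i, s in single_layer_scores.items()
}
ranking = rank_layers(base_score, full_rl_score, single_layer_scores)
```

Under this reading, a value near 1 means a single layer recovers nearly all of the full-RL gain, and a value above 1 corresponds to the runs where single-layer training beats full-parameter RL.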

Why this matters even if you are not doing the training yourself: the cost of post-training an open-weight model with RL is a big part of why the strongest fine-tuned Qwen and Llama variants sit inside well-funded labs. If a single middle layer really is doing most of the work, gradients and optimizer state are needed for only that layer's parameters, so the compute and memory footprint of a serious RL run should drop, and small teams get a much shorter path to a competitive checkpoint. It also gives interpretability researchers a concrete place to look: the layers where the metric spikes are, by the metric's definition, the ones where training alone recovers most of RL's gain.
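As a rough back-of-the-envelope illustration of the footprint argument, the sketch below counts what share of parameters would be trainable if only one layer were updated, and how many optimizer-state values an Adam-style optimizer would keep (two moment estimates per trainable parameter). The parameter names follow a common Hugging Face-style `model.layers.<i>.` convention, which is an assumption here, and the toy model and its sizes are hypothetical rather than a real Qwen configuration.

```python
import re
from typing import Dict

# Assumed naming convention: per-layer parameters contain ".layers.<index>."
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def trainable_params(param_counts: Dict[str, int], layer_idx: int) -> int:
    """Number of parameters that belong to the single layer being trained."""
    trainable = 0
    for name, count in param_counts.items():
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) == layer_idx:
            trainable += count
    return trainable


def trainable_fraction(param_counts: Dict[str, int], layer_idx: int) -> float:
    """Share of all parameters that would receive gradient updates."""
    return trainable_params(param_counts, layer_idx) / sum(param_counts.values())


def adam_state_values(param_counts: Dict[str, int], layer_idx: int) -> int:
    """Adam keeps two moment estimates per trainable parameter."""
    return 2 * trainable_params(param_counts, layer_idx)


# Hypothetical toy model: one embedding table plus four identical layers.
toy_counts = {"model.embed_tokens.weight": 1000}
for i in range(4):
    toy_counts[f"model.layers.{i}.self_attn.weight"] = 250
    toy_counts[f"model.layers.{i}.mlp.weight"] = 250

fraction = trainable_fraction(toy_counts, layer_idx=2)
state_single = adam_state_values(toy_counts, layer_idx=2)
state_full = 2 * sum(toy_counts.values())
```

This only accounts for gradients and optimizer state. The frozen weights still have to be held in memory for the forward pass, and the backward pass still has to run through every layer above the trained one, so the real saving depends on the setup.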

The honest caveat is that this is one paper, all of its experiments are on the Qwen family, and the abstract does not give specific accuracy percentages or the winning layer index per model. Whether the middle-layer pattern generalizes to Llama, Mistral, or larger MoE architectures is an open question, and it is exactly the kind of generalization on which earlier 'this trick is universal' claims have broken. The reporting also does not yet say how single-layer updates interact with LoRA-style PEFT or with quantized training, which is where a lot of practical fine-tuning already lives.

The part worth watching is whether the layer-contribution metric gets picked up as a diagnostic, independent of whether teams actually restrict training to one layer. Even as a lens on what RL is doing to a model, it is a cleaner instrument than staring at loss curves.