Muon Beats AdamW Under One-Step Delay in Async LLM Training

TL;DR

  • A new paper argues that one-step gradient delay in async pipeline pretraining is not a fundamental barrier; how much it hurts depends strongly on optimizer choice.
  • AdamW shows severe degradation under one-step delay, while Muon stays robust in identical conditions.
  • The authors test up to 10B parameters, propose an Error Feedback-inspired correction, and report convergence guarantees for Muon with and without it.

A new paper on arXiv takes a swing at one of the more frustrating tradeoffs in large-model pretraining, and the answer it lands on is unexpectedly simple. The problem might not be the gradients. It might be the optimizer.

Asynchronous pipeline parallelism is attractive because it eliminates the GPU idle time you get from pipeline bubbles in synchronous training. The cost is one-step gradient delay: each update is applied using a gradient computed against weights that are one step old. The conventional read on that delay is that it hurts training enough to be impractical for serious work. Philip Zmushko and colleagues argue this conclusion is optimizer-specific rather than fundamental. In their experiments, AdamW experiences severe degradation under one-step delay, while Muon exhibits strong robustness under identical conditions.
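To make "one-step delay" concrete, here is a minimal toy sketch, not the paper's setup: plain gradient descent on the quadratic f(w) = 0.5 * w^2, where each update uses the gradient evaluated at weights from one step earlier. It involves neither AdamW nor Muon and says nothing about their relative robustness. The function name `delayed_gd` and all of its parameters are hypothetical, chosen only for illustration.

```python
def delayed_gd(grad_fn, w0, lr, steps, delay=1):
    """Gradient descent where the update at step t uses the gradient
    evaluated at the weights from `delay` steps earlier.

    Hypothetical helper for illustration; delay=0 is ordinary
    (synchronous) gradient descent.
    """
    history = [w0]  # history[t] holds the weights before update t
    w = w0
    for t in range(steps):
        # Before `delay` updates have happened, fall back to the initial weights.
        stale_w = history[max(0, t - delay)]
        w = w - lr * grad_fn(stale_w)
        history.append(w)
    return history


def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2.
    return w


# Same learning rate in both runs; only the staleness of the gradient differs.
sync = delayed_gd(grad, w0=1.0, lr=1.5, steps=20, delay=0)
delayed = delayed_gd(grad, w0=1.0, lr=1.5, steps=20, delay=1)

print("final |w|, no delay:      ", abs(sync[-1]))
print("final |w|, one-step delay:", abs(delayed[-1]))
```

With this deliberately aggressive learning rate, the undelayed run shrinks toward zero while the delayed run oscillates with growing amplitude. That is the textbook way staleness can destabilize an update rule; how a given optimizer reshapes the update under delay is the question the paper examines.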

They then go further and propose an optimizer-agnostic, Error Feedback-inspired correction to mitigate delay effects, and they report theoretical convergence guarantees for Muon with and without that correction. The empirical scale matters: experiments run up to 10B parameters using PipeDream-2BW, and the authors claim their strategies close the performance gap with synchronous training.

The honest caveat is that 10B is the scale tested, not frontier scale, and a single paper's optimizer comparison is the start of a conversation rather than the settled answer. What the reporting doesn't tell you is how much wall-clock speedup this yields in practice, or whether the same robustness transfers to optimizer families beyond Muon and AdamW.

If it holds up under replication, the practical winners are teams running pipeline-parallel hardware where bubble time is genuinely the bottleneck. Academic groups, smaller labs, and anyone who can't simply throw more synchronous tensor-parallel hardware at the problem stand to recover real compute from a setting that has, until now, been treated as a curiosity.