NVIDIA's Nemotron-TwoTower: 2.42x faster at 98.7% quality
TL;DR
- NVIDIA's Nemotron-TwoTower splits an LM into a frozen autoregressive context tower and a trainable diffusion denoiser with bidirectional block attention.
- The system is built on Nemotron-3-Nano-30B-A3B, a 30B hybrid Mamba-Transformer MoE backbone; the diffusion tower is trained on roughly 2.1 trillion tokens.
- The authors report retaining 98.7% of the autoregressive baseline's quality while delivering 2.42x higher wall-clock generation throughput.
A new paper from NVIDIA researchers argues that the reason diffusion language models keep underperforming autoregressive ones is a role-conflict problem: the same network is being asked to both represent clean context and iteratively denoise noisy tokens, and a single model gets stretched thin doing both.
Their answer, posted on arXiv, is Nemotron-TwoTower. The design is a block-wise autoregressive diffusion model that splits the work into two towers. A frozen autoregressive context tower processes clean tokens causally. A trainable diffusion denoiser tower, with bidirectional block attention, refines noisy blocks and pulls context in via cross-attention. The AR tower is not just a warm start: it stays frozen throughout training.
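The three attention patterns named above can be sketched as boolean masks. This is a minimal illustration, not the paper's implementation: the function names, the toy sequence length, the block size, and the assumption that a noisy block cross-attends only to clean tokens from strictly earlier blocks are all hypothetical choices made for the example.

```python
# Toy attention masks for a two-tower layout (illustrative assumptions only).
# True means "query position i may attend to key position j".

def causal_mask(n):
    """Context tower: clean position i attends to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_bidirectional_mask(n, block_size):
    """Denoiser self-attention: a noisy position attends to every
    position inside its own block, in both directions."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]

def cross_attention_mask(n, block_size):
    """Denoiser-to-context cross-attention (assumed rule): noisy position i
    sees clean context positions from strictly earlier blocks."""
    return [[j // block_size < i // block_size for j in range(n)]
            for i in range(n)]

def render(mask):
    """Draw a mask as text: 'x' for allowed, '.' for blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)

if __name__ == "__main__":
    n, block_size = 6, 2  # hypothetical toy sizes
    for name, mask in [
        ("context tower (causal)", causal_mask(n)),
        ("denoiser self-attention (block-bidirectional)",
         block_bidirectional_mask(n, block_size)),
        ("denoiser cross-attention to context",
         cross_attention_mask(n, block_size)),
    ]:
        print(name)
        print(render(mask))
        print()
```

Running the script prints one grid per mask, which makes the contrast visible: a lower triangle for the causal tower, a block diagonal for the denoiser, and a stepped pattern for cross-attention under the assumed rule.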
The base model is Nemotron-3-Nano-30B-A3B, described as an open-weight 30B hybrid Mamba-Transformer MoE. The diffusion tower is trained on approximately 2.1 trillion tokens. The two headline numbers the authors report are 98.7% of the autoregressive baseline's quality retained, and 2.42x higher wall-clock generation throughput. Weights have been posted under the Nemotron-TwoTower Hugging Face collection.
Why this might matter beyond the leaderboard question: diffusion LMs generate blocks in parallel rather than one token at a time, which is the mechanism behind the throughput number. If the two-tower recipe generalizes, teams that already have a strong autoregressive checkpoint could plausibly bolt on a diffusion denoiser and get a serving speedup without retraining the full backbone. That is a different cost story from most current inference optimizations, which squeeze the AR decoder itself.
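To make the parallel-block argument concrete, here is a rough count of sequential model calls under each scheme. The block size and the number of denoising steps per block are hypothetical values chosen for illustration, not figures from the paper, and a ratio of sequential steps is not the same thing as a wall-clock speedup, since each denoising pass can cost more than a single-token decode step.

```python
import math

def ar_steps(num_tokens):
    """Autoregressive decoding: one sequential forward pass per token."""
    return num_tokens

def block_diffusion_steps(num_tokens, block_size, denoise_steps):
    """Block-wise diffusion: blocks are produced one after another, and each
    block takes `denoise_steps` passes that update all its tokens at once."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps

if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the arithmetic.
    num_tokens, block_size, denoise_steps = 1024, 16, 4
    ar = ar_steps(num_tokens)
    bd = block_diffusion_steps(num_tokens, block_size, denoise_steps)
    print(f"AR sequential passes:              {ar}")
    print(f"Block-diffusion sequential passes: {bd}")
    print(f"Step ratio (not wall-clock):       {ar / bd:.2f}x")
```

The count only shrinks when the number of denoising steps per block is smaller than the block size, which is why both settings matter when comparing against an autoregressive baseline.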
The honest caveat is that the 98.7% and 2.42x figures come from the authors' own evaluation on their own baseline, and the paper is fresh. What the abstract doesn't give you is a benchmark-by-benchmark breakdown of where quality slips, or how the throughput number holds up across batch sizes, sequence lengths, and hardware. If the result survives independent testing on hard reasoning and code, the more interesting downstream question is which other frozen AR backbones this trick works on.
Originally reported by arxiv.org
Original headline: Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context