ByteDance and Renmin U's iLLaDA 8B Diffusion LM Rivals Qwen2.5 7B Base
TL;DR
- iLLaDA is an 8B masked diffusion LLM trained on 12T tokens with fully bidirectional attention, matching Qwen2.5 7B Base on several benchmarks.
- Compared with LLaDA, iLLaDA-Base gains 21.6 points on BBH and 14.9 on ARC-Challenge; iLLaDA-Instruct gains 14.5 on MATH and 16.5 on HumanEval.
- iLLaDA-Instruct still trails Qwen2.5 7B Instruct on math and code, with reinforcement learning alignment left as explicit future work.
For most of the past few years, the working assumption in language modeling has been that autoregressive training (predicting the next token, left to right, one at a time) is simply the right way to build capable LLMs. A new paper from researchers at Renmin University of China and ByteDance Seed puts that assumption under real pressure.
iLLaDA is an 8B masked diffusion language model trained entirely from scratch using fully bidirectional attention. Rather than predicting the next token, it learns to reconstruct randomly masked tokens across the entire sequence simultaneously. The team scaled pre-training to 12 trillion tokens and fine-tuned on a 25-billion-token instruction corpus for 12 epochs. Key architectural choices include grouped-query attention to reduce memory footprint and tied input/output embeddings to keep parameter count in check.
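To make the training objective concrete, here is a minimal Python sketch of masked-token reconstruction. It illustrates the idea under stated assumptions and is not the authors' code: the mask symbol, the independent per-token masking at a given ratio, the unweighted average loss, and the helper names `corrupt`, `reconstruction_loss`, and `uniform_predictor` are all hypothetical. The one property taken from the description above is that the predictor sees the whole partially masked sequence, on both sides of every masked position.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def corrupt(tokens, mask_ratio, rng):
    """Replace each token with MASK independently with probability mask_ratio."""
    return [MASK if rng.random() < mask_ratio else tok for tok in tokens]


def reconstruction_loss(original, noisy, predict):
    """Average negative log-likelihood of the true tokens at masked positions.

    `predict(noisy, position)` is a stand-in for the model: it receives the
    full noisy sequence (left and right context) and returns a dict mapping
    each candidate token to a probability.
    """
    masked = [i for i, tok in enumerate(noisy) if tok == MASK]
    if not masked:
        return 0.0  # nothing was masked, so there is nothing to reconstruct
    total = 0.0
    for i in masked:
        probs = predict(noisy, i)
        total -= math.log(probs[original[i]])
    return total / len(masked)


def uniform_predictor(vocab):
    """A no-information baseline that assigns equal probability to every token."""
    p = 1.0 / len(vocab)

    def predict(noisy, position):
        return {tok: p for tok in vocab}

    return predict
```

With a uniform predictor over a five-word vocabulary, the loss on a fully masked sequence is log 5 (about 1.61), the no-information baseline that a trained model would need to beat. An autoregressive model, by contrast, would only ever condition on tokens to the left of the position being predicted.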
The benchmark results are the headline claim. Compared with the earlier LLaDA baseline, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge; iLLaDA-Instruct gains 14.5 points on MATH and 16.5 points on HumanEval. Against Qwen2.5 7B, iLLaDA-Base is described as slightly stronger on average and achieves the best results among the compared models on MMLU, BBH, ARC-Challenge, and GSM8K.
The instruct comparison is less flattering. iLLaDA-Instruct still trails Qwen2.5 7B Instruct on math and code, a gap the authors attribute partly to Qwen2.5's additional reinforcement learning alignment, which iLLaDA has not yet received. The paper also flags that the instruct model can enter repetitive reasoning loops on hard problems. The study is limited to 8B parameters, so whether these results hold at larger scales is a question the paper explicitly leaves for future work.
What the results do provide is meaningful evidence that the autoregressive paradigm is not uniquely necessary for strong language modeling. Bidirectional diffusion models have shown advantages in reversal reasoning and long-horizon planning; iLLaDA adds general benchmark competitiveness to that list. Model weights and code are publicly available at github.com/ML-GSAI/LLaDA, giving teams interested in the diffusion path a stronger open foundation to build from.
Originally reported by huggingface.co
Original headline: iLLaDA: ByteDance Seed and Renmin University Train 8B Fully Bidirectional Diffusion LLM From Scratch on 12T Tokens, Competitive With Qwen2.5 7B on Multiple Benchmarks