huggingface.co web signal

SJTU-Huawei paper speeds diffusion LLMs with multi-block training

TL;DR

  • MBD-LLaDA2-Mini raises average Tokens Per Forward from 3.47 to 6.19 while lifting accuracy from 79.95% to 81.03% on math and code benchmarks.
  • With DMax layered on, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 and 951.41 tokens per second, versus 781.50 for the LLaDA2-Mini-DMax baseline, with a 1.02 percentage-point accuracy drop.
  • Multi-block Teacher Forcing is applied as post-training to existing Block Diffusion LMs, paired with a Block Buffer inference pipeline.

Diffusion language models keep chipping away at the throughput gap with autoregressive systems, and a new paper from Shanghai Jiao Tong, Xi'an Jiao Tong, and Huawei is a small but concrete step in that direction. Posted to Hugging Face's paper feed at the end of June, "Multi-Block Diffusion Language Models" attacks the specific bottleneck of Block Diffusion LMs, where blocks of tokens still get processed one after another even though the model can denoise within a block in parallel.

The team's fix is a post-training recipe called Multi-block Teacher Forcing, or MultiTF, which turns an existing Block Diffusion LM into what the authors call an MBD-LM. The idea is straightforward. Existing block diffusion training mostly shows the model one noisy block at a time, but efficient inference wants to overlap several. MultiTF trains on bounded groups of consecutive noisy blocks with randomized noise schedulers, so the training state better matches what the model actually sees at decoding time. Alongside the training change, they add a Block Buffer mechanism at inference that keeps input shapes static, preserves KV and prefix caching, and stays friendly to CUDA Graph capture and replay.
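To make the training-side idea concrete, here is a minimal sketch of how a multi-block noisy training example might be built. It is not the authors' implementation: the function name make_multiblock_example, the MASK token, the parameters block_size and max_group, and the choice of an independent, uniformly sampled mask ratio per block are all assumptions made for illustration. The article only says that MultiTF trains on bounded groups of consecutive noisy blocks with randomized noise.

```python
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def make_multiblock_example(tokens, block_size, max_group, rng):
    """Noise a bounded group of consecutive blocks; keep the prefix clean.

    Returns (noisy_tokens, targets, (start_block, group_size)), where
    targets[i] is the original token at masked positions and None elsewhere.
    """
    # Split the sequence into fixed-size blocks (the last one may be shorter).
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]

    # Choose how many consecutive blocks are noisy, bounded by max_group.
    group = rng.randint(1, min(max_group, len(blocks)))
    # Choose where the noisy group starts; earlier blocks are clean context.
    start = rng.randint(0, len(blocks) - group)

    # Clean prefix: copied through unchanged, no loss computed on it.
    noisy = [tok for block in blocks[:start] for tok in block]
    targets = [None] * len(noisy)

    # Noisy group: each block gets its own mask ratio (an assumed scheme).
    for block in blocks[start:start + group]:
        ratio = rng.uniform(0.0, 1.0)
        for tok in block:
            if rng.random() < ratio:
                noisy.append(MASK)
                targets.append(tok)
            else:
                noisy.append(tok)
                targets.append(None)

    # Blocks after the noisy group are dropped in this sketch.
    return noisy, targets, (start, group)


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, targets, (start, group) = make_multiblock_example(
        list(range(32)), block_size=4, max_group=3, rng=rng
    )
    print(f"clean prefix blocks: {start}, noisy blocks: {group}")
    print(noisy)
```

The point of the sketch is only the shape of the training state: several adjacent blocks are partially masked at once, each at a different noise level, which is closer to what a decoder sees when it overlaps blocks at inference time than a single noisy block would be.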

The results cover math and code benchmarks, with LLaDA2-Mini as the baseline. MBD-LLaDA2-Mini raises average Tokens Per Forward (TPF) from 3.47 to 6.19, a 78.4% jump, and average accuracy actually nudges up from 79.95% to 81.03%. Layering the DMax accelerator on top pushes average TPF to 9.34, a 47.1% gain over LLaDA2-Mini-DMax under single-block decoding, at the cost of a 1.02 percentage-point accuracy drop. In wall-clock terms the authors report 951.41 tokens per second on average for MBD-LLaDA2-Mini-DMax against 781.50 for the LLaDA2-Mini-DMax baseline, using their own inference engine.
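The percentages can be checked directly from the reported figures. The short script below recomputes the relative gains; the helper relative_gain is just an illustrative name, and the implied DMax baseline TPF is back-calculated from the rounded 47.1% figure, so treat it as approximate rather than as a number from the paper.

```python
def relative_gain(baseline, improved):
    """Percentage change going from baseline to improved."""
    return (improved - baseline) / baseline * 100.0


# Figures as reported in the article.
tpf_gain = relative_gain(3.47, 6.19)       # TPF, LLaDA2-Mini -> MBD-LLaDA2-Mini
tps_gain = relative_gain(781.50, 951.41)   # tokens/s, DMax baseline -> MBD + DMax
acc_delta = 81.03 - 79.95                  # accuracy change in percentage points

# Back-calculated, approximate: the DMax baseline TPF implied by a 47.1% gain.
implied_dmax_tpf = 9.34 / 1.471

print(f"TPF gain: {tpf_gain:.1f}%")                       # 78.4%
print(f"TPS gain: {tps_gain:.1f}%")                       # 21.7%
print(f"Accuracy change: {acc_delta:+.2f} points")        # +1.08 points
print(f"Implied DMax baseline TPF: {implied_dmax_tpf:.2f}")  # about 6.35
```

Note the gap between the two speedups with DMax: TPF rises by about 47% while wall-clock throughput rises by about 22%, so fewer forward passes do not translate one-for-one into tokens per second.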

The honest caveat is that the evaluation is scoped to math and code, so how the method behaves on open-ended text, long-form reasoning, or general benchmarks isn't in the paper. The scale is also modest: this is LLaDA2-Mini rather than a frontier-sized model, and the paper doesn't quantify the extra compute the MultiTF post-training step itself costs. A TPS gain of roughly 22% over a specific baseline is not the same as parity with the strongest autoregressive serving stacks.

Still, the pattern is worth watching if you care about diffusion LMs as a serving option rather than a research curio. Post-training tweaks that translate cleanly into throughput without eating quality are what make a research paradigm start looking like an engineering one, and the Block Buffer trick is the kind of pragmatic inference-plumbing detail that tends to get copied.