huggingface.co web signal

Alibaba AMAP BlockPilot hits 4.20x speedup on Qwen3-4B decoding

TL;DR

  • Alibaba's AMAP team argues fixed inference block size in diffusion speculative decoding is suboptimal because the ideal block size varies per input.
  • BlockPilot uses a two-layer MLP on the prefill last-token distribution to pick a block size in {B-k,...,B+k} with k=2, run once per request.
  • The paper reports a 4.20x speedup and an acceptance length of 5.92 on Qwen3-4B at temperature 1, and a 4.66x speedup on Qwen3-8B at temperature 0.

Speculative decoding research has spent most of its energy on stronger draft models and cleverer verification. A new paper from Alibaba's AMAP team, posted on Hugging Face, argues the fixed block size everyone inherits from training is quietly leaving speedup on the table.

The pitch is narrow. Diffusion-based drafters such as DFlash generate a block of candidate tokens in parallel and let the target model verify them. Existing methods pick one block size at training time and use it for every input. The authors' claim is that the optimal block size actually varies across samples, and that swapping the fixed choice for a per-input prediction improves both acceptance length and end-to-end speedup.
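To see why block size matters, it helps to look at how one draft-then-verify step is scored. The sketch below is the generic greedy form of speculative verification, not DFlash's or BlockPilot's actual code: the function name and token lists are hypothetical, and conventions differ on whether the target's own "bonus" token counts toward acceptance length.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count tokens committed in one draft-then-verify step (greedy case).

    Assumption: target_tokens[i] is the target model's greedy choice at
    position i given the prefix plus draft_tokens[:i], so it has
    len(draft_tokens) + 1 entries.
    """
    matched = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # first mismatch ends the accepted prefix
        matched += 1
    # The target's own token at the mismatch (or after a fully accepted
    # block) is kept as well, so every step commits at least one token.
    return matched + 1


# Hypothetical block of four drafted tokens; the third one is rejected.
draft = [5, 9, 2, 7]
target = [5, 9, 4, 1, 3]
step_tokens = accepted_length(draft, target)  # two matches plus one target token
```

In this toy case the last two drafted positions are wasted work, which is the inefficiency a per-input block size is meant to reduce: inputs that are easy to draft can use a longer block, harder ones a shorter block.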

Rather than running a heavy online search, the authors report that the optimal block size stays within a narrow window around the training value B, so they treat the choice as a small discrete classification problem. A two-layer MLP with hidden size 2048 reads the target model's predictive distribution for the last token after prefill and picks a block size from the local neighborhood {B-k, ..., B+k} with k=2, which gives five candidates. The prediction runs only once per request.

On Qwen3-4B they report an acceptance length of 5.92 and a 4.20x speedup at temperature 1, and a 4.17x speedup at temperature 0. On Qwen3-8B the speedups are 4.66x at temperature 0 and 3.94x at temperature 1. The evaluation covers math, code and chat benchmarks, including GSM8K, MATH-500, HumanEval, MBPP, SWE-Bench and MT-Bench, and runs on NVIDIA H100 80GB GPUs. The code is available in the AMAP-ML GitHub repository.
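The predictor described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released code: the class name, the ReLU activation, the base block size of 16 and the random weights are all assumptions, and the demo uses a toy vocabulary and hidden size rather than the 2048-unit hidden layer the article mentions.

```python
import math
import random


def softmax(logits):
    """Turn last-token logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class BlockSizePredictor:
    """Two-layer MLP: vocab distribution -> hidden -> 2k+1 block-size classes."""

    def __init__(self, vocab_size, hidden_size, base_block, k=2, seed=0):
        rng = random.Random(seed)
        # Candidate block sizes {B-k, ..., B+k} around the training value B.
        self.candidates = list(range(base_block - k, base_block + k + 1))
        # Assumption: random weights stand in for trained parameters.
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(vocab_size)]
                   for _ in range(hidden_size)]
        self.b1 = [0.0] * hidden_size
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_size)]
                   for _ in range(len(self.candidates))]
        self.b2 = [0.0] * len(self.candidates)

    def predict(self, last_token_probs):
        """Return the block size with the highest class score."""
        # Layer 1 with ReLU (the activation is an assumption).
        hidden = [
            max(0.0, sum(w * p for w, p in zip(row, last_token_probs)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Layer 2: one score per candidate block size.
        scores = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.candidates[best]


# Toy sizes for the demo; base_block=16 is a hypothetical training value.
predictor = BlockSizePredictor(vocab_size=32, hidden_size=16, base_block=16)
demo_rng = random.Random(1)
probs = softmax([demo_rng.gauss(0.0, 1.0) for _ in range(32)])
block_size = predictor.predict(probs)  # called once, right after prefill
```

Because the classifier has only 2k+1 outputs and runs a single time per request, its cost is small next to the decoding loop it configures, which is the trade-off the authors are betting on.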

The honest caveats are the usual ones for a research paper. The comparisons are against DFlash and EAGLE-3 in an academic harness, not a production serving stack with continuous batching, so real-world numbers could look different. The predictor is trained on ShareGPT, WSC and COPA, and what the paper doesn't tell you is how the classifier holds up on workloads far from that mix, or how the small extra memory footprint behaves at high concurrency.

Still, the framing is the interesting part. Treating the decoding strategy itself as a learnable component, rather than only the drafter, is a cheap knob to turn without touching the target model or the verification path. If it generalizes, the people who benefit soonest are teams already serving open Qwen3 or Llama-family models on their own hardware.