huggingface.co web signal July 1st 2026

KAIST, Microsoft cut MoE decode latency with expert-locality routing


TL;DR

ELDR routes decode requests by the experts prefill activated, cutting median TPOT 7.0-13.9% on task workloads versus the best load-balancing baseline.
It beats an oracle domain-labeled baseline by 1.4-6.9% median TPOT, showing signature-based routing captures more structure than hand-labeled domains.
Overhead is 0.86 ms per request, roughly 1.2% of the 69 ms median TTFT, implemented in about 2,000 lines of Python on vLLM.

A quiet piece of MoE serving research from KAIST and Microsoft, posted to arXiv via Hugging Face, argues that decode routing in disaggregated mixture-of-experts stacks has been solving the wrong problem. The dominant approach routes decode requests to whichever worker has the shortest queue. ELDR, short for Expert-Locality-Aware Decode Routing, instead routes by which experts a request's prefill phase activated, on the observation that MoE decode latency is governed by the union of distinct experts a batch touches, not just per-worker load.
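To make the "union of distinct experts" idea concrete, here is a minimal sketch for one MoE layer and one decode step. The expert IDs, the 16-expert layer, and the top-4 routing are made-up illustration values, not figures from the paper, and `distinct_experts` is a hypothetical helper name.

```python
from typing import Iterable


def distinct_experts(batch: Iterable[set[int]]) -> int:
    """Count the distinct experts a decode batch touches in one layer step."""
    touched: set[int] = set()
    for request_experts in batch:
        touched |= request_experts  # union across requests in the batch
    return len(touched)


# Hypothetical per-request expert sets (top-4 of 16 experts in one layer).
code_a = {0, 1, 2, 3}
code_b = {1, 2, 3, 4}
legal_a = {8, 9, 10, 11}

same_domain = [code_a, code_b]  # overlapping experts
mixed = [code_a, legal_a]       # disjoint experts

print(distinct_experts(same_domain))  # 5
print(distinct_experts(mixed))        # 8
```

Both batches hold two requests, so a shortest-queue router sees them as equal load, yet the mixed batch touches more experts in that step.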

The empirical hook is that prefill and decode phases activate strikingly similar experts, with a Spearman correlation reported between 0.70 and 0.92 across the three models tested. That lets the system build an "expert signature" during prefill, cluster signatures across decode workers with balanced K-means, and then route with a locality band that still respects live load. On task-mixed workloads spanning legal, code, math and medical prompts, median time-per-output-token (TPOT) drops 7.0 to 13.9% versus the best load-balancing baseline. On WildChat language traffic the reduction is 5.9 to 10.0%. Same-domain batches activate 17 to 21% fewer experts per step on the task workload, which is where the latency reduction comes from.
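The routing step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-worker centroids stand in for the output of balanced K-means, cosine similarity is an assumed similarity measure, and the "locality band" is modeled as a maximum allowed queue-length gap over the least-loaded worker. All function and parameter names are hypothetical.

```python
import math


def expert_signature(prefill_counts: list[int]) -> list[float]:
    """Normalize per-expert activation counts from prefill into frequencies."""
    total = sum(prefill_counts)
    if total == 0:
        return [0.0] * len(prefill_counts)
    return [c / total for c in prefill_counts]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(signature: list[float], centroids: list[list[float]],
          loads: list[int], band: int = 2) -> int:
    """Pick the best-matching decode worker among those within the load band.

    A worker is eligible only if its queue length is at most `band` above
    the least-loaded worker (an assumed definition of the locality band).
    """
    min_load = min(loads)
    eligible = [w for w, load in enumerate(loads) if load - min_load <= band]
    return max(eligible, key=lambda w: cosine(signature, centroids[w]))


# Hypothetical 4-expert layer with two decode workers.
centroids = [[0.5, 0.5, 0.0, 0.0],   # worker 0 serves experts 0 and 1
             [0.0, 0.0, 0.5, 0.5]]   # worker 1 serves experts 2 and 3
sig = expert_signature([6, 2, 0, 0])

print(route(sig, centroids, loads=[3, 2]))  # 0: locality wins inside the band
print(route(sig, centroids, loads=[9, 2]))  # 1: worker 0 is too loaded
```

The band is what keeps locality from overriding load entirely: once the preferred worker falls too far behind, the request goes to the best match among the lightly loaded workers instead.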

The authors evaluate on Qwen3-30B-A3B, GPT-OSS-120B, and Gemma-4-26B-A4B in an 8-prefill, 16-decode topology on 24 AMD MI300X GPUs across three nodes, with a larger Qwen3-235B-A22B study on 40 GPUs. Implementation is around 2,000 lines of Python on vLLM 0.21.0rc1, and the per-request routing overhead is 0.86 ms, roughly 1.2% of the 69 ms median TTFT. Notably, ELDR beats an oracle baseline that used ground-truth domain labels by 1.4 to 6.9% on median TPOT, which suggests the learned signature captures structure that hand-labeled domains miss.

The honest caveat is that the reported gains ride on a specific stack (AMD MI300X with ROCm 7.2 and 400 Gbps InfiniBand), and the large 235B expert-parallel deployment saw only a 2.7 to 4.3% median TPOT improvement, so the win shrinks at scale. The paper also does not discuss throughput, cost per token, or behavior under bursty adversarial traffic, and signature capture takes 4 to 15 minutes per model/dataset pair, which is real friction in multi-tenant serving.

For teams running open MoE models on vLLM, the interesting question is how fast this becomes upstream code, because it is the rare inference-side change that does not touch weights, does not alter expert selection, and slots in next to prefill-decode disaggregation rather than competing with it.

Originally reported by huggingface.co


Original headline: ELDR (KAIST + Microsoft Research): Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving Cuts Median TPOT 7-14% Across Qwen3, GPT-OSS, Gemma-4