
DeepSeek open-sources DeepSpec speculative decoding toolkit

TL;DR

  • DeepSpec is an MIT-licensed full-stack codebase for training and evaluating draft models for speculative decoding.
  • In production on DeepSeek-V4, DSpark reportedly runs per-user generation 60 to 85 percent faster than the MTP-1 baseline.
  • Released checkpoints target Qwen3-4B, 8B, and 14B as well as Gemma-4-12B-it, and the default configs assume a single 8-GPU node.

The interesting thing about DeepSeek's latest drop is not the drafter model itself but the fact that the entire training and evaluation pipeline came with it. On GitHub the company published DeepSpec, described in the README as a full-stack codebase for training and evaluating draft models for speculative decoding, released under the MIT License. It ships three drafter architectures called DSpark, DFlash, and Eagle3, plus pre-trained checkpoints and the machinery for data prep, training, and benchmarking.

Speculative decoding is a serving trick. A small draft model guesses several tokens ahead, and the large target model verifies them in a single batched pass, keeping the guesses that match what it would have produced itself. When the guesses are good, each target pass yields several tokens and per-user latency drops. DeepSeek's headline claim, reported by MarkTechPost, is that in production on DeepSeek-V4, DSpark runs per-user generation 60 to 85 percent faster than the MTP-1 baseline. In an offline comparison against Eagle3, DSpark's macro-average accepted length rises 30.9, 26.7, and 30.0 percent on the three Qwen3 sizes.
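To make the draft-then-verify loop concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is not DeepSpec's code: `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical names invented for illustration, the "models" are plain functions over integer tokens, and the verification loop checks positions one at a time where a real server would score them all in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. The target checks each drafted position. This sequential loop
    #    stands in for what would be a single batched pass in practice.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Run rounds until max_new tokens exist; returns (tokens, rounds used)."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[: len(prefix) + max_new], rounds


# Toy models (assumptions for illustration only): the target counts upward
# mod 10, and the draft agrees except right after a token t with t % 3 == 2.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=3, max_new=8)
    print(tokens, rounds)
```

In this toy setup each round accepts two draft tokens plus one correction from the target, so eight new tokens take three verification rounds rather than eight target steps, and the output is identical to what the target would have produced decoding alone.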

For anyone serving open weights, the released checkpoints are the useful part. DeepSpec targets Qwen3-4B, 8B, and 14B and Gemma-4-12B-it, which covers a large slice of the current open-model middle of the market. The nine evaluation tasks span math (GSM8K, MATH500, AIME25), code (HumanEval, MBPP, LiveCodeBench), and chat (MT-Bench, Alpaca, Arena-Hard-v2), so you can inspect where accepted length holds up and where it falls apart.
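For readers who want to reproduce that kind of per-task inspection, the sketch below shows one common reading of "macro-average accepted length": average within each task first, then take the unweighted mean across tasks, so a large benchmark cannot drown out a small one. That definition is an assumption here, not something the reporting spells out, and the function names and toy numbers are hypothetical placeholders rather than DeepSpec results.

```python
from statistics import fmean
from typing import Dict, List


def macro_average(per_task: Dict[str, List[int]]) -> float:
    """Unweighted mean of the per-task mean accepted lengths."""
    return fmean(fmean(lengths) for lengths in per_task.values())


def micro_average(per_task: Dict[str, List[int]]) -> float:
    """Pooled mean over every verification round; large tasks dominate."""
    pooled = [n for lengths in per_task.values() for n in lengths]
    return fmean(pooled)


def relative_gain_percent(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    # Made-up accepted lengths per verification round, for illustration only.
    toy = {"task_a": [4, 4, 4, 4], "task_b": [2, 2]}
    print(macro_average(toy))   # each task counts equally
    print(micro_average(toy))   # each round counts equally
```

Comparing the macro figure with the per-task means is what reveals whether a drafter is strong across math, code, and chat or is carried by one category.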

The honest caveat is that the pipeline is heavy. The README notes that the default configs and scripts assume a single node with 8 GPUs, and the target cache runs roughly 38 TB for the default Qwen3-4B setting alone. The 60 to 85 percent figure is also measured against MTP-1, DeepSeek's existing multi-token-prediction drafter, not against naive decoding, so it will not carry over directly if you swap DSpark into a different serving stack with a different baseline.
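A small piece of arithmetic shows why the baseline matters. The sketch below assumes that "X percent faster" means a throughput ratio of 1 + X/100 and that speedups compose multiplicatively; both are simplifying assumptions, and the MTP-1 factor passed in is a hypothetical placeholder, since the reporting does not say how MTP-1 compares with naive decoding.

```python
def speedup_vs_other_baseline(gain_over_mtp1_percent: float,
                              mtp1_speedup_over_baseline: float) -> float:
    """Compose a gain measured over MTP-1 with MTP-1's own speedup.

    gain_over_mtp1_percent: reported gain over MTP-1, e.g. 60 or 85.
    mtp1_speedup_over_baseline: HYPOTHETICAL ratio of MTP-1 throughput to
        some other baseline; not a figure from the source.
    """
    return (1.0 + gain_over_mtp1_percent / 100.0) * mtp1_speedup_over_baseline


if __name__ == "__main__":
    hypothetical_mtp1_factor = 1.5  # placeholder value, not a measurement
    for gain in (60.0, 85.0):
        print(gain, speedup_vs_other_baseline(gain, hypothetical_mtp1_factor))
```

The same 60 to 85 percent gain therefore maps to very different end-to-end numbers depending on what the existing stack already does, which is why the reported range should be read as relative to MTP-1 only.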

What the reporting doesn't yet give you is how the drafter behaves under heavy concurrency or how the storage footprint scales for the 14B target. But for teams building inference infrastructure on open models, a working reference for training your own drafter, from a lab that actually runs these systems at scale, is worth the pipeline weight.
