github.com via Hacker News

DeepSeek Open-Sources DeepSpec Speculative Decoding Stack


TL;DR

  • DeepSpec ships three draft-model algorithms (DSpark, DFlash, Eagle3) with a full data-prep, training, and evaluation pipeline targeting Qwen3 and Gemma model families.
  • DSpark reportedly raises per-user generation speed by 60-85% on DeepSeek's Flash model and 57-78% on its Pro model versus the prior MTP-1 baseline.
  • The default training configuration requires a single 8-GPU node and roughly 38 TB of storage for the target cache.

Speculative decoding is among the most practical tools for cutting LLM inference costs without touching model quality: a lightweight draft model proposes several tokens, the larger target model verifies them in a single parallel pass, and wall-clock time drops because each expensive target pass can yield more than one token. DeepSeek has now open-sourced DeepSpec, a full-stack codebase for training and evaluating draft models. It ships with three algorithms (DSpark, DFlash, and Eagle3) alongside data-preparation scripts, multi-GPU training pipelines, and evaluation across nine benchmarks including GSM8K, MATH500, HumanEval, and LiveCodeBench.
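The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration of generic greedy speculative decoding, not DeepSpec's code or API: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy functions over integers, and verification is a sequential exact-match check where a real system would use one batched target forward pass.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> Tuple[List[Token], int]:
    """Run one draft-then-verify round.

    Returns (tokens to append, number of draft tokens accepted).
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal. A real system scores all k
    #    positions in one forward pass; this loop only mimics the outcome.
    emitted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token, discard the rest.
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target's next token comes for free.
    emitted.append(target(ctx))
    return emitted, k


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], List[int]]:
    """Generate max_new tokens; also return accepted-draft counts per round."""
    out = list(prefix)
    accepted_per_round: List[int] = []
    while len(out) - len(prefix) < max_new:
        new_tokens, n_accepted = speculative_step(out, draft, target, k)
        out.extend(new_tokens)
        accepted_per_round.append(n_accepted)
    return out[: len(prefix) + max_new], accepted_per_round


# Toy models (assumptions for illustration): the target counts upward mod 10,
# and the draft agrees with it except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, accepted = generate([0], toy_draft, toy_target, k=3, max_new=12)
```

Because the target has the final say at every position, the output is identical to what the target would produce by decoding alone; the draft only changes how many target passes are needed to get there.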

The headline addition is DSpark. According to KuCoin's reporting on the release, it employs a semi-autoregressive generation method and confidence-scheduled validation to reduce GPU stalls, combining high-throughput parallel generation with adaptive load-aware verification. The reported results on DeepSeek's own models are notable: DSpark raises per-user generation speed by 60-85% on the Flash model and 57-78% on the Pro model versus the prior MTP-1 baseline. On Qwen3 models tested at the 4B, 8B, and 14B parameter scales, DSpark's average acceptance length is 26.7% to 30.9% higher than Eagle3's and 16.3% to 18.4% higher than DFlash's.
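To make the acceptance-length comparison concrete, here is a small sketch of how such a percentage is typically derived. It assumes "acceptance length" means the average number of tokens emitted per target verification pass, which is the common usage but is not spelled out in the reporting. The per-round counts below are invented for illustration and are not figures from the release.

```python
from typing import Sequence


def mean_acceptance_length(tokens_per_round: Sequence[int]) -> float:
    """Average number of tokens emitted per target verification pass."""
    if not tokens_per_round:
        raise ValueError("need at least one verification round")
    return sum(tokens_per_round) / len(tokens_per_round)


def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline` (0.25 means 25% higher)."""
    return (new - baseline) / baseline


# Made-up per-round token counts for two hypothetical draft models.
baseline_rounds = [3, 4, 3, 2, 4]
candidate_rounds = [4, 5, 4, 3, 4]

baseline_len = mean_acceptance_length(baseline_rounds)
candidate_len = mean_acceptance_length(candidate_rounds)
gain = relative_improvement(candidate_len, baseline_len)

print(f"baseline: {baseline_len:.2f} tokens/pass")
print(f"candidate: {candidate_len:.2f} tokens/pass")
print(f"relative improvement: {gain:.1%}")
```

A higher acceptance length means fewer target passes per generated token, but it does not translate one-to-one into wall-clock speedup, since the cost of running the draft model and the verification pass also matters.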

The training pipeline targets a single node with 8 GPUs and currently supports the Qwen3 and Gemma model families. One concrete constraint: the target cache for the default Qwen3-4B configuration runs to roughly 38 TB of storage, which sets a real floor on who can run the full pipeline.

The honest caveat is that every speedup figure here is benchmarked against DeepSeek's own prior technique, MTP-1, on DeepSeek's own infrastructure, while the Qwen3 comparisons report acceptance length, which is a proxy for speed and not a throughput measurement. What the release does not give you is production throughput data in tokens per second, or evidence that the gains hold outside the two supported model families or beyond DeepSeek's hardware setup.

For teams already running Qwen3 or Gemma, DeepSpec offers a path to faster inference without retraining the base model, provided they can meet the hardware and storage requirements for training a draft model. The broader value of releasing the full training codebase, and not just pretrained weights, is that researchers can now train custom draft models and benchmark them against the same nine-dataset suite DeepSeek uses internally, which is the kind of shared ground that actually moves a field forward.