huggingface.co web signal

DeepSeek Releases V4-Pro-DSpark: 1.6T-Param MoE, MIT License


TL;DR

  • DeepSeek-V4-Pro-DSpark is a 1.6-trillion-parameter MoE model with 49B active parameters, a 1M-token context window, and an MIT license.
  • According to the model card, the DSpark variant needs 27% of DeepSeek-V3.2's single-token inference FLOPs at 1M-token context and 10% of its KV cache.
  • Reported benchmark scores, which are for the DeepSeek-V4-Pro-Max variant, include 93.5% on LiveCodeBench (highest among listed peers), 80.6% on SWE Verified, and a 3206 Codeforces rating.

Long-context inference has always had a practical ceiling: a million-token context window is not very useful if the cost to fill it is prohibitive. That is why the most operationally significant detail in DeepSeek's release of V4-Pro-DSpark on HuggingFace is not the MIT license or the headline parameter count -- it is the efficiency claim attached to the "-DSpark" suffix.

DSpark is not a new model. It is the standard DeepSeek-V4-Pro checkpoint (1.6 trillion total parameters, 49 billion activated per forward pass, a one-million-token context window) with a speculative decoding module from the DeepSpec project added for inference acceleration. According to the model card, this variant requires only 27% of the single-token inference FLOPs that DeepSeek-V3.2 uses at 1M-token context, and just 10% of its KV cache. Those numbers, if they hold on real workloads, materially change the unit economics of serving long-context requests at scale.
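To make the unit-economics point concrete, here is a minimal back-of-the-envelope sketch in Python. The two ratios come from the model card as quoted above; everything else is an assumption for illustration, in particular the idea that decode cost scales linearly with FLOPs and that KV cache is the only memory limit on concurrency. The names `RelativeCost` and `relative_cost` are hypothetical.

```python
from dataclasses import dataclass

# Ratios reported on the model card, relative to DeepSeek-V3.2 at 1M-token context.
FLOPS_RATIO = 0.27     # single-token inference FLOPs
KV_CACHE_RATIO = 0.10  # KV cache size


@dataclass
class RelativeCost:
    """Costs expressed in units of the DeepSeek-V3.2 baseline (baseline = 1.0)."""
    flops_per_token: float
    kv_cache_per_request: float
    requests_per_kv_budget: float  # requests that fit where the baseline fits one


def relative_cost(flops_ratio: float = FLOPS_RATIO,
                  kv_ratio: float = KV_CACHE_RATIO) -> RelativeCost:
    # ASSUMPTION: concurrency is limited only by KV cache memory, so the
    # number of requests per fixed memory budget is the inverse of the ratio.
    return RelativeCost(flops_ratio, kv_ratio, 1.0 / kv_ratio)


cost = relative_cost()
print(f"FLOPs per token: {cost.flops_per_token:.0%} of baseline")
# ASSUMPTION: a purely compute-bound workload; real decode is often
# memory-bandwidth-bound, so this is an upper bound, not a prediction.
print(f"Compute-bound speedup ceiling: {1 / cost.flops_per_token:.1f}x")
print(f"1M-token requests per KV budget: {cost.requests_per_kv_budget:.0f}x baseline")
```

Under these assumptions the ceiling is roughly 3.7x fewer FLOPs per token and about 10x as many 1M-token requests in the same KV cache budget. Measured throughput on real hardware may differ substantially.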

The underlying architecture uses a hybrid attention scheme combining Compressed Sparse Attention and Heavily Compressed Attention, along with Manifold-Constrained Hyper-Connections for signal propagation stability across the deep stack. The model was pre-trained on more than 32 trillion tokens and uses FP4 and FP8 mixed precision: experts in FP4, other parameters in FP8. That combination can improve throughput, but it constrains which hardware configurations can run the model efficiently, a detail the model card does not fully spell out.
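The precision split also determines how much memory the weights occupy, which is part of the hardware question. The sketch below is a rough estimate only: the total and active parameter counts are from the release, but the share of parameters that sit in experts is not stated, so `expert_fraction` is a hypothetical input and the values tried are illustrative guesses. The estimate ignores quantization scale factors, KV cache, and activations.

```python
TOTAL_PARAMS = 1.6e12  # total parameters, from the release
ACTIVE_PARAMS = 49e9   # parameters activated per forward pass, from the release

BYTES_FP4 = 0.5  # 4 bits per parameter
BYTES_FP8 = 1.0  # 8 bits per parameter


def weight_bytes(total_params: float, expert_fraction: float) -> float:
    """Estimate raw weight storage with experts in FP4 and the rest in FP8.

    expert_fraction is HYPOTHETICAL: the release does not say what share
    of the parameters belongs to experts.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return expert_params * BYTES_FP4 + other_params * BYTES_FP8


print(f"Active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.2%}")
for frac in (0.90, 0.95, 0.99):  # illustrative guesses, not published figures
    gigabytes = weight_bytes(TOTAL_PARAMS, frac) / 1e9
    print(f"expert_fraction={frac:.2f}: ~{gigabytes:,.0f} GB of weights")
```

Whatever the true split, the bounds are fixed by the arithmetic: between 800 GB (everything in FP4) and 1,600 GB (everything in FP8) for the raw weights, before any runtime overhead.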

The benchmark scores attached to the release are for a variant called DeepSeek-V4-Pro-Max rather than the DSpark checkpoint specifically, which is worth noting before treating the efficiency and quality numbers as a matched pair. Among those reported figures: 93.5% on LiveCodeBench (described as the highest among compared peers), 80.6% on SWE Verified, and a Codeforces rating of 3206. The model trails on some axes -- MMLU-Pro is 87.5% against 89.1% for Opus-4.6 Max, and SimpleQA-Verified at 57.9% falls well behind Gemini-3.1 at 75.6%. What the release does not give you is a direct quality comparison between DSpark and the non-speculative variant, which is the most important number for anyone evaluating whether the efficiency tradeoff is worthwhile.
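Until such a comparison is published, teams can run a crude one themselves. The following is a minimal sketch of an exact-match agreement check between two inference endpoints. The function `agreement_rate` and both stand-in generators are hypothetical; in practice each callable would wrap a real serving stack. Exact match is only meaningful under deterministic (greedy) decoding; with sampling, task-level scores would be the better comparison.

```python
from typing import Callable, Iterable

Generate = Callable[[str], str]  # prompt in, completion out


def agreement_rate(baseline: Generate, candidate: Generate,
                   prompts: Iterable[str]) -> float:
    """Fraction of prompts on which both generators return identical text."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    matches = sum(baseline(p) == candidate(p) for p in prompt_list)
    return matches / len(prompt_list)


# Stand-ins for illustration only; neither calls a real model.
def fake_baseline(prompt: str) -> str:
    return prompt.upper()


def fake_dspark(prompt: str) -> str:
    # Deliberately diverges on odd-length prompts to exercise the metric.
    return prompt.upper() if len(prompt) % 2 == 0 else prompt.upper() + "!"


prompts = ["ab", "abc", "abcd", "abcde"]
rate = agreement_rate(fake_baseline, fake_dspark, prompts)
print(f"exact-match agreement: {rate:.0%}")
```

A rate well below 100% on greedy decoding would suggest the accelerated path is changing outputs and that a full benchmark rerun is warranted.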

The MIT license and immediate support in vLLM and SGLang mean commercial deployment and community fine-tuning can begin now. The clearest near-term beneficiaries are inference providers that absorb the cost of long-context requests, and teams building legal, scientific, or large-codebase pipelines that need the full context length and want it without negotiating API terms. How quickly those teams can realize the efficiency gains in practice depends on whether the FP4 and FP8 precision requirements can be met on the hardware they already have.
