reddit.com via Reddit

Qwen 3.5 122B hits 40 tok/s on single DGX Spark

Key insights

  • Qwen 3.5 122B quantized to Int4 exceeds 40 tokens/sec on a single Nvidia DGX Spark, a rate the post presents as viable for production workloads.
  • The post publishes the full vLLM configuration and reports results at every tested context length, so other DGX Spark owners can replicate and verify the numbers.
  • This is among the first community validations that a desktop-sized AI system can serve a 100B+ parameter model at usable speed.

Why this matters

For AI infrastructure teams evaluating on-premise deployments, this result changes the cost-per-token calculation for large models: a single desktop system priced under $10,000 can now serve a 122B model at speeds that previously required multi-GPU clusters. For founders building on open-weight frontier models, it means self-hosted inference at this scale is no longer a multi-node engineering problem. For Nvidia, it provides third-party validation of the DGX Spark's positioning as a serious inference appliance rather than a developer toy, which directly supports enterprise sales against cloud-API alternatives.
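The cost-per-token comparison can be sketched as simple amortization arithmetic. Every input in the example call is an illustrative assumption rather than a figure from the post: the hardware price is a round placeholder at the upper end of the range mentioned above, and the power draw, electricity rate, and three-year lifetime are hypothetical values to swap for your own.

```python
def cost_per_million_tokens(
    hardware_cost_usd: float,
    amortization_years: float,
    power_watts: float,
    electricity_usd_per_kwh: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Amortized hardware plus electricity cost per one million generated tokens."""
    if tokens_per_second <= 0 or amortization_years <= 0:
        raise ValueError("tokens_per_second and amortization_years must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")

    hours_in_service = amortization_years * 365 * 24
    hardware_usd_per_hour = hardware_cost_usd / hours_in_service
    energy_usd_per_hour = (power_watts / 1000) * electricity_usd_per_kwh
    # Only the busy fraction of each hour produces tokens.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (hardware_usd_per_hour + energy_usd_per_hour) / tokens_per_hour * 1_000_000


# Placeholder inputs: replace each one with your own measurements and prices.
estimate = cost_per_million_tokens(
    hardware_cost_usd=10_000,      # assumed upper-bound purchase price
    amortization_years=3,          # assumed service life
    power_watts=500,               # hypothetical draw, not a measured figure
    electricity_usd_per_kwh=0.15,  # hypothetical rate
    tokens_per_second=40,          # headline single-stream figure from the post
)
print(f"~${estimate:.2f} per million tokens under these assumptions")
```

The result is highly sensitive to utilization: a machine that sits idle most of the day costs several times more per token than the fully loaded estimate, which is where a comparison against pay-per-token cloud APIs usually turns.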

Summary

A community developer on r/LocalLLaMA has published a fully reproducible vLLM configuration that pushes Qwen 3.5 122B at Int4 quantization past 40 tokens per second on a single Nvidia DGX Spark. The post claims the fastest recorded throughput in Spark Arena benchmarks at every tested context length and concurrency level. The DGX Spark is Nvidia's compact desktop AI system, and this result is one of the first community-validated demonstrations that the hardware can serve a 100B+ parameter model at throughput viable for real workloads, not just demos. The post includes the complete vLLM flags and model-loading parameters needed to replicate the setup.

In short, Nvidia and Alibaba's Qwen team now have third-party validation that their hardware and model stack can meet a production bar on compact, single-node infrastructure:

  • Qwen 3.5 122B at Int4 precision fits and runs on a DGX Spark without multi-node orchestration overhead
  • The published recipe covers all tested context lengths, suggesting the throughput holds even at longer sequences, where many quantized deployments slow down
  • Spark Arena is a community benchmark leaderboard, so other DGX Spark owners can independently verify the claim

The broader shift here is that production-grade inference for frontier-scale models is moving within reach of single-machine, on-premise deployments.
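Whether a 122B model fits on one machine at Int4 can be sanity-checked with back-of-envelope arithmetic. The sketch below counts weight storage only. The 128 GB unified-memory figure is an assumption taken from Nvidia's published DGX Spark specification, not from the post, and the 20% headroom for KV cache and runtime overhead is a hypothetical allowance.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw model weights in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_memory(weights_gb: float, memory_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` of memory for everything else."""
    return weights_gb <= memory_gb * (1 - headroom)


PARAMS = 122e9          # parameter count implied by the model name
SPARK_MEMORY_GB = 128   # assumption: unified memory per Nvidia's spec sheet

int4_gb = weight_footprint_gb(PARAMS, 4)    # 61.0
bf16_gb = weight_footprint_gb(PARAMS, 16)   # 244.0

print(f"Int4: {int4_gb:.0f} GB, fits: {fits_in_memory(int4_gb, SPARK_MEMORY_GB)}")
print(f"BF16: {bf16_gb:.0f} GB, fits: {fits_in_memory(bf16_gb, SPARK_MEMORY_GB)}")
```

Real Int4 checkpoints are somewhat larger than this estimate because they store per-group scales and often keep some layers at higher precision, and the KV cache grows with context length and concurrency, so treat the result as a lower bound on memory use.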

Potential risks and opportunities

Risks

  • If Spark Arena benchmarks lack a standardized methodology, competing configurations could dispute the top-speed claim and erode trust in community inference leaderboards more broadly
  • Enterprises deploying this recipe in production could hit reliability issues if the vLLM flags depend on edge-case optimizations that break with future vLLM or driver updates
  • Organizations relying on Int4-quantized Qwen models for on-premise deployments may face compliance exposure if their use cases require accuracy guarantees that the quantized model cannot meet
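The accuracy risk comes from rounding: Int4 leaves only 16 representable values per weight. The toy sketch below shows symmetric quantization of one group of weights in plain Python. It is an illustration of the general idea only, not the scheme used by the checkpoint in the post, which the summary does not describe.

```python
def quantize_int4(values: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization of one non-empty group to signed 4-bit codes."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to code 7; an all-zero group gets a dummy scale.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the 4-bit codes and the group scale."""
    return [c * scale for c in codes]


def max_abs_error(values: list[float]) -> float:
    """Largest round-trip error within the group."""
    codes, scale = quantize_int4(values)
    restored = dequantize_int4(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


group = [7.0, -3.0, 1.0, 0.4]
print(quantize_int4(group), max_abs_error(group))
```

A single large weight stretches the scale for the whole group, so small weights near it lose the most precision. Production schemes mitigate this with small group sizes and calibration, but only task-level evals show whether the remaining error matters for a given use case.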

Opportunities

  • Nvidia's DGX Spark sales team gains a concrete community benchmark to cite in enterprise pitches against cloud inference pricing, particularly for customers with data-residency requirements
  • vLLM contributors and inference platform providers such as Anyscale, Baseten, and Modal can build on the published recipe to offer turnkey DGX Spark deployment services
  • Alibaba's Qwen team gains further evidence of community adoption of Qwen 3.5 at scale, strengthening its position against Llama and Mistral in the open-weight enterprise market

What we don't know yet

  • Whether the 40+ tok/s figure holds under sustained multi-user concurrency or drops under real queue pressure beyond the benchmark conditions
  • What the actual quantization quality loss looks like on standard evals for Qwen 3.5 122B at Int4 relative to BF16, which the post does not address
  • Whether Nvidia has confirmed or reproduced these Spark Arena results internally, or whether the leaderboard claim remains solely community-sourced as of May 2026
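The concurrency question depends on which throughput is being quoted. A minimal sketch of the two common definitions follows, assuming you have logged per-request timings yourself. `RequestTiming` is a hypothetical record type, not part of vLLM or Spark Arena, and the summary does not say which definition the leaderboard uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RequestTiming:
    """Timing for one completed request (hypothetical record, not a vLLM type)."""
    start_s: float       # when the request was sent
    end_s: float         # when the last token arrived
    output_tokens: int   # number of generated tokens


def per_request_tok_s(timing: RequestTiming) -> float:
    """Generation speed seen by one client, prefill time included."""
    return timing.output_tokens / (timing.end_s - timing.start_s)


def aggregate_tok_s(timings: list[RequestTiming]) -> float:
    """Total tokens produced by the server divided by wall-clock time."""
    wall_s = max(t.end_s for t in timings) - min(t.start_s for t in timings)
    return sum(t.output_tokens for t in timings) / wall_s


# Two fully overlapping requests: each client sees 20 tok/s, the server delivers 40.
timings = [RequestTiming(0.0, 10.0, 200), RequestTiming(0.0, 10.0, 200)]
print(mean(per_request_tok_s(t) for t in timings), aggregate_tok_s(timings))
```

Reporting both numbers at each concurrency level, over a run long enough to fill the request queue, would show whether the headline figure describes what a single user experiences or what the machine delivers in total.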