reddit.com via Reddit May 20th 2026

Qwen 3.5 122B hits 40 tok/s on single DGX Spark

nvidia inference open source local-inference hardware-optimization

Key insights

Qwen 3.5 122B at Int4 achieves 40+ tokens/sec on a single Nvidia DGX Spark, meeting a production throughput threshold.
The fully reproducible vLLM recipe covers every tested context length, making community replication and verification straightforward.
This is among the first validations that a sub-rack desktop AI server can serve 100B+ parameter models at viable speed.

Why this matters

For AI infrastructure teams evaluating on-premise deployments, this result resets the cost-per-token calculus for large models: a single desktop machine that launched at roughly $4,000 can now serve a 122B model at speeds previously requiring multi-GPU clusters. For founders building on top of open-weight frontier models, it means self-hosted inference at this scale is no longer a multi-node engineering problem. For Nvidia, it provides third-party validation of the DGX Spark's positioning as a serious inference appliance rather than a developer toy, which directly supports enterprise sales against cloud-API alternatives.

Summary

A community developer on r/LocalLLaMA has published a fully reproducible vLLM configuration that pushes Qwen 3.5 122B at Int4 quantization past 40 tokens per second on a single Nvidia DGX Spark, claiming the fastest recorded throughput in Spark Arena benchmarks across every tested context length and concurrency level. The DGX Spark is Nvidia's compact desktop AI system with 128 GB of unified memory, and this result is one of the first community-validated demonstrations that the hardware can serve a 100B+ parameter model at throughput rates viable for real workloads, not just demos. The post includes the complete vLLM flags and model-loading parameters needed for anyone to replicate the setup. In short, Nvidia and Alibaba's Qwen team now have third-party validation that their respective hardware and model can meet a production bar on compact, single-node infrastructure.

- Qwen 3.5 122B at Int4 precision fits and runs on DGX Spark without multi-node orchestration overhead.
- The published recipe covers every tested context length, suggesting the throughput holds even at longer sequences, where decoding speed typically drops.
- Spark Arena is a community benchmark leaderboard, making the claim independently verifiable by other DGX Spark owners.

The broader shift here is that production-grade inference for frontier-scale models is moving within reach of single-machine, on-premise deployments.
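
The claim that a 122B model fits on one machine can be sanity-checked with rough arithmetic. The sketch below is not taken from the post: it assumes the parameter count implied by the model name, 128 GB of unified memory on the DGX Spark, and it counts weights only, ignoring quantization scales, KV cache, activations, and the operating system. The function and constant names are illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count read off the model name "122B".
PARAMS = 122e9
# Assumption: unified memory available on a single DGX Spark.
UNIFIED_MEMORY_GB = 128.0

int4_gb = weight_footprint_gb(PARAMS, 4)    # 4 bits per weight
bf16_gb = weight_footprint_gb(PARAMS, 16)   # 16 bits per weight
headroom_gb = UNIFIED_MEMORY_GB - int4_gb   # left for KV cache, activations, OS

print(f"Int4 weights: {int4_gb:.0f} GB")
print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"Headroom at Int4: {headroom_gb:.0f} GB")
```

Under these assumptions the Int4 weights come to about 61 GB, leaving roughly 67 GB for everything else, while the BF16 weights at about 244 GB would not fit at all. Real Int4 checkpoints also store per-group scales and often keep some layers at higher precision, so the true footprint is somewhat larger than this estimate.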

Potential risks and opportunities

Risks

If Spark Arena benchmarks lack standardized methodology, competing configurations could dispute the top-speed claim and erode trust in community inference leaderboards broadly.
Enterprises deploying this recipe in production could hit reliability issues if the vLLM flags exploit edge-case optimizations that break on future vLLM or driver updates.
Teams deploying Qwen at Int4 on-premise may face compliance exposure if their use cases require accuracy guarantees that the quantized model cannot provide.

Opportunities

Nvidia's DGX Spark sales team gains a concrete community benchmark to cite in enterprise pitches against cloud inference pricing, particularly for customers with data-residency requirements.
vLLM contributors and inference platform providers (such as Anyscale, Baseten, and Modal) can build on the published recipe to offer turnkey DGX Spark deployment services.
Alibaba's Qwen team benefits from expanded community adoption evidence for Qwen 3.5 at scale, strengthening its position against Llama and Mistral in the open-weight enterprise market.

What we don't know yet

Whether the 40+ tok/s figure holds under sustained multi-user concurrency or drops under real queue pressure beyond the benchmark conditions.
What the actual quantization quality loss looks like on standard evals for Qwen 3.5 122B at Int4 relative to BF16, which the post does not address.
Whether Nvidia has confirmed or reproduced these Spark Arena results internally, or whether the leaderboard claim remains solely community-sourced as of May 2026.
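
The concurrency question turns on which number is being reported: the decode speed a single request sees, or the total tokens per second summed across simultaneous requests. A minimal sketch of that distinction follows. It is not the post's or Spark Arena's methodology; the `RequestTiming` record, both function names, and the sample timings are all hypothetical and exist only to show how the two figures diverge.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record, with times in seconds on one shared clock."""
    start_s: float        # when the request was submitted
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    output_tokens: int    # number of tokens generated


def decode_tok_per_s(r: RequestTiming) -> float:
    """Speed one user sees: tokens after the first, over time after the first."""
    decode_time = r.end_s - r.first_token_s
    if decode_time <= 0 or r.output_tokens < 2:
        return 0.0
    return (r.output_tokens - 1) / decode_time


def aggregate_tok_per_s(timings: list[RequestTiming]) -> float:
    """Server-side throughput: all output tokens over the wall-clock span."""
    if not timings:
        return 0.0
    span = max(r.end_s for r in timings) - min(r.start_s for r in timings)
    total = sum(r.output_tokens for r in timings)
    return total / span if span > 0 else 0.0


# Invented timings for two requests running at the same time.
runs = [
    RequestTiming(start_s=0.0, first_token_s=0.5, end_s=10.5, output_tokens=201),
    RequestTiming(start_s=0.0, first_token_s=0.5, end_s=10.5, output_tokens=201),
]

per_user = [decode_tok_per_s(r) for r in runs]
total = aggregate_tok_per_s(runs)
print(f"per-request decode: {per_user}")
print(f"aggregate: {total:.1f} tok/s")
```

In this made-up case each user sees 20 tok/s while the machine as a whole delivers about 38 tok/s, so a headline figure near 40 could describe either a fast single stream or a busy server with slower individual streams. Knowing which one a leaderboard entry reports is what determines whether it holds under real queue pressure.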

Originally reported by reddit.com

Original headline: r/LocalLLaMA: Developer Publishes Optimized vLLM Recipe for Qwen 3.5 122B Int4 on Single DGX Spark — Claims Top Spark Arena Speed Across All Context Lengths