reddit.com via Reddit

RTX PRO 6000 Blackwell gets first Qwen 3.6 LLM benchmarks

nvidia inference open source local-llm inference hardware

Key insights

  • Each RTX PRO 6000 Blackwell card provides 96GB of GDDR7 ECC memory and 1.8 TB/s of memory bandwidth, for 192GB across two cards.
  • Qwen 3.6 27B ran unquantized at full BF16 precision, and the 35B A3B MoE variant was also tested, both under vLLM.
  • These are the first publicly available LLM inference benchmarks for the PRO 6000, a GPU tier previously lacking community throughput data.
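
The memory claim above can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes the nominal parameter counts in the model names (27B and 35B), 2 bytes per BF16 parameter, and decimal gigabytes, and it ignores KV cache, activations, and framework overhead. The function names are illustrative and not part of any library.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each parameter in 16 bits


def weight_footprint_gb(params_billions: float,
                        bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate weight memory in decimal GB for a given parameter count."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def headroom_gb(total_vram_gb: float, params_billions: float) -> float:
    """VRAM left after weights, before KV cache and runtime overhead."""
    return total_vram_gb - weight_footprint_gb(params_billions)


if __name__ == "__main__":
    total_vram_gb = 2 * 96.0  # two 96GB cards, per the post
    # Nominal sizes taken from the model names; real checkpoints may differ.
    for name, params in (("27B dense", 27.0), ("35B A3B MoE", 35.0)):
        print(name, weight_footprint_gb(params), headroom_gb(total_vram_gb, params))
```

Under these assumptions the 27B weights take roughly 54GB, which is why the dual-card rig has room left for long contexts. Whether the 35B MoE run also used BF16 is not stated in the summary, so its figure here is a what-if.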

Why this matters

The PRO 6000 occupies a pricing and capability tier between consumer RTX 5090s and datacenter H100s that enterprises consider for on-prem inference. Until now, the absence of community benchmark data has made hardware procurement decisions for this tier largely speculative. AI labs and founders evaluating 192GB on-prem workstation builds for serving unquantized frontier models can now compare real vLLM throughput against cloud H100 cost-per-token before committing. The benchmark also establishes vLLM on Blackwell professional GPUs as a viable reference configuration, which shapes how inference stack vendors prioritize driver and backend optimization for this hardware class.

Summary

RTX PRO 6000 Blackwell, Nvidia's enterprise workstation GPU priced above consumer RTX 5090s, now has its first community LLM inference benchmarks. A LocalLLaMA developer ran Qwen 3.6 in two configurations on a dual-card setup: 27B BF16 unquantized and 35B A3B MoE, using the latest stable vLLM backend. Each PRO 6000 carries 96GB GDDR7 ECC and 1.8 TB/s bandwidth, giving the two-card rig 192GB total for full-precision inference without quantization compromises. In short, community benchmarks spanning Nvidia, Alibaba/Qwen, and vLLM are filling a reference gap that neither Nvidia nor cloud vendors have addressed for this hardware tier.

  • 192GB combined VRAM enables unquantized BF16 on 27B models with context headroom to spare.
  • The 35B A3B MoE config tests whether expert-routing efficiency translates to real workstation-class throughput.
  • PRO 6000 sits between consumer RTX 5090 and datacenter H100, with no prior public LLM throughput data.

Enterprise buyers now have a first reference point for this tier before committing capital to on-prem deployments.
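
One way to frame the dense-versus-MoE question is a memory-bandwidth ceiling on single-stream decoding: each generated token has to read the active weights once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below is a rough upper bound and not a prediction of the benchmark results. It assumes BF16 weights, about 3B active parameters for the MoE (inferred from the "A3B" name), a perfect two-way tensor-parallel split, and no cost for KV cache reads or inter-card communication. The function name is illustrative.

```python
def decode_ceiling_tokens_per_s(active_params_billions: float,
                                bandwidth_tb_s: float,
                                num_gpus: int = 1,
                                bytes_per_param: int = 2) -> float:
    """Bandwidth-bound upper limit on batch-size-1 decode throughput.

    Assumes every active parameter is read exactly once per token and that
    the weights are split evenly across GPUs with no communication cost.
    """
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    aggregate_bytes_per_s = bandwidth_tb_s * 1e12 * num_gpus
    return aggregate_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # 1.8 TB/s per card and two cards, per the post.
    dense = decode_ceiling_tokens_per_s(27.0, 1.8, num_gpus=2)
    # ~3B active parameters is an assumption based on the "A3B" label.
    moe = decode_ceiling_tokens_per_s(3.0, 1.8, num_gpus=2)
    print(f"27B dense ceiling: {dense:.1f} tok/s")
    print(f"35B A3B MoE ceiling: {moe:.1f} tok/s")
```

Real throughput will land below these ceilings, and batched serving changes the picture because weight reads are shared across requests. The gap between the two ceilings is the efficiency the MoE configuration is meant to test.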

Potential risks and opportunities

Risks

  • Enterprise buyers who pre-purchased PRO 6000 units for LLM inference could face budget justification pressure if throughput-per-dollar trails cloud H100 spot pricing once full benchmark data surfaces.
  • vLLM may not yet fully exploit PRO 6000's GDDR7 ECC memory bandwidth, meaning published results could understate real performance and create misleading procurement baselines until backend updates ship.
  • Qwen license terms for commercial on-prem deployment remain a legal uncertainty for enterprises building production pipelines around these benchmark configurations, particularly outside China.

Opportunities

  • Workstation OEMs (Dell, HP Z-series, Lenovo ThinkStation) can use these benchmarks as sales collateral to position dual-PRO 6000 systems directly at AI labs evaluating on-prem inference alternatives to cloud.
  • vLLM and competing inference backends (TensorRT-LLM, SGLang) have a concrete optimization target in PRO 6000's GDDR7 ECC profile, with first-mover throughput improvements likely to drive workstation enterprise adoption.
  • Alibaba/Qwen gains independent third-party validation of its 35B MoE architecture's practical inference efficiency on non-datacenter hardware, strengthening its positioning against Meta Llama and Mistral for on-prem enterprise deployments.

What we don't know yet

  • Actual tokens-per-second figures for both model configurations were not included in the summary, leaving cost-per-token comparisons against H100 cloud instances unresolved.
  • Whether vLLM's current tensor parallelism implementation fully exploits each card's 1.8 TB/s memory bandwidth, or whether inter-card communication leaves meaningful throughput on the table pending software updates.
  • How PRO 6000 performance-per-dollar compares to multi-GPU consumer RTX 5090 alternatives at current retail pricing for buyers without enterprise support requirements.
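
Once tokens-per-second figures are available, the cost-per-token comparison mentioned above reduces to simple arithmetic. The sketch below shows one possible way to set it up. Every input is a placeholder to be replaced with measured throughput and real quotes; none of the numbers come from the benchmark post, and the function and parameter names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def cloud_cost_per_million_tokens(hourly_rate_usd: float,
                                  tokens_per_second: float) -> float:
    """USD per one million generated tokens on a rented instance."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1e6


def on_prem_cost_per_million_tokens(hardware_cost_usd: float,
                                    amortization_years: float,
                                    power_watts: float,
                                    electricity_usd_per_kwh: float,
                                    tokens_per_second: float,
                                    utilization: float = 1.0) -> float:
    """USD per one million tokens for owned hardware.

    Hardware cost is spread evenly over the amortization period; power is
    billed for every hour; utilization is the fraction of time the rig is
    actually serving tokens.
    """
    hourly_hardware = hardware_cost_usd / (amortization_years * HOURS_PER_YEAR)
    hourly_power = power_watts / 1000 * electricity_usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (hourly_hardware + hourly_power) / tokens_per_hour * 1e6


if __name__ == "__main__":
    # Placeholder inputs only: substitute measured throughput and real prices.
    print(cloud_cost_per_million_tokens(hourly_rate_usd=3.6,
                                        tokens_per_second=1000))
    print(on_prem_cost_per_million_tokens(hardware_cost_usd=17520,
                                          amortization_years=2,
                                          power_watts=1000,
                                          electricity_usd_per_kwh=0.10,
                                          tokens_per_second=1000))
```

The on-prem figure is very sensitive to utilization and the amortization period, so a workstation that sits idle most of the day can cost more per token than its raw throughput suggests.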