dnhkng.github.io via Reddit June 8th 2026

Canada-Quant Hits 193 Tok/s on Dual GH200 Node

deepseek inference ai-business

Key insights

Canada-Quant's W4A16/FP8 checkpoint achieves 193 tok/s at MTP3, outperforming the official DeepSeek V4 Flash checkpoint's 149.5 tok/s peak by 29%.
MTP acceptance rates drop from 71.1% at level 3 to 47.9% at level 4, making MTP4 actively harmful to throughput on this hardware.
The dual GH200's cross-GPU path bottlenecks at 57-58 GB/s; disabling NCCL P2P and keeping decode in HBM is essential for peak performance.

Why this matters

Community quantized checkpoints are now definitively outperforming official model releases on the same hardware, meaning operators who rely solely on official checkpoints are leaving 29% throughput on the table. Ng's upstream vLLM PR #44847 closes a gap that was silently breaking MTP for an entire class of checkpoints missing FP8 scale metadata, with direct implications for anyone deploying W4A16/FP8-quantized models. The finding that MTP4 degrades throughput relative to MTP3 due to acceptance rate collapse challenges the assumption that higher speculative decoding levels are always better, requiring workload-specific profiling rather than defaulting to maximum MTP settings.

Summary

David Noel Ng benchmarked DeepSeek V4 Flash on a dual GH200 workstation and found quantization choice and MTP level tuning matter as much as raw hardware. Canada-Quant's W4A16/FP8 checkpoint hit 193 tok/s at MTP3 versus 149.5 tok/s for the official DeepSeek release -- a 29% gap partly explained by three vLLM bugs Ng fixed and submitted as upstream PR #44847: metadata naming mismatches, a missing prefix= parameter in MTP module construction, and a missing BF16 fallback for the O-projection when FP8 scale metadata is absent. Essentially: (Canada-Quant, vLLM) community checkpoints now measurably outrun official ones on identical hardware. - MTP4 degrades throughput to 162.1 tok/s as acceptance rates fall to 47.9%, versus 193.0 tok/s at MTP3. - Cross-GPU bandwidth on the dual GH200 caps at 57-58 GB/s, making NCCL_P2P_DISABLE=1 and HBM-local decode essential. Ng's reproducible configuration guide lowers the barrier for other dual-Hopper operators to capture the same gains.

Potential risks and opportunities

Risks

Operators running the official DeepSeek V4 Flash checkpoint without Ng's vLLM patches lose MTP functionality entirely, capping throughput at 64.9 tok/s versus a reachable 193.0 tok/s on the same hardware.
Deployments defaulting to MTP4 will see throughput regress to 162.1 tok/s due to the 47.9% acceptance rate, a regression invisible to standard correctness tests.
Any workload forcing data movement across the 57-58 GB/s Hopper-to-Hopper bottleneck will see severe throughput degradation, making dual GH200 nodes unreliable without topology-aware tuning.

Opportunities

Operators running dual GH200 nodes can capture a 29% throughput gain immediately by switching to the Canada-Quant W4A16/FP8 checkpoint and applying the NCCL_P2P_DISABLE=1 configuration from Ng's guide.
vLLM contributors who extend Ng's BF16 O-projection fallback pattern to other quantized checkpoint formats can unlock MTP for a broader class of community models currently blocked by the same metadata mismatch.
Inference tooling vendors targeting NVIDIA Grace Hopper deployments could package topology-aware MTP level selection and NCCL tuning as automated profiling features, given how non-obvious the optimal configuration is.

What we don't know yet

Whether the 29% throughput advantage of Canada-Quant over the official checkpoint holds at higher request concurrency, since all benchmarks used max_num_seqs=1 with a single request.
Whether vLLM PR #44847 has merged to main and in which release the BF16 O-projection fallback will ship for general availability.
How optimal MTP level selection changes across different prompt distributions beyond the single 8192-token benchmark workload used in this guide.

Originally reported by dnhkng.github.io

Read the original article →

Original headline: r/LocalLLaMA: Engineering Guide Achieves 193 Tok/s on DeepSeek V4 Flash on Dual GH200 — Canada-Quant W4A16/FP8 Beats Official Checkpoint by 29%