r/LocalLLaMA: INT8 Q/DQ on Blackwell Beats TRT 10 Auto-FP16 by 1.8× — Mandatory Explicit Quantization Path Hits Dedicated 5th-Gen Tensor Core Pipeline
Summary
A developer reports that TensorRT 11's mandatory explicit-quantization (INT8 Q/DQ) path on NVIDIA Blackwell GPUs runs 1.8× faster than TRT 10's auto-FP16 builds. According to the post, the speedup comes from routing the workload through Blackwell's dedicated 5th-generation Tensor Core INT8 path, which auto-FP16 builds never reach. The result was obtained by applying proper post-training quantization (PTQ) to a 188 MB FP32 ONNX model, and the developer has published a practical calibration writeup for practitioners upgrading to TRT 11. The finding matters for production inference teams migrating to Blackwell hardware who assumed auto-FP16 was the path of least resistance.
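For readers unfamiliar with what a Q/DQ pair does, the arithmetic can be sketched in a few lines. This is only an illustration: the summary does not describe the developer's calibration procedure, so the sketch assumes the simplest scheme (symmetric, per-tensor, max-based calibration with zero point 0). The function names and sample values are hypothetical, and none of this is TensorRT API.

```python
def calibrate_scale(samples):
    """Symmetric per-tensor max calibration (assumed scheme):
    map the largest observed |x| to the INT8 value 127."""
    amax = max(abs(x) for x in samples)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(x, scale):
    """Q step: FP32 -> INT8. Python's round() rounds half to even;
    the result is saturated to the INT8 range [-128, 127]."""
    q = round(x / scale)
    return max(-128, min(127, q))


def dequantize(q, scale):
    """DQ step: INT8 -> FP32, using the same scale as the Q step."""
    return q * scale


# Hypothetical calibration activations, for illustration only.
calibration_data = [-2.54, -0.313, 0.0, 0.731, 1.3]

scale = calibrate_scale(calibration_data)
roundtrip = [dequantize(quantize(x, scale), scale) for x in calibration_data]

# For values inside the calibrated range, the round-trip error
# is bounded by half a quantization step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(calibration_data, roundtrip))
print(f"scale={scale:.6f} max round-trip error={max_error:.6f}")
```

In an explicitly quantized ONNX model, the scale computed during calibration is stored alongside each Q/DQ node, which is what lets the engine builder pick INT8 kernels for the layers between them.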
Originally reported by Reddit