Reddit via Reddit

r/LocalLLaMA: INT8 Q/DQ on Blackwell Beats TRT 10 Auto-FP16 by 1.8× — Mandatory Explicit Quantization Path Hits Dedicated 5th-Gen Tensor Core Pipeline

nvidia inference chips inference-optimization blackwell

Summary

A developer reports that TensorRT 11's mandatory INT8 Q/DQ (explicit quantization) path on NVIDIA Blackwell GPUs runs 1.8× faster than TRT 10's auto-FP16 builds. According to the report, the explicit path routes work through the dedicated INT8 pipeline of Blackwell's 5th-generation Tensor Cores, which auto-FP16 builds never reach. The result came from proper post-training quantization (PTQ) of a 188 MB FP32 ONNX model, and the developer has published a practical calibration writeup for practitioners upgrading to TRT 11. The finding matters to production inference teams migrating to Blackwell hardware who assumed auto-FP16 was the path of least resistance.
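For readers unfamiliar with what a Q/DQ pair actually computes, the following is a minimal sketch in plain Python. It assumes symmetric per-tensor INT8 quantization with max calibration; the post does not say which calibration method the developer used, and the function names here (`calibrate_scale`, `quantize`, `dequantize`, `fake_quant`) are illustrative, not TensorRT or ONNX APIs.

```python
# Illustrative only: symmetric per-tensor INT8 quantize/dequantize.
# Assumption: max calibration (largest observed magnitude maps to 127).

def calibrate_scale(samples):
    """Derive a scale from calibration data using the max absolute value."""
    amax = max(abs(x) for x in samples)
    return amax / 127.0 if amax > 0 else 1.0

def quantize(x, scale):
    """Q step: float -> int8. Python's round() rounds halves to even,
    then the result is clamped to the int8 range [-128, 127]."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    """DQ step: int8 -> float."""
    return q * scale

def fake_quant(values, scale):
    """Apply Q followed by DQ, which is what a Q/DQ pair represents in a graph."""
    return [dequantize(quantize(v, scale), scale) for v in values]

# Calibrate on representative activations, then round-trip some values.
calibration_data = [-1.0, 0.25, 0.5]
scale = calibrate_scale(calibration_data)
approx = fake_quant([0.25, 0.5, 10.0], scale)  # 10.0 is outside the range and saturates
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it saturate. That is why the choice of calibration data matters for PTQ accuracy.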