How to quantize a trained model to INT8 and run inference on GPU


I need to quantize my model to INT8 using either PTQ (post-training quantization) or QAT (quantization-aware training), or both, and then run inference on GPU using TensorRT.

I have seen the static quantization page, which says that quantization is only available on CPU. Is that still the case? Is there any way to achieve this on GPU?
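To be concrete about the CPU flow I'm comparing against: eager-mode post-training static quantization as described on that page. A minimal sketch (the toy model and shapes are just for illustration, and I'm assuming the `torch.ao.quantization` module layout of recent PyTorch):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class TinyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where FP32 -> INT8 conversion happens
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # marks where INT8 -> FP32 conversion happens

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyConvNet().eval()
model.qconfig = get_default_qconfig("fbgemm")  # x86 CPU backend
prepared = prepare(model)

# Calibration: run representative data so the observers record activation ranges.
with torch.no_grad():
    prepared(torch.randn(4, 3, 16, 16))

quantized = convert(prepared)              # swaps modules for INT8 versions
out = quantized(torch.randn(1, 3, 16, 16))
print(quantized.conv.weight().dtype)       # torch.qint8
```

This runs fine, but only on CPU backends like fbgemm, which is exactly the limitation I'm asking about.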

I have tried the pytorch-quantization toolkit with Torch-TensorRT using fake quantization. However, after compiling the exported TorchScript with torch.int8 enabled, my model size and inference speed are the same as with FP16.
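My current understanding, which may explain the unchanged model size: fake quantization only simulates INT8 rounding while the tensors stay in FP32; the actual INT8 storage and kernels only appear once the backend lowers the Q/DQ pairs. A minimal illustration in plain PyTorch (the tensor and scale here are placeholders):

```python
import torch

# Fake-quantize a weight tensor: round onto the INT8 grid, then dequantize.
w = torch.randn(8, 8)
scale = float(w.abs().max()) / 127  # symmetric per-tensor scale
w_fq = torch.fake_quantize_per_tensor_affine(w, scale, 0, -128, 127)

# The result is still FP32 -- fake quantization introduces rounding error but
# gives no size or speed benefit until a backend converts Q/DQ pairs to INT8 kernels.
print(w_fq.dtype)                       # torch.float32
print((w - w_fq).abs().max() <= scale)  # quantization error bounded by one step
```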

Please let me know if there is an example or blog post explaining this.

Best Regards

What are the ops in your model? I think TensorRT would give a good speedup for a model with a lot of convs, but I'm not sure about other ops.

pytorch-quantization from Torch-TensorRT is a flow from NVIDIA. We also have a flow on the PyTorch quantization side; please take a look at the tests in: TensorRT/ at master · pytorch/TensorRT · GitHub