I need to quantize my model to INT8 using PTQ, QAT, or both, and then run inference on a GPU with TensorRT.
I have seen the static quantization page, which says quantization is only available on CPU. Is that still the case? Is there any way to achieve this on GPU?
I have tried the pytorch-quantization toolkit (used with Torch-TensorRT) with fake quantization. However, after compiling the exported TorchScript with torch.int8 as the enabled precision, my model size and inference speed are the same as with FP16.
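Roughly, this is the flow I followed (a simplified sketch, not a runnable script — `MyModel`, the input shape, and the calibration step are placeholders, and it needs a GPU with TensorRT installed):

```python
import torch
import torch_tensorrt
from pytorch_quantization import quant_modules

# Monkey-patch standard torch.nn layers with fake-quantized versions
quant_modules.initialize()

model = MyModel().cuda().eval()  # placeholder for my actual model
# ... PTQ calibration and/or QAT fine-tuning happens here ...

# Export the (fake-)quantized model to TorchScript
scripted = torch.jit.trace(model, torch.randn(1, 3, 224, 224).cuda())

# Compile with Torch-TensorRT, requesting INT8 precision
trt_model = torch_tensorrt.compile(
    scripted,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.int8},
)
```

After this, `trt_model` runs but shows no size or latency improvement over an FP16 compile of the same model.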
Please let me know if there is an example or blog post explaining this.