Hi, the following question is unclear to me, and I was not able to find an answer in the PyTorch documentation or on the forum. I found a similar question on this forum, but the answer there still leaves it unclear to me: “what-is-the-difference-between-pytorch-quantization-and-torch-ao-quantization”
What is the recommended and future-proof way to use PyTorch native quantization when deploying to int8 TensorRT?
I am curious about two types of “TensorRT deployment”:
- Deployment with Nvidia tools such as NVIDIA Triton Inference Server
- Torch-TensorRT
Currently, I am using Nvidia’s PyTorch Quantization tool, which is also covered in PyTorch’s documentation: “DEPLOYING QUANTIZATION AWARE TRAINED MODELS IN INT8 USING TORCH-TENSORRT”. With this tool, I can deploy models to Torch-TensorRT or via ONNX → TensorRT (although there are some limitations).
However, Nvidia’s PyTorch Quantization tool now gives a deprecation warning and suggests using PyTorch native quantization. From the PyTorch Quantization documentation it is unclear which future-proof route I should follow, since the documented workflows all focus on CPU deployment. For me, it is not a problem if I first need to convert the model to ONNX and then to TensorRT. Or should I (still) stay with Nvidia’s PyTorch Quantization tool for now?
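For context, this is roughly what I understand any of these routes has to carry over to TensorRT: the quantize/dequantize (Q/DQ, or "fake quantization") step that quantization-aware training inserts around tensors. The sketch below is only my own plain-Python illustration of symmetric int8 fake quantization, not the API of either library; the function name `fake_quantize` and the `amax` parameter (the calibrated absolute maximum) are names I made up for this example, and the symmetric [-127, 127] range is an assumption.

```python
def fake_quantize(values, amax, num_bits=8):
    """Quantize to a symmetric signed integer grid, then dequantize.

    Hypothetical helper for illustration only; it is not part of
    PyTorch or Nvidia's PyTorch Quantization tool.
    """
    if amax <= 0:
        raise ValueError("amax must be positive")
    bound = 2 ** (num_bits - 1) - 1      # 127 for int8 (assumed symmetric range)
    scale = amax / bound                 # size of one integer step in float units
    result = []
    for v in values:
        q = round(v / scale)             # quantize: nearest integer level
        q = max(-bound, min(bound, q))   # clip to [-bound, bound]
        result.append(q * scale)         # dequantize back to float
    return result


# Values beyond amax saturate at +/- amax; values inside snap to the grid.
print(fake_quantize([0.0, 0.5, 1.0, 2.0, -3.0], amax=1.0))
```

So my question is really which tool should produce these Q/DQ steps (and the scales behind them) so that the result stays usable with TensorRT in the future.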