I ran quantization-aware training in PyTorch and converted the model to a quantized model with
torch.ao.quantization.convert. I know PyTorch does not yet support inference of quantized models on the GPU; however, is there a way to convert the quantized PyTorch model to TensorRT?
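For context, the eager-mode QAT flow described above looks roughly like the sketch below. The toy model, shapes, and the single forward pass standing in for the training loop are placeholders, not the actual code:

```python
import torch
import torch.ao.quantization as tq

# Toy model wrapped with quant/dequant stubs (placeholder, not the real model)
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = Net().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
prepared = tq.prepare_qat(model)          # insert fake-quant observers
prepared(torch.randn(1, 3, 32, 32))       # stands in for the QAT training loop
quantized = tq.convert(prepared.eval())   # real int8 model; CPU-only in eager mode
out = quantized(torch.randn(1, 3, 32, 32))
```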
I tried torch-tensorrt, following the guide at
pytorch/TensorRT: PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT (github.com). However, the conversion failed with the following errors:
I don’t think you can just take an eager-mode quantized model and lower it to TRT; TRT also has its own quantization support (
https://github.com/pytorch/TensorRT/tree/main/examples/int8/ptq) that might work, but the native PyTorch-to-TRT lowering is still an early prototype at the moment ( Quantization — PyTorch main documentation). Maybe @jerryzh168 knows if there’s an easy way to do this atm?
see the example here:
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
quantized = convert_to_reference_fx(prepared)
# lower to trt
trt_mod = lower_to_trt(quantized, inputs, shape_ranges)
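Filling in the elided pieces, a runnable version of that FX flow up to (but not including) the TRT lowering might look like this. The toy model, calibration data, and backend choice are placeholder assumptions:

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_to_reference_fx

# Placeholder model and example input; substitute your own
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)  # calibration pass (use real data in practice)
quantized = convert_to_reference_fx(prepared)  # reference quantized model
out = quantized(*example_inputs)
# the TRT lowering step (which needs a GPU build of Torch-TensorRT)
# would then consume `quantized`
```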
You can try that, though it uses FX quantization, not eager mode.
I already used FX quantization, but the conversion still failed. Is it because PyTorch's quantization functionality is at an early stage? Will the conversion of quantized PyTorch models to TensorRT become easier in the future? Do you have any recommended tools for PyTorch quantization and TensorRT deployment right now?
Do those tests I linked pass in your setup?
If they do, then there's something about your model that's an issue; if not, then it's something in your setup.
If it's the former, provide a repro and we can take a deeper look. If it's the latter, you'd probably need to ask the TensorRT folks.
The test failed.
It seems that the InstanceNorm module of PyTorch contains conditional statements such as
if input.dim() not in (3, 4)
that cannot be traced. Is this a bug in PyTorch itself?
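That failure mode can be reproduced in isolation. A minimal sketch (a toy module mimicking the shape check, not the actual InstanceNorm code) shows that FX symbolic tracing rejects control flow that depends on tensor properties:

```python
import torch
import torch.fx

class CheckDims(torch.nn.Module):
    # mimics the kind of input-dim check found in InstanceNorm's forward
    def forward(self, x):
        if x.dim() not in (3, 4):
            raise ValueError("expected 3D or 4D input")
        return x * 2

try:
    torch.fx.symbolic_trace(CheckDims())
    traced_ok = True
except torch.fx.proxy.TraceError:
    # tensor-dependent control flow cannot be symbolically traced
    traced_ok = False
```

This is a known limitation of symbolic tracing rather than a bug per se: the branch condition depends on a traced value, so FX cannot decide it at trace time.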
Thanks to those who responded to my question.
I gave up on using PyTorch's own quantization stuff.
I finally quantized my model successfully and converted it to ONNX and then TensorRT using the packages
pytorch-quantization · PyPI and
NVIDIA/TensorRT (github.com).
Hopefully quantization and its deployment will become easier within PyTorch in the future.