How to convert the quantized model to tensorrt for GPU inference

xuehy · July 21, 2023, 5:58am

I ran quantized aware training in pytorch and convert the model into quantized with torch.ao.quantization.convert. I know pytorch does not yet support the inference of the quantized model on GPU, however, is there a way to convert the quantized pytorch model into tensorrt?

I tried torch-tensorrt following the guide on pytorch/TensorRT: PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT (github.com). However, the conversion failed with following errors:

HDCharles · July 21, 2023, 10:02pm

I don’t think you can just take an eager mode quantized model and lower it to trt, trt also has its own quantization stuff (https://github.com/pytorch/TensorRT/tree/main/examples/int8/ptq) that might work, but the native pytorch to trt lowering is still an early prototype atm (Quantization — PyTorch main documentation) maybe @jerryzh168 if there’s an easy way to do this atm?

HDCharles · July 21, 2023, 10:39pm

see the example here:

github.com

pytorch/TensorRT/blob/main/py/torch_tensorrt/fx/test/quant/test_quant_trt.py#L429-L444


      
          prepared = prepare(
              m,
              {"": self.trt_qconfig},
              example_inputs,
              backend_config=self.trt_backend_config_dict,
          )
          self.checkGraphModuleNodes(prepared, expected_node_occurrence=no_prepare)
          # calibration
          prepared(*inputs)
          quantized = convert_to_reference_fx(
              prepared,
              backend_config=self.trt_backend_config_dict,
          )
          self.checkGraphModuleNodes(quantized, expected_node_occurrence=no_convert)
          # lower to trt
          trt_mod = lower_to_trt(quantized, inputs, shape_ranges)

youu can try that, though it uses fx quantization, not eager mode.

xuehy · July 24, 2023, 2:02am

I already used fx quantization, but the conversion still failed. Is it because the pytorch quanization functionality is at early stage? Will the conversion of pytorch quantized model to tensorrt be easier in the future? Do you have any recommended tools for pytorch quantization and tensorrt distribution now?

HDCharles · July 24, 2023, 4:06pm

do those tests I linked pass for your setup?

if they do then there’s something about your model thats an issue, if not then its something in your setup.

if its the former, if you provide a repro and we can take a deeper look. If its the latter, you’d probably need to ask the TensorRT folks.

xuehy · July 25, 2023, 6:01am

The test failed.

It seems that the instancenorm module of pytorch contains conditional sentences if input.dim() not in (3, 4) that cannot be traced. Is it a bug of pytorch itself?

jerryzh168 · July 25, 2023, 10:13pm

this is expected, see: torch.fx — PyTorch 2.0 documentation

you’ll need to modify the model in order to make it traceable by following: (prototype) FX Graph Mode Quantization User Guide — PyTorch Tutorials 2.0.1+cu117 documentation

xuehy · July 26, 2023, 5:48am

Thanks for those who responded to my question.

I gave up using pytorch’s own quantization stuff.

I finally successfully quantized my model and converted it into onnx and then tensorrt with package pytorch-quantization · PyPI, onnx and NVIDIA/TensorRT: NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. (github.com)

Hopefully quantization and its deployment can be easier within pytorch in the future.

pytorch-playground · January 8, 2024, 4:36pm

Hi, i know pytorch-quantization package can quantize our float model, but how to convert quantized model to onnx? And this onnx can be build by tensorRT or not ? Or do you have some hack method?

HDCharles · January 11, 2024, 8:01pm

i think when he says pytorch-quantization he means a specific tool documented here: pytorch-quantization master documentation

which is external and not developed by the pytorch ao (quantization+pruning) team.