How to convert a quantized model to TensorRT for GPU inference

I ran quantization-aware training in PyTorch and converted the model to a quantized one. I know PyTorch does not yet support inference of quantized models on GPU. However, is there a way to convert the quantized PyTorch model to TensorRT?

I tried torch-tensorrt, following the guide in pytorch/TensorRT (PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT). However, the conversion failed with the following errors:

I don’t think you can just take an eager-mode quantized model and lower it to TRT. TRT also has its own quantization tooling that might work, but the native PyTorch-to-TRT lowering is still an early prototype at the moment (see Quantization in the PyTorch main documentation). @jerryzh168, is there an easy way to do this at the moment?

see the example here:

You can try that, though it uses FX quantization, not eager mode.

I already used FX quantization, but the conversion still failed. Is it because PyTorch's quantization functionality is still at an early stage? Will converting a PyTorch quantized model to TensorRT become easier in the future? Do you have any recommended tools for PyTorch quantization and TensorRT deployment now?

Do those tests I linked pass for your setup?

If they do, then something about your model is the issue; if not, then it's something in your setup.

If it's the former, provide a repro and we can take a deeper look. If it's the latter, you’d probably need to ask the TensorRT folks.

The test failed.

It seems that PyTorch's instancenorm module contains a conditional statement, "if input.dim() not in (3, 4)", that cannot be traced. Is this a bug in PyTorch itself?

This is expected: symbolic tracing cannot follow control flow that depends on the input. See the torch.fx page in the PyTorch 2.0 documentation.

You’ll need to modify the model to make it traceable by following the (prototype) FX Graph Mode Quantization User Guide (PyTorch Tutorials 2.0.1+cu117 documentation).
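
For readers hitting the same error, here is a minimal sketch of why the trace fails and one possible way around it. `DimChecked`, `Net`, and `LeafTracer` are hypothetical names invented for illustration; they are not the original poster's model or the exact recipe from the guide. The idea is only that a branch on `x.dim()` cannot be evaluated on a symbolic proxy, while a tracer that treats the offending submodule as an opaque leaf never has to look inside it.

```python
import torch
import torch.fx as fx
from torch import nn
from torch.fx.proxy import TraceError


class DimChecked(nn.Module):
    """Hypothetical stand-in for a layer that branches on the input's rank."""

    def forward(self, x):
        # During symbolic tracing x is a Proxy, so this condition has no
        # concrete truth value and tracing stops here.
        if x.dim() not in (3, 4):
            raise ValueError("expected 3D or 4D input")
        return x * 2.0


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.checked = DimChecked()

    def forward(self, x):
        return self.checked(x) + 1.0


class LeafTracer(fx.Tracer):
    """Tracer that records DimChecked as a single call_module node."""

    def is_leaf_module(self, m, module_qualified_name):
        if isinstance(m, DimChecked):
            return True
        return super().is_leaf_module(m, module_qualified_name)


model = Net()

# Plain symbolic tracing is expected to fail on the input-dependent branch.
try:
    fx.symbolic_trace(model)
    plain_trace_error = None
except TraceError as err:
    plain_trace_error = err

# Tracing with the custom tracer keeps the branch inside an untraced leaf.
graph = LeafTracer().trace(model)
traced = fx.GraphModule(model, graph)

x = torch.randn(2, 3, 4)
same_output = torch.equal(traced(x), model(x))
```

The leaf module still runs its rank check at call time with real tensors; it is just invisible to the graph. Whether an opaque leaf is acceptable depends on what later passes (quantization, lowering) need to see, so rewriting the branch out of the model may still be necessary.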

Thanks to those who responded to my question.

I gave up using pytorch’s own quantization stuff.

I finally managed to quantize my model and convert it to ONNX and then to TensorRT using the pytorch-quantization package (PyPI), onnx, and NVIDIA/TensorRT.
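
The poster did not share code, so the following is only a rough sketch of what such a pipeline might look like, not their actual script. It assumes the model was built from pytorch-quantization layers and has already been calibrated or fine-tuned; the function names, file names, and the opset version are illustrative choices. Setting `TensorQuantizer.use_fb_fake_quant` is the export switch described in the pytorch-quantization documentation, and newer releases of the toolkit may offer other ways to do the same thing.

```python
def export_quantized_onnx(model, dummy_input, onnx_path="model_qat.onnx"):
    """Export a pytorch-quantization model to ONNX with explicit Q/DQ nodes.

    Assumption: `model` already contains calibrated TensorQuantizer modules.
    """
    # Imported lazily so this file loads even where torch or the NVIDIA
    # toolkit is not installed.
    import torch
    from pytorch_quantization import nn as quant_nn

    # Have the quantizers emit PyTorch fake-quant ops, which the ONNX exporter
    # can represent as QuantizeLinear/DequantizeLinear pairs.
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    model.eval()
    torch.onnx.export(model, dummy_input, onnx_path, opset_version=13)
    return onnx_path


def trtexec_command(onnx_path, engine_path="model.engine"):
    """Build a trtexec invocation that compiles the ONNX file to an INT8 engine."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        "--int8",
        f"--saveEngine={engine_path}",
    ]
```

On a machine with TensorRT installed, the second step could be run with `subprocess.run(trtexec_command("model_qat.onnx"), check=True)`.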

Hopefully quantization and its deployment can be easier within pytorch in the future.
