Run quantized model on GPU

Hi
I want to run inference on a quantized model on the GPU, but it only works on the CPU.

I have quantized a PyTorch nn model using quantize_dynamic_jit and torch.jit.trace. It performs int8 quantization on the linear layers. It has reduced the size of the model by approximately 71% and it is still very accurate. The problem is that I only seem to be able to run inference on the CPU, not the GPU, so the original model still outperforms the quantized one. Right now I am using the CPU both for quantization and for inference. I believe quantization itself is only possible on the CPU, though.
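
For reference, here is a minimal, simplified sketch of my flow (TinyNet and the input shape are stand-ins for my actual model and data):

```python
import torch
import torch.nn as nn
from torch.quantization import per_channel_dynamic_qconfig
from torch.quantization.quantize_jit import quantize_dynamic_jit

# Small stand-in model; the real one just has more linear layers
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet().eval()
example_input = torch.randn(1, 128)

# Trace to TorchScript, then dynamically quantize the linear layers to int8
traced = torch.jit.trace(model, example_input)
quantized = quantize_dynamic_jit(traced, {'': per_channel_dynamic_qconfig})

out = quantized(example_input)  # runs fine on CPU
# moving the quantized module / inputs to CUDA is where it fails for me
```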

I have also tried to load the model as a PyTorch nn.Module instead of as TorchScript, but it seems the model architecture changes.
Any help is greatly appreciated.

Our GPU quantization support is in torchao (https://github.com/pytorch-labs/ao). The repo is still under heavy development, but please feel free to ping me if you run into any issues.
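
For example, with a recent torchao release the int8 weight-only path on GPU looks roughly like this (a sketch, not the only supported flow; the API names may differ in older snapshots of the repo):

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# Toy model standing in for a real network; quantize_ rewrites the Linear
# weights to int8 in place, and inference then runs on the GPU
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = model.to(torch.bfloat16).cuda().eval()

quantize_(model, int8_weight_only())

x = torch.randn(1, 128, dtype=torch.bfloat16, device='cuda')
with torch.no_grad():
    out = model(x)
```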