Run quantized model on GPU

Hi
I want to run inference on a quantized model on the GPU, but it only works on the CPU.

I have quantized a PyTorch nn model using quantize_dynamic_jit and torch.jit.trace. It performs int8 quantization on the linear layers. It has reduced the size of the model by approximately 71% and it is still very accurate. The problem is that I only seem to be able to run inference on the CPU, not the GPU, so the original model still outperforms the quantized one. Right now I am using the CPU both for quantization and for inference. I believe quantization itself is only possible on the CPU, though.
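
For reference, here is a minimal, simplified sketch of my flow (TinyNet and the input shape are stand-ins for my actual model and data):

```python
import torch
import torch.nn as nn
from torch.quantization import per_channel_dynamic_qconfig
from torch.quantization.quantize_jit import quantize_dynamic_jit

# Small stand-in model; the real one just has more linear layers
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet().eval()
example_input = torch.randn(1, 128)

# Trace to TorchScript, then dynamically quantize the linear layers to int8
traced = torch.jit.trace(model, example_input)
quantized = quantize_dynamic_jit(traced, {'': per_channel_dynamic_qconfig})

out = quantized(example_input)  # runs fine on CPU
# moving the quantized module / inputs to CUDA is where it fails for me
```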

I have also tried to load the model as a PyTorch nn.Module instead of as TorchScript, but it seems the model architecture changes.
Any help is greatly appreciated.

Our GPU quantization support is in torchao (https://github.com/pytorch-labs/ao). The repo is still under heavy development, but please feel free to ping me if you run into any issues.
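
For example, with a recent torchao release the int8 weight-only path on GPU looks roughly like this (a sketch, not the only supported flow; the API names may differ in older snapshots of the repo):

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# Toy model standing in for a real network; quantize_ rewrites the Linear
# weights to int8 in place, and inference then runs on the GPU
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = model.to(torch.bfloat16).cuda().eval()

quantize_(model, int8_weight_only())

x = torch.randn(1, 128, dtype=torch.bfloat16, device='cuda')
with torch.no_grad():
    out = model(x)
```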