Inference speed of JIT model

I’m a newbie with TorchScript. I quantized a ResNet18 model and used torch.jit.script to compress it: the model shrank from 45 MB to about 0.01 MB in scripted form. I expected inference with the scripted model on CPU to be faster than (or at least comparable to) the original model on GPU. Instead, when both models run on the CPU the scripted one is only about twice as fast, and it is around four times slower than the original model running on the GPU.
I followed this code to quantize and script the model:
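For reference, the size numbers come from saving each model to disk, roughly like this (a minimal sketch; model is the original fp32 ResNet18, and model_quantized_scripted is produced by the code below):

import os
import torch

torch.save(model.state_dict(), "resnet18_fp32.pth")           # original fp32 weights
torch.jit.save(model_quantized_scripted, "resnet18_int8.pt")  # scripted quantized model
for f in ("resnet18_fp32.pth", "resnet18_int8.pt"):
    print(f, os.path.getsize(f) / 1e6, "MB")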

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

num_calibration_batches = 32
torch.backends.quantized.engine = "x86"

# insert observers; example_inputs should match the real input shape
prepared_model = prepare_fx(model, get_default_qconfig_mapping("x86"),
                            example_inputs=torch.rand((1, 3, 32, 32)))
prepared_model.to("cpu")

with torch.no_grad():  # calibrate using random data
    for _ in range(num_calibration_batches):
        data = torch.rand((1, 3, 32, 32))
        prepared_model(data)

# retrain the model to maintain the accuracy
train(prepared_model.to("cuda"), trainloader, valloader, 2, device)
prepared_model.to("cpu")
prepared_model.eval()
model_quantized = convert_fx(prepared_model)

model_quantized_scripted = torch.jit.script(model_quantized)  # script the model
model_quantized_scripted(torch.rand((1, 3, 32, 32)))  # sanity-check forward pass
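And this is roughly how I time the two models (a minimal sketch: a few warm-up runs first so one-time JIT compilation is excluded, and torch.cuda.synchronize so asynchronous GPU kernels are fully counted; model is again the original fp32 ResNet18):

import time
import torch

def benchmark(net, example, n_warmup=10, n_iters=100, cuda=False):
    """Average latency per forward pass, in milliseconds."""
    with torch.no_grad():
        for _ in range(n_warmup):      # warm-up: JIT compilation, cuDNN autotuning, caches
            net(example)
        if cuda:
            torch.cuda.synchronize()   # wait for queued GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(n_iters):
            net(example)
        if cuda:
            torch.cuda.synchronize()   # wait for the last kernels before stopping the clock
        return (time.perf_counter() - start) / n_iters * 1000

cpu_input = torch.rand((1, 3, 32, 32))
print("quantized + scripted, CPU:", benchmark(model_quantized_scripted, cpu_input), "ms")
gpu_model = model.to("cuda").eval()    # original fp32 model on GPU
print("fp32, GPU:", benchmark(gpu_model, cpu_input.to("cuda"), cuda=True), "ms")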

Please let me know: is this normal, or what problem am I facing?