Why does torch.jit.load(quantized_model) take much longer than torch.jit.load(fp32_model)?

Just wondering why an INT8 quantized model takes much longer to load via torch.jit.load() than an fp32 model. It seems very strange, because the saved file size (using torch.jit.save()) of the INT8 quantized model is 4x smaller than that of the fp32 model.

Has anyone run into the same issue? Is there any way to reduce the loading time of quantized models to something closer to fp32 models?
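For reference, this is roughly how I measure the load time. It is a minimal sketch; the file names fp32_model.pt and int8_model.pt are placeholders for whatever you saved with torch.jit.save():

```python
import time
import torch

# Placeholder paths -- replace with your own TorchScript files.
FP32_PATH = "fp32_model.pt"
INT8_PATH = "int8_model.pt"

def timed_load(path):
    """Load a TorchScript model and report wall-clock load time."""
    start = time.perf_counter()
    model = torch.jit.load(path)
    elapsed = time.perf_counter() - start
    print(f"{path}: {elapsed:.3f} s")
    return model

fp32_model = timed_load(FP32_PATH)
int8_model = timed_load(INT8_PATH)
```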

A quantized model may take longer to load because the weights of quantized Linear/Conv modules have to be packed when the model is deserialized.
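If you want to sanity-check that the packing step itself is non-trivial, you can time the prepack op directly. This is a rough sketch, assuming the fbgemm backend and an arbitrary 4096x4096 Linear weight; the real cost depends on your model's layer shapes and quantization backend:

```python
import time
import torch

# Assumption: fbgemm backend (x86). Use "qnnpack" on ARM.
torch.backends.quantized.engine = "fbgemm"

# Arbitrary layer size chosen for illustration only.
weight = torch.quantize_per_tensor(
    torch.randn(4096, 4096), scale=0.1, zero_point=0, dtype=torch.qint8
)
bias = torch.zeros(4096)

start = time.perf_counter()
packed = torch.ops.quantized.linear_prepack(weight, bias)
print(f"linear_prepack: {time.perf_counter() - start:.4f} s")
```

Summing this kind of cost over every quantized Linear/Conv layer gives a rough idea of how much of the torch.jit.load() time goes into weight packing rather than file I/O.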
