Recommended practice for reusing quantized models?

Hello,

To my knowledge, reusing networks in PyTorch typically requires a network class definition and a weights file (e.g., .pth), which is saved and loaded using the state_dict mechanism.

In quantization, the problem is that the quantization process (e.g., post-training quantization) modifies the network class instance. This means that in order to reproduce the quantized model, either a programmer needs to define a new class that will be compatible with the modified instance, or the quantization process must be repeated on every new instance of the original FP32 model.

On Linux machines, it might be a reasonable workaround to post-training-quantize every new instance of the network. However, this scenario is not possible on Windows machines, since performing quantization is not currently supported on them.

What, then, is the recommended practice for saving and loading an already-quantized network?

For example, would Python’s pickle mechanism do the job? Can I quantize a network on Linux, save it with pickle, and reload it on a Windows machine? Is that a recommended approach?

You might want to save the quantized_model as TorchScript, using a helper such as:

save_torchscript_model(model=quantized_model, model_dir=model_dir, model_filename=quantized_model_filename)

and then load the scripted model on the other machine for inference:

quantized_jit_model = load_torchscript_model(model_filepath=quantized_model_filepath, device=cpu_device)
quantized_jit_model.eval()
outputs = quantized_jit_model(inputs)
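
Note that save_torchscript_model and load_torchscript_model are not PyTorch functions; they appear to be user-defined helpers. A minimal sketch of what they might look like, assuming they simply wrap torch.jit.script, torch.jit.save and torch.jit.load (the bodies below are my guess, not the original helpers):

```python
import os
import torch


def save_torchscript_model(model, model_dir, model_filename):
    """Hypothetical helper: script `model` and write it to model_dir/model_filename."""
    os.makedirs(model_dir, exist_ok=True)
    model_filepath = os.path.join(model_dir, model_filename)
    # The TorchScript archive stores both the compiled code and the weights.
    torch.jit.save(torch.jit.script(model), model_filepath)
    return model_filepath


def load_torchscript_model(model_filepath, device):
    """Hypothetical helper: load a TorchScript archive onto `device`."""
    # No Python class definition is needed to load the archive.
    return torch.jit.load(model_filepath, map_location=device)
```

Because the archive carries the model structure along with the weights, the loading side needs neither the original class definition nor a second quantization pass.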

Quantization previously didn’t work on Windows because FBGEMM wasn’t supported there. This is no longer the case, so there should be no issue (see “Is this available on windows?”, Issue #150 in pytorch/FBGEMM on GitHub). Also, the problem was never that you couldn’t perform quantization on Windows; it was that Windows had no kernels for quantized ops. You could still quantize a model, you just couldn’t run the quantized model.
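
If you want to check whether the quantized kernels are present on a particular machine, one option (a sketch, assuming a reasonably recent PyTorch build) is to look at the quantized backends that PyTorch reports:

```python
import torch

# Quantized backends compiled into this PyTorch build, e.g. "fbgemm" or "qnnpack".
engines = torch.backends.quantized.supported_engines
print("supported engines:", engines)
print("active engine:", torch.backends.quantized.engine)

# FBGEMM is the backend discussed in this thread.
has_fbgemm = "fbgemm" in engines
print("fbgemm available:", has_fbgemm)
```

If fbgemm is missing from the list, quantized ops that depend on it will fail at run time even though the model itself loads.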

Note: TorchScript wouldn’t have gotten around the missing kernels, but it is a good solution if you don’t want to repeat the quantization process every time you load a quantized model.