Does a lite model have a performance drop?


I would like to deploy my PyTorch model on mobile devices.

Based on this guide: Home | PyTorch, I need to quantize my model and convert it to a .ptl file.
It seems I can run inference with the quantized model (How to load quantized model for inference), so I can evaluate whether performance drops at that stage.

But how can I make sure there is no performance drop going from the quantized/optimized model to the .ptl file? Or is it guaranteed that converting a quantized model to .ptl causes no performance drop?
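For context, the pipeline I am following looks roughly like this (a minimal sketch; the toy model and the `model.ptl` filename are placeholders, and I use dynamic quantization here, though static quantization with calibration is also an option):

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Hypothetical stand-in for the model to be deployed.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()

# 1. Quantize (dynamic quantization shown; static quantization needs calibration).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# 2. Script, optimize for mobile, and save in lite-interpreter (.ptl) format.
scripted = torch.jit.script(quantized)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("model.ptl")
```

My question is about the last two lines: whether scripting/optimizing/saving can itself change the model's outputs.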


Quantized models will typically show some degradation in accuracy; it's up to you to decide whether the perf speedups are worth that degradation. This is a good blog on the topic: Practical Quantization in PyTorch | PyTorch
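One way to get a quick sense of that degradation is to compare the quantized model's outputs against the fp32 original (a sketch with a hypothetical toy model; on a real task you would compare task metrics such as accuracy on your eval set instead of raw outputs):

```python
import torch
import torch.nn as nn

# Hypothetical fp32 model; substitute your own model and a real eval set.
fp32_model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10)).eval()
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

# Quantify the output difference on sample inputs.
x = torch.randn(64, 32)
err = (fp32_model(x) - int8_model(x)).abs().max().item()
print(f"max abs output difference: {err:.4f}")
```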

Thanks Mark! Do you know if there is any way to run inference on a .ptl file in Colab/Python, so I can get a better idea of how much the performance drops?
I think I might just try fp32 first, which I assume should have no precision loss, but I am not sure whether there could be a performance drop due to different implementations of PyTorch operations (e.g., the implementation in the lite_interpreter).
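In case it helps others: torch ships a Python loader for lite-interpreter files, `torch.jit.mobile._load_for_lite_interpreter`, which lets you run a .ptl file in Colab and compare it against the original model (a minimal sketch; the toy model and `model.ptl` filename are placeholders):

```python
import torch
import torch.nn as nn
from torch.jit.mobile import _load_for_lite_interpreter
from torch.utils.mobile_optimizer import optimize_for_mobile

# Hypothetical toy model standing in for the deployed one (fp32, no quantization).
model = nn.Linear(4, 2).eval()
scripted = optimize_for_mobile(torch.jit.script(model))
scripted._save_for_lite_interpreter("model.ptl")

# Load the .ptl with the same lite interpreter used on device.
lite_model = _load_for_lite_interpreter("model.ptl")

# Compare lite-interpreter outputs against the original eager model.
x = torch.randn(8, 4)
print(torch.allclose(model(x), lite_model(x), atol=1e-6))
```

Note that `_load_for_lite_interpreter` has a leading underscore, i.e. it is an internal API, so it may change between releases.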