How does reduce_range affect model performance?

Hi,
I’m trying to do post-training static quantization on MobileNetV2, as demonstrated in this tutorial.
I wanted to check how different observers affect the performance of the model, and I got this:

activation=MinMaxObserver, weight=MinMaxObserver(dtype=torch.qint8) -> 67.27
activation=MinMaxObserver, weight=MovingAverageMinMaxObserver(dtype=torch.qint8) -> 67.21
activation=MinMaxObserver, weight=HistogramObserver(dtype=torch.qint8) -> 65.98
activation=MinMaxObserver, weight=PerChannelMinMaxObserver(dtype=torch.qint8) -> 42.87
activation=MinMaxObserver, weight=MovingAveragePerChannelMinMaxObserver(dtype=torch.qint8) -> 43.92

I noticed that the per-channel observers give significantly worse accuracy, so I tried again with the reduce_range flag and got these results (the qconfig setup is sketched below):

activation=MinMaxObserver, weight=PerChannelMinMaxObserver(dtype=torch.qint8, reduce_range=True) -> 69.19
activation=MinMaxObserver, weight=MovingAveragePerChannelMinMaxObserver(dtype=torch.qint8, reduce_range=True) -> 68.93
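
For reference, this is roughly how I build the qconfig for the per-channel + reduce_range case (a sketch: I'm using torchvision's quantizable MobileNetV2 here as a stand-in for the tutorial's model definition, and I pass qscheme=torch.per_channel_symmetric to match the default per-channel weight settings; fusion, calibration data, and evaluation follow the tutorial):

```python
import torch
from torch.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver, prepare, convert
from torchvision.models.quantization import mobilenet_v2

torch.backends.quantized.engine = "fbgemm"  # x86 backend

# Quantizable MobileNetV2 as a stand-in for the tutorial's model definition
# (in my actual runs I load the pretrained float weights as in the tutorial)
model = mobilenet_v2(quantize=False).eval()
model.fuse_model()  # fuse Conv+BN+ReLU modules before quantization

model.qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8,
        qscheme=torch.per_channel_symmetric,
        reduce_range=True,  # the flag in question
    ),
)

prepare(model, inplace=True)              # insert observers
with torch.no_grad():                     # calibration (real data in my runs)
    for _ in range(10):
        model(torch.randn(1, 3, 224, 224))
convert(model, inplace=True)              # swap in quantized modules
```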

I saw in the comments that “reduce_range reduces the range of the quantized data type by 1 bit. This is sometimes required to avoid instruction overflow”.
How can I tell if this is the case I’m running into?
Is there a rule of thumb for when reducing the range by 1 bit might help?
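
For context, my current understanding is that the flag only shrinks the integer range the observer targets (for qint8, from [-128, 127] to [-64, 63]), so the computed scale roughly doubles. A quick check along those lines, with a stand-in weight tensor:

```python
import torch
from torch.quantization import PerChannelMinMaxObserver

w = torch.randn(8, 16)  # stand-in weight tensor, 8 output channels

for rr in (False, True):
    obs = PerChannelMinMaxObserver(dtype=torch.qint8, reduce_range=rr)
    obs(w)  # observe per-channel min/max
    scale, zero_point = obs.calculate_qparams()
    print(f"reduce_range={rr}: scale of channel 0 = {scale[0].item():.6f}")

# The scale roughly doubles with reduce_range=True, since the same float
# range is mapped onto half as many integer levels.
```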

Thanks!