Select the right observers in QAT

xjtuljy · January 11, 2024, 10:27pm

Hi, I’m playing with different observers, by defining weight and activation observers in torchvision ImageNet classification (link) using:
torch.ao.quantization.QConfig()
I found that changing from MovingAverageMinMaxObserver to other observers such as MinMaxObserver leads to a performance drop from the QAT FP model to the quantized INT8 model.

Is there any guidance regarding observer selection for QAT in practice?

HDCharles · January 12, 2024, 6:02pm

QAT is a training process, so it’s better to have an activation observer that changes as the model changes and doesn’t have infinite memory. The moving average observers have this property; other observers do not.

If your weights shift such that your activations originally ranged from -1 to 1 and now range from -0.1 to 0.1, your MinMaxObserver will still try to cover values from -1 to 1, which is not what you want.
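A minimal sketch of the shrinking-range scenario above, assuming the default per-tensor quint8 settings of both observers. The tensors `wide` and `narrow` and the step count are made up for illustration; they stand in for activations before and after the weights shift.

```python
import torch
from torch.ao.quantization.observer import (
    MinMaxObserver,
    MovingAverageMinMaxObserver,
)

# Illustrative activations: an early batch spanning [-1, 1] and later
# batches spanning only [-0.1, 0.1] (hypothetical data, not from a model).
wide = torch.linspace(-1.0, 1.0, 101)
narrow = torch.linspace(-0.1, 0.1, 101)

minmax = MinMaxObserver()               # running min/max over everything seen
moving = MovingAverageMinMaxObserver()  # moving average, averaging_constant=0.01

# Both observers see the wide batch once, then many narrow batches.
minmax(wide)
moving(wide)
for _ in range(1000):
    minmax(narrow)
    moving(narrow)

print("MinMaxObserver range:     ", minmax.min_val.item(), minmax.max_val.item())
print("MovingAverage range:      ", moving.min_val.item(), moving.max_val.item())

# The scale derived from each range shows how much resolution is wasted
# when the observer keeps covering the stale [-1, 1] range.
minmax_scale, _ = minmax.calculate_qparams()
moving_scale, _ = moving.calculate_qparams()
print("MinMaxObserver scale:     ", minmax_scale.item())
print("MovingAverage scale:      ", moving_scale.item())
```

The MinMaxObserver never forgets the first batch, so its scale stays roughly ten times coarser than the scale the moving average observer settles on.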

xjtuljy · January 16, 2024, 11:08pm

Thanks for the explanation! Can we then say that moving average observers are in general preferred over MinMaxObserver? I’ll read the source code to get a better understanding.

HDCharles · January 17, 2024, 5:49pm

Yeah, for activations while doing QAT you usually want a moving average observer. For weights you usually use a moving average observer with the constant set so it has no memory. For PTQ it’s usually HistogramObserver for activations and MinMaxObserver for weights.
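One way the recommendation above might be written as QConfigs. This is a sketch, not a recipe from the thread: it assumes "the constant" refers to the `averaging_constant` argument of `MovingAverageMinMaxObserver` (1.0 meaning each update replaces the old range entirely), and the names `qat_qconfig` and `ptq_qconfig`, the per-tensor symmetric weight scheme, and the 8-bit ranges are illustrative choices.

```python
import torch
from torch.ao.quantization import FakeQuantize, QConfig
from torch.ao.quantization.observer import (
    HistogramObserver,
    MinMaxObserver,
    MovingAverageMinMaxObserver,
)

# QAT: moving average for activations (0.01 is the library default),
# and a moving average with no memory for weights (assumed: constant = 1.0).
qat_qconfig = QConfig(
    activation=FakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver,
        quant_min=0,
        quant_max=255,
        dtype=torch.quint8,
        averaging_constant=0.01,
    ),
    weight=FakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver,
        quant_min=-128,
        quant_max=127,
        dtype=torch.qint8,
        qscheme=torch.per_tensor_symmetric,
        averaging_constant=1.0,  # range follows the current weights only
    ),
)

# PTQ: histogram observer for activations, min/max observer for weights.
ptq_qconfig = QConfig(
    activation=HistogramObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric
    ),
)

# Quick check that the weight fake-quant really has no memory:
weight_fq = qat_qconfig.weight()
weight_fq(torch.tensor([-1.0, 1.0]))   # earlier weights
weight_fq(torch.tensor([-0.1, 0.1]))   # weights after some training steps
print(weight_fq.activation_post_process.min_val.item(),
      weight_fq.activation_post_process.max_val.item())
```

Either QConfig can then be assigned to `model.qconfig` before calling the usual prepare step for QAT or PTQ.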

xjtuljy · January 19, 2024, 12:05am

I see. I have two follow-up questions:

1. Why can there be a gap between the QAT FP model and the quantized INT8 model? My understanding of QAT is that it applies constraints so that the FP model has the same weights as the INT8 model, so technically it should yield exactly the same (or very close) results as the quantized model, regardless of which observer is used. Is my understanding correct?
2. For “moving average observer with the constant set”, do you mean setting “quant_min” and “quant_max”, or “self.min_val” and “self.max_val”, to fixed values?

HDCharles · January 19, 2024, 4:54am

I’m not sure what you mean. There isn’t just one quantized model. Doing PTQ vs QAT yields different models, otherwise there’d be no point in doing QAT.

No, those constants don’t control the memory/momentum.
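For reference, a small sketch of the parameter that does appear to control the memory in `MovingAverageMinMaxObserver`: its `averaging_constant` argument. `quant_min` and `quant_max` only set the integer range of the quantized type, and `min_val` / `max_val` are the running statistics the observer updates itself. The helper `range_after_shift` and the two toy batches are hypothetical, chosen only to make the effect visible.

```python
import torch
from torch.ao.quantization.observer import MovingAverageMinMaxObserver


def range_after_shift(averaging_constant):
    """Observe a wide batch, then one narrow batch, and return the range."""
    obs = MovingAverageMinMaxObserver(averaging_constant=averaging_constant)
    obs(torch.tensor([-1.0, 1.0]))   # first batch initialises min_val/max_val
    obs(torch.tensor([-0.1, 0.1]))   # second batch is ten times narrower
    return obs.min_val.item(), obs.max_val.item()


# Each update moves the stored range toward the new batch by a fraction
# equal to averaging_constant: new = old + c * (batch - old).
for c in (0.01, 0.5, 1.0):
    lo, hi = range_after_shift(c)
    print(f"averaging_constant={c}: min_val={lo:.4f}, max_val={hi:.4f}")
```

With a constant of 1.0 the observer keeps only the latest batch, which matches the "no memory" setting suggested for weights; smaller values make it forget more slowly.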