How does gradient calculation work in quantization-aware training? Straight-through estimator?

Hi all,

I am confused about how the gradient is calculated in QAT.
In my understanding, the 8-bit QAT process is mainly:
weight fp32 -> weight fake-quant 8-bit -> activation -> activation fake-quant 8-bit.
Observers are then attached to record the min/max values, which are used to calculate the scales & zero points of the weights/activation outputs.
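To make sure I understand the observer step, here is a minimal sketch of how a MinMax-style observer could turn observed min/max values into affine (asymmetric) quantization parameters for an unsigned 8-bit range. The helper name `qparams_from_minmax` is my own, not a PyTorch API:

```python
# Hypothetical helper: derive scale & zero point from observed min/max,
# in the spirit of a MinMax observer for quint8 (0..255).
def qparams_from_minmax(xmin, xmax, qmin=0, qmax=255):
    # The representable range must include 0.0 so that real zero maps exactly
    xmin = min(xmin, 0.0)
    xmax = max(xmax, 0.0)
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    zero_point = max(qmin, min(qmax, zero_point))  # clamp into the int range
    return scale, zero_point

scale, zp = qparams_from_minmax(-1.0, 1.0)
# scale = 2/255 ≈ 0.00784, zp = 128
```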

So, is the fake-quant of weights/activations involved in the backward pass? If so, how is it calculated?
The most common method I have found in my survey is the Straight-Through Estimator (STE); is that right?
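For context, here is a minimal sketch of what I mean by an STE fake-quant op, written as a custom `torch.autograd.Function`: the forward does quantize-dequantize (where `round` kills the gradient), and the backward pretends the op was the identity. The class name `FakeQuantSTE` and the scale/zero-point values are placeholders I made up; I believe PyTorch's built-in fake-quant additionally zeroes the gradient where the input falls outside [qmin, qmax], which this plain version does not do:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake-quantize in forward; pass the gradient straight through in backward."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin=-128, qmax=127):
        # quantize-dequantize ("fake quant"); round() is non-differentiable
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round() as identity, so the gradient flows unchanged
        return grad_output, None, None, None, None

x = torch.tensor([0.04, -0.26, 1.57, 3.2], requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.1), torch.tensor(0))
y.sum().backward()
# y is the rounded values [0.0, -0.3, 1.6, 3.2]; x.grad is all ones under plain STE
```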

But some answers suggest that it doesn't participate in the backward calculation:
Can we calculate gradients for quantized models? - #4 by HDCharles?
Register_backward_hook in quantized model - #2 by HDCharles?

Please correct me if I have the wrong idea. I want to implement an STE method in QAT, so I want to know whether PyTorch does this automatically or not. Any suggestion is welcome. Thanks in advance!