The official PyTorch documentation mentions that “the weights are quantized ahead of time but the activations are dynamically quantized during inference”. It also offers code for the simplest implementation.
However, when I tried to convert a floating-point model to an int8 model, the weights were quantized ahead of time, while the activations of intermediate layers (e.g. a Linear layer) still seem to be floating point.
Is there a discrepancy here, or am I misunderstanding the documentation?
For example, the quantized FC layer seems to just do this at inference:
input_fp32 × FC_params_int8 → output_fp32. I can’t find where it quantizes the activations dynamically.
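For reference, here is a minimal version of what I tried (a sketch; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Dynamically quantize a single Linear layer and inspect the dtypes involved.
fp32_model = nn.Sequential(nn.Linear(4, 2))
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4)                # fp32 input
y = int8_model(x)                    # output comes back as fp32
print(int8_model[0].weight().dtype)  # torch.qint8 -> weights quantized ahead of time
print(y.dtype)                       # torch.float32 -> which is what confuses me
```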
Yes, activations are quantized dynamically: i.e., for every batch, the activations are quantized prior to the linear operation. This is done by calculating the dynamic range of the activations (min and max) and then quantizing the activations to 8 bits. This happens in C++ as part of the operator implementation itself, which is why you never see an int8 activation tensor at the Python level; you can see the details at:
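To make the idea concrete, the per-batch step roughly amounts to the following. This is a plain-Python sketch of affine (asymmetric) quantization to unsigned 8 bits under common conventions, not the actual C++ code:

```python
def dynamic_quantize(activations):
    """Quantize a batch of float activations to 8 bits using their
    dynamic range, computed fresh for every batch."""
    lo = min(min(activations), 0.0)  # ensure zero is exactly representable
    hi = max(max(activations), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # guard against an all-zero batch
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(x / scale) + zero_point)) for x in activations]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map the 8-bit values back to floats (within one scale step of the input)."""
    return [(qi - zero_point) * scale for qi in q]
```

Because `scale` and `zero_point` are recomputed from each batch’s own min/max, the quantization parameters track the activations at runtime; the weights, by contrast, are quantized once at conversion time.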