The official PyTorch documentation mentions that “the weights are quantized ahead of time but the activations are dynamically quantized during inference”. It also offers code for the simplest implementation.
However, when I tried to convert a floating-point model to an int8 model, the weights were quantized ahead of time, while the activations of intermediate layers (e.g. a Linear layer) still seem to be floating point.
Is there a discrepancy here, or am I misunderstanding the documentation?
For example, the quantized FC layer seems to just do this at inference:
input_fp32 × FC_params_int8 → output_fp32. I can’t find where it quantizes the activations dynamically.
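For reference, here is a minimal version of what I tried (a sketch; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Dynamically quantize a single Linear layer and inspect the dtypes involved.
fp32_model = nn.Sequential(nn.Linear(4, 2))
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4)                # fp32 input
y = int8_model(x)                    # output comes back as fp32
print(int8_model[0].weight().dtype)  # torch.qint8 -> weights quantized ahead of time
print(y.dtype)                       # torch.float32 -> which is what confuses me
```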
Yes, activations are quantized dynamically: i.e., for every batch, the activations are quantized prior to the linear operation. This is done by calculating the dynamic range of the activations (min and max) and then quantizing the activations to 8 bits. This happens in C++ as part of the operator implementation itself, which is why you never see an int8 activation tensor at the Python level; you can see the details at:
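To make the idea concrete, the per-batch step roughly amounts to the following. This is a plain-Python sketch of affine (asymmetric) quantization to unsigned 8 bits under common conventions, not the actual C++ code:

```python
def dynamic_quantize(activations):
    """Quantize a batch of float activations to 8 bits using their
    dynamic range, computed fresh for every batch."""
    lo = min(min(activations), 0.0)  # ensure zero is exactly representable
    hi = max(max(activations), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # guard against an all-zero batch
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(x / scale) + zero_point)) for x in activations]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map the 8-bit values back to floats (within one scale step of the input)."""
    return [(qi - zero_point) * scale for qi in q]
```

Because `scale` and `zero_point` are recomputed from each batch’s own min/max, the quantization parameters track the activations at runtime; the weights, by contrast, are quantized once at conversion time.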