Quantizing Transformer Architecture Below 8-bit (post training quantization)

pkadambi · August 5, 2020, 7:14am

I’m trying to quantize BERT to 4 bits or mixed precision, and I don’t see any available method to do quantization aware training on BERT for any precision other than torch.uint8. This is given in the dynamic quantization tutorial.
I want to use both post training quantization and dynamic quantization for lower than 8 bits.

Will I have to rewrite the modeling_bert.py (transformers/modeling_bert.py) layers with fake quantization added? How can lower-than-8-bit precision and mixed precision be implemented on BERT?

tom · August 5, 2020, 10:17am

The difficulty there is that PyTorch inherently assumes elements are at least 1 byte wide when it handles memory.
I’d probably convert to TVM and see what can be done there.
(QAT with fake quantization probably could work for 4 bits, too.)

pkadambi · August 6, 2020, 4:04am

It’s not an issue even if the weights are stored as FP32 values in memory.
I’m trying to evaluate post training quantization or fine tune the model with quantization aware training, but do this all under fake quantization to any bit width of my choosing.
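
To make concrete what "fake quantization" means here, below is a minimal sketch in plain Python, not PyTorch's implementation. The values stay floats in memory but are snapped onto an integer grid [qmin, qmax] and mapped back. The function name `fake_quantize` and the min/max based choice of scale and zero point are assumptions made for illustration.

```python
def fake_quantize(values, qmin, qmax):
    """Quantize-dequantize a list of floats onto the integer grid [qmin, qmax].

    Hypothetical helper: the range is taken from the min/max of `values`.
    """
    # The float range must include 0.0 so that zero maps to an exact grid point.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to 1.0 for an all-zero input, which would otherwise give scale 0.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = min(max(qmin - round(lo / scale), qmin), qmax)

    out = []
    for v in values:
        q = round(v / scale) + zero_point      # float -> integer grid
        q = min(max(q, qmin), qmax)            # clamp to the allowed range
        out.append((q - zero_point) * scale)   # integer grid -> float
    return out, scale, zero_point


# 4-bit unsigned grid: 16 levels, from 0 to 15.
weights = [-1.0, -0.5, 0.0, 0.25, 0.5]
dequantized, scale, zero_point = fake_quantize(weights, qmin=0, qmax=15)
print(dequantized, scale, zero_point)
```

Changing `qmin` and `qmax` is all it takes to simulate another bit width, which is why storing the weights as FP32 is not a problem for this kind of evaluation.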

tom · August 11, 2020, 2:23am

While I don’t think it works out of the box, you could try to adapt the observers and fake quant layers to be more flexible. For example, there are some obvious 8 bit hard coded values here:

https://github.com/pytorch/pytorch/blob/a414bd69de8d01af44751bfe327703ec997dafd9/torch/quantization/observer.py#L146


    Learned Step Size Quantization: https://openreview.net/pdf?id=rkgO66VKDS
    Trained Quantization Thresholds: https://arxiv.org/pdf/1903.08066.pdf
    """
    # The variable names are prefixed with "initial" because their values (qmin and qmax) might be adjusted
    # based on whether quantization range is reduced and the datatype (signed/unsigned) used by the observer.
    initial_qmin, initial_qmax = initial_dynamic_qrange
    assert initial_qmin <= 0 <= initial_qmax, "Dynamic quantization range must include 0."
    assert initial_qmin < initial_qmax, "qmin must be strictly less than qmax for dynamic quantization range."

@torch.jit.export
def _calculate_qmin_qmax(self):
    # type: () -> Tuple[int, int]
    r"""Calculates actual qmin and qmax based on the quantization range,
    observer datatype and if range is reduced.
    """
    if self.is_dynamic_qrange:
        # This initialization here is to be resolve TorchScript compilation issues and allow
        # using of refinement to decouple initial_qmin and initial_qmax from quantization range.
        # The actual values of initial_qmin and initial_qmax will be reset below.
        initial_qmin, initial_qmax = 0, 255
        # The following assignment of initial_qrange to a local variable and the if check refine the
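
As a rough sketch of what making those hard-coded values flexible could look like, the helper below derives qmin and qmax from a bit width instead of assuming 8 bits. It is plain Python and not part of PyTorch: the name `quant_range` and its parameters are hypothetical, and treating a reduced range as "one bit fewer" is an assumption chosen to match the usual 8-bit conventions (0..127 unsigned, -64..63 signed).

```python
def quant_range(num_bits, signed=False, reduce_range=False):
    """Return (qmin, qmax) for an integer grid of the given bit width.

    Hypothetical helper, not a PyTorch API.
    """
    # Assumption: a reduced range simply gives up one bit of resolution.
    bits = num_bits - 1 if reduce_range else num_bits
    if bits < 1:
        raise ValueError("need at least 1 usable bit")

    if signed:
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bits - 1

    # Same sanity check as in the excerpt above: the range must include 0.
    assert qmin <= 0 <= qmax
    return qmin, qmax


# Mixed precision can then be described as a bit width per layer.
# The layer names below are made up for illustration.
bits_per_layer = {"attention.query": 4, "attention.key": 4, "output.dense": 8}
ranges = {name: quant_range(bits, signed=True) for name, bits in bits_per_layer.items()}
print(ranges)
```

An observer or fake quant layer would then take such a range as a parameter rather than relying on 0 and 255.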

jerryzh168 · August 21, 2020, 10:54pm

We do have support for lower bit widths now, see https://github.com/pytorch/pytorch/blob/master/torch/quantization/observer.py#L185. One of our interns added this recently.