Quantization-aware training for GPT2


I’m trying to perform QAT on the GPT-2 model, but I’m a bit confused by the documentation regarding QuantStub.

  1. Where should I place the QuantStub and DeQuantStub? My understanding is that the first QuantStub should go after the embedding layer, and a DeQuantStub should go after the ReLU activation of the FFN; the next QuantStub then follows that DeQuantStub, just before the second linear layer of the FFN of the previous decoder layer. Is that correct?

  2. Am I right that, in each decoder layer, the only fusion I can do is the first linear layer of the FFN with the ReLU activation?

Thanks in advance!


I am also having problems using quantization-aware training with GPT-2. Did you find a solution? Could you share it with me? Thank you.

  1. I can explain how to place QuantStub/DeQuantStub in general.

QuantStub should be placed at the point where you want to quantize a floating-point Tensor into a quantized Tensor; the module following the QuantStub is also expected to be quantized (i.e., swapped to an int8 quantized module).

DeQuantStub should be placed at the point where you want to dequantize an int8 quantized Tensor back to a floating-point Tensor.


import torch
from torch.quantization import QuantStub, DeQuantStub

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        # original input assumed to be fp32
        x = self.quant(x)
        # after quant, x is an int8 quantized Tensor
        x = self.conv(x)
        # we also need to quantize the conv module to an int8 quantized conv module,
        # which takes an int8 Tensor as input and outputs an int8 quantized Tensor
        x = self.dequant(x)
        # dequant turns the int8 quantized Tensor back into an fp32 Tensor
        return x
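To make the stub placement concrete end to end, here is a minimal eager-mode QAT workflow sketch around a module like the one above. The training loop is only a stand-in (random inputs, no loss), and names like `model` and `quantized` are illustrative:

```python
import torch
from torch.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)     # fp32 -> (fake-)quantized
        x = self.conv(x)
        x = self.dequant(x)   # back to fp32
        return x

model = M().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
prepare_qat(model, inplace=True)        # inserts fake-quant/observer modules

# stand-in for a few QAT fine-tuning steps
for _ in range(3):
    model(torch.randn(1, 3, 8, 8))

model.eval()
quantized = convert(model)              # swaps modules for int8 versions
out = quantized(torch.randn(1, 3, 8, 8))
```

After `convert`, `quantized.conv` is an int8 quantized conv, while the overall module still accepts and returns fp32 tensors thanks to the stubs.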

Please post the exact model if you need more specific help on the model.

  1. Can you include the actual model implementation? In the meantime, you can find all supported fusion patterns here: pytorch/fuser_method_mappings.py at master · pytorch/pytorch · GitHub
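Regarding the FFN fusion question above: Linear + ReLU is one of the supported patterns, while a bare Linear with no following activation has nothing to fuse with. A minimal sketch with `torch.quantization.fuse_modules` (the `FFN` class and its submodule names are illustrative, not GPT-2 code):

```python
import torch
from torch.quantization import fuse_modules

class FFN(torch.nn.Module):
    # toy stand-in for one decoder FFN block
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(16, 64)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(64, 16)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

ffn = FFN().eval()
# fuse only linear1 + relu; linear2 has no following activation to fuse with
fused = fuse_modules(ffn, [["linear1", "relu"]])
print(type(fused.linear1).__name__)  # LinearReLU
```

Fusion replaces `linear1` with a combined `LinearReLU` module and turns `relu` into an `Identity`, without changing the numerical output of the float model.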

Hi Jerry,

Thanks for the explanation.

  1. In your example, the input is quantized from fp32 to int8 by the QuantStub module, but what about the weights in a layer (linear or conv, for example)? From your example it seems we don’t need to quantize the weights ourselves?

  2. What about the outputs of previous layers, for example the output of a preceding linear or activation layer? I understand the result of that computation would be fp32, so do we need to put a QuantStub between two layers?

  1. For weights: they are quantized when we swap the floating-point conv for a quantized conv, which is triggered by attaching a qconfig to the conv module instance.
  2. There might still be a misunderstanding here: a quantized module takes an int8 Tensor as input and outputs an int8 Tensor as well, so if the previous linear/activation layer is quantized (meaning we attached an int8 qconfig to that layer), we do not need to put a QuantStub between the two layers.
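Both points can be seen in a small post-training example: one quant/dequant pair wraps two consecutive linear layers, a qconfig is attached once, and after `convert` the weights come out int8 with no stub between the layers (the `TwoLinears` class and its names are illustrative):

```python
import torch
from torch.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert
)

class TwoLinears(torch.nn.Module):
    # two consecutive layers with a single quant/dequant pair around them
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc1 = torch.nn.Linear(8, 8)
        self.fc2 = torch.nn.Linear(8, 8)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)    # fp32 -> int8
        x = self.fc1(x)      # int8 in, int8 out (after convert)
        x = self.fc2(x)      # no QuantStub needed in between
        return self.dequant(x)

m = TwoLinears().eval()
m.qconfig = get_default_qconfig("fbgemm")  # qconfig attached to the whole model
prepare(m, inplace=True)
m(torch.randn(4, 8))                       # calibration pass
convert(m, inplace=True)

# weights were quantized during convert: the swapped module stores int8 weight
print(m.fc1.weight().dtype)  # torch.qint8
```

The same idea carries over to QAT; only the prepare step (`prepare_qat` on a training-mode model) differs.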