Post-Training Quantization to Custom Bitwidth

Hi, I need to do post-training quantization of a ResNet-18 model to custom bitwidths. I would like to be able to post-training quantize both weights and activations to 7, 6, 5, 4, 3, and 2 bits so that I can evaluate how well different models (pre-trained with different losses) withstand aggressive quantization.

I managed quite easily to experiment with INT8 static quantization, but I can't seem to find a straightforward way of going below 8 bits. I found some previous entries such as Extending PyTorch with lower than 8-bit Quantization and How to Quantize CNN into 4-bits? - #2 by jerryzh168, but they're not very recent. Is there now an easy way to do this in PyTorch? Otherwise, do you know of code examples that I can check to get an idea of how to implement this? Thanks in advance.

Here is a GitHub repo for an ICLR paper that implements a new quantization scheme that goes down to 2 bits. I tried it and was able to run it right away.


Do you want to simulate the numerics only? I think you could use the FakeQuantize module and restrict quant_min/quant_max to the desired range.

I will have a look at this entry, thank you very much!

Yes, I'm interested in seeing how accuracy decreases with post-training quantization at different bitwidths and losses. Do you have a code example of how to use FakeQuantize? The documentation is not quite straightforward to understand. Thanks.

A more or less 'baked in' version of this is quantization-aware training (QAT), where the quantization library simulates the quantized operators using fake quants. This is generally used for training, but it would seem to work for your purposes.

See Quantization Aware Training section here: Quantization — PyTorch master documentation

You could then adapt that to suit your altered bitwidths, i.e. the tutorial recipe altered for int4 would be:

import torch
from torch.quantization import FakeQuantize, HistogramObserver, QConfig

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

model_fp32 = M()

model_fp32.train()

## int8 qconfig:
# model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

######### note the above qconfig is equivalent to: ######################
##act_fq=FakeQuantize.with_args(observer=MovingAverageMinMaxObserver, quant_min=0, quant_max=255,  dtype=torch.quint8, qscheme=torch.per_tensor_affine, reduce_range=True)
##...
## weight_fq=FakeQuantize.with_args(observer=MovingAverageMinMaxObserver, quant_min=-128, quant_max=127, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, reduce_range=False)
##model_fp32.qconfig = QConfig(activation=act_fq, weight=weight_fq)

# B is the target bitwidth (e.g. 7, 6, 5, 4, 3, or 2)
B = 4

##intB qconfig:
intB_act_fq=FakeQuantize.with_args(observer=HistogramObserver, quant_min=0, quant_max=2**B-1,  dtype=torch.quint8, qscheme=torch.per_tensor_affine, reduce_range=False)

# symmetric signed weight range, e.g. [-8, 7] for B=4
intB_weight_fq=FakeQuantize.with_args(observer=HistogramObserver, quant_min=-(2**(B-1)), quant_max=2**(B-1)-1, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric, reduce_range=False)


intB_qconfig=QConfig(activation=intB_act_fq, weight=intB_weight_fq)

model_fp32.qconfig=intB_qconfig

model_fp32_fused = torch.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

model_fp32_prepared = torch.quantization.prepare_qat(model_fp32_fused)

# calibrate the model (calibration_code() is a placeholder for your own calibration loop)
model_fp32_prepared.apply(torch.ao.quantization.disable_fake_quant)
calibration_code()

# re-enable fake_quant and freeze the observers so the qparams don't change during testing
model_fp32_prepared.apply(torch.ao.quantization.enable_fake_quant)
model_fp32_prepared.apply(torch.ao.quantization.disable_observer)

# test the intB numerics (test_code() is a placeholder for your own evaluation loop)
test_code()

Some additional thoughts:
If you are trying to mimic hardware (which is what the flow is generally geared towards), you need the argument reduce_range=True (for certain backends) for the activation part in order to deal with overflow issues in hardware; it simply removes one bit to avoid overflow, so the activations are effectively ~quint7 while the weights stay qint8. If you are just simulating numerics, though, that isn't needed, and the example above sets reduce_range=False accordingly.
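For reference, a minimal sketch of what the reduce_range=True activation setting looks like (same arguments as the default int8 qconfig noted in the comments above; the variable name is illustrative):

import torch
from torch.quantization import FakeQuantize, MovingAverageMinMaxObserver

# Sketch only: an int8 activation fake-quant with reduce_range=True.
# The observer halves the effective range internally, so the activations
# behave roughly like quint7 even though the dtype is still torch.quint8.
act_fq_reduced = FakeQuantize.with_args(
    observer=MovingAverageMinMaxObserver,
    quant_min=0, quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
    reduce_range=True,
)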

In QAT, MovingAverageMinMaxObserver is used to gradually alter the range as the weights change, which is probably not something you're interested in. I replaced it with HistogramObserver, which is what you'd use in normal post-training quantization and is generally more accurate but slower; if that's a problem, you could replace it with MinMaxObserver.
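For example, swapping in MinMaxObserver for the activation part would look like this (a sketch reusing the names from the recipe above):

import torch
from torch.quantization import FakeQuantize, MinMaxObserver

B = 4  # target bitwidth, as in the recipe above

# Same intB activation config as above, but with the faster (and usually
# slightly less accurate) MinMaxObserver instead of HistogramObserver.
intB_act_fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    quant_min=0, quant_max=2 ** B - 1,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
    reduce_range=False,
)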

The ‘disable_fake_quant’ in the calibration code is there because normal quantization calibration doesn’t actually simulate the quantization numerics with fake_quant during calibration.

Let me know if you have any questions


This is of great help, thank you very much, really appreciated.

@HDCharles and @jerryzh168, I have a follow-up query to this. I am working on simulating a model on hardware using PyTorch and trying to understand what happens at the level of a single convolution with post-training static quantization. For example, consider a kernel tensor (height, width, depth = 3, 3, 3) whose individual elements (K1, K2, …, K9) are convolved with input tensor elements (I1, I2, …, I9); the result of the convolution is R = K1*I1 + K2*I2 + K3*I3 + … + K9*I9. For a model statically quantized to int8:
Queries:
1: Let's assume K1 = 255 and I1 = 255; then the product would be greater than what int8 can hold. What would happen in this case?
2: Similarly, the sum K1*I1 + K2*I2 + … + K9*I9 can also exceed the int8 range. @HDCharles, I guess you have answered this, but if we fail to select 'reduce_range=True', what would happen to the result?

Maybe you can take a look at pytorch/Conv.cpp at master · pytorch/pytorch · GitHub to understand how we do int8 convolution numerically. The internal accumulation of these ops is typically int32 or float, and then we do a requantization at the end to requantize the output to int8.
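To make the numerics concrete, here is a rough sketch (not the actual fbgemm/qnnpack kernel) of int8 convolution with int32 accumulation followed by requantization; the scales and zero_point below are made-up example values, and the input/weight zero_points are taken as 0 for simplicity:

import torch
import torch.nn.functional as F

# Sketch only: int8-range values held in wider tensors, accumulated in int32,
# then requantized to int8 at the end.
x_int = torch.randint(-128, 128, (1, 3, 8, 8))   # int8-range input values
w_int = torch.randint(-128, 128, (4, 3, 3, 3))   # int8-range weight values

# Each product (up to 127*127) and the per-output sum overflow int8,
# but fit easily in the 32-bit accumulator.
acc_int32 = F.conv2d(x_int.float(), w_int.float()).to(torch.int32)

# Requantize: out = round(acc * (s_x * s_w / s_out)) + z_out, clamped to int8.
s_x, s_w, s_out, z_out = 0.1, 0.05, 0.2, 0
out_int8 = torch.clamp(
    torch.round(acc_int32 * (s_x * s_w / s_out)) + z_out, -128, 127
).to(torch.int8)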


Thank you @jerryzh168, it helped me understand the operations. Does sub-int8 quantization need to be done as @HDCharles mentioned, or is there a way of using this (https://github.com/pytorch/pytorch/pull/33743)?

What @HDCharles mentioned makes sense, I think. That is just simulating the numerics for sub-int8 models; to see the perf gains you'll also need sub-int8 ops implemented for the hardware.


@jerryzh168, how is bias quantization done? I tried printing the int8 weights with int_repr(), which shows integers. However, for the bias it raises NotImplementedError: Could not run 'aten::int_repr' with arguments from the 'CPU' backend. If I print the bias values, they are of type float.

Bias quantization happens dynamically in the kernel right now, so in the quantized module the bias is a floating point Tensor.


Thanks @jerryzh168, can you please point me to information about how bias quantization can be done manually?

I am trying to simulate hardware which supports only integers, so I need to convert the bias into int as well. I need it to be consistent with the PyTorch implementation so that I can measure the degradation in accuracy.

If you want to have separate quantization parameters (scale/zero_point) for the bias, I think you can write a qat module similar to this: pytorch/conv.py at master · pytorch/pytorch · GitHub that has a bias_fake_quant, and you can refer to
rfcs/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md at master · pytorch/rfcs · GitHub and pytorch/native.py at master · pytorch/pytorch · GitHub to extend/modify the backend_config_dict configuration and integrate this into the FX graph mode quantization flow.

If you want to reuse the quantization parameters of the input and weight, that is only possible in FX Graph Mode Quantization, and we haven't supported this kind of extension in our framework yet.
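For the bias values themselves, here is a hedged sketch of the convention from the quantization whitepaper (bias quantized to int32 with scale = input_scale * weight_scale and zero_point = 0, so it can be added directly into the int32 accumulator); whether this matches the exact kernel behavior on your backend should be verified:

import torch

# Sketch of the whitepaper convention: bias_scale = input_scale * weight_scale,
# zero_point = 0, stored as int32.
def quantize_bias(bias_fp32, input_scale, weight_scale):
    bias_scale = input_scale * weight_scale
    q = torch.round(bias_fp32 / bias_scale)
    q = torch.clamp(q, torch.iinfo(torch.int32).min, torch.iinfo(torch.int32).max)
    return q.to(torch.int32), bias_scale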


Thanks a lot, @jerryzh168, for the detailed response. So, from your response I understand that getting the quantized bias and input for post-training static quantization is not possible?

Does PyTorch have a reference paper explaining its quantization techniques, like (https://arxiv.org/pdf/1806.08342.pdf), so that I can try to work them out?
I have developed my code with post-training static quantization, so I am trying to avoid QAT and FX Graph Mode Quantization unless it's the only way.

Are you trying to (1) use the scale from the input and weight of the module, or (2) calculate an independent scale/zero_point for the bias? I believe (1) is not possible with eager mode quantization; (2) is possible if you implement a different qat module to simulate the numerics. For post-training quantization we don't have kernels for it, and I'm not sure how you are planning to use it. Can you describe the whole flow, either with code or pseudocode?

Raghu was the TL for the PyTorch Quantization project; we mostly followed his paper (https://arxiv.org/pdf/1806.08342.pdf) while developing PyTorch Quantization.

Firstly @jerryzh168, I want to sincerely thank you for being so responsive and kind.

I cannot share the complete details of the implementation in the forum, as it relates to my research paper.

I am trying to simulate my photonic CNN inference accelerator, which can perform vector dot products in the optical domain using photonic integrated circuits (such as HolyLight: A Nanophotonic Accelerator for Deep Learning). However, such circuits only support limited bitwidths and integer types.

For my current accelerator design,

  • I need to compare the inference accuracy drop for CNN models while running on my accelerator.

To achieve this, I need custom Conv2d and Linear methods that are consistent with how the accelerator works, which I was able to develop using the unfold, fold, and matmul operations.
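(For context, the unfold + matmul idea can be sketched roughly like this; this is only an illustration, not the exact accelerator-accurate implementation, and dilation/groups are omitted:)

import torch
import torch.nn.functional as F

# Rough sketch of Conv2d via unfold + matmul.
def conv2d_unfold(x, weight, bias=None, stride=1, padding=0):
    n, c, h, w = x.shape
    out_c, _, kh, kw = weight.shape
    cols = F.unfold(x, (kh, kw), stride=stride, padding=padding)  # (N, C*kh*kw, L)
    out = weight.view(out_c, -1) @ cols                           # (N, out_c, L)
    if bias is not None:
        out = out + bias.view(1, -1, 1)
    oh = (h + 2 * padding - kh) // stride + 1
    ow = (w + 2 * padding - kw) // stride + 1
    return out.view(n, out_c, oh, ow)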

My flow is described in the steps below:

  1. Train a model at float precision for a dataset
  2. Quantize this model using post-training static quantization, note the accuracy (AccQuant)
  3. Get int8 weights and bias values for each layer from the quantized model
  4. Define the same model with my custom Conv2d and Linear methods (PhotoModel)
  5. Assign the weights and bias obtained from the quantized model
  6. Run inference with PhotoModel and note the accuracy drop
    The optical vector dot-product computation has some errors; by performing these experiments I want to quantify the accuracy drop due to these errors.

The input, weight, and bias values to the custom Conv2d and Linear methods need to be int8 to precisely simulate the accelerator. I am able to obtain weights with int8 representation from the quantized model. However, I am not sure how to get the int8 values of bias.

Please let me know if you have further questions.

Thanks,
Sairam

Still not sure if you are doing (1) or (2). Assuming it is (2), then I think the easiest way is probably to define a qat conv module and use the qat API instead. To get the quantized weight, for example, you can take the weight_fake_quant module, get the scale and zero_point from it, and then quantize the weight; similarly for the bias. The steps are:
1). Define a qat module that is similar to pytorch/conv.py at master · pytorch/pytorch · GitHub but has a bias_fake_quant, so it fake-quantizes the bias as well as the weight (a rough sketch follows below).
2). Change the qat entry for nn.Conv2d to use the new module: pytorch/quantization_mappings.py at master · pytorch/pytorch · GitHub
3). Prepare the model for QAT with the eager mode quantization API.
4). Turn off fake_quant with model.apply(torch.ao.quantization.disable_fake_quant); the observer is on by default, so no extra operation is needed for that.
5). Calibrate with example data.
6). Get the int8 weights and bias using weight_fake_quant and bias_fake_quant.
7+). The other steps are the same as you described.
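A rough sketch of step 1), assuming the module subclasses the existing torch.nn.qat.Conv2d (the bias_fake_quant attribute is illustrative rather than an existing PyTorch API, and _conv_forward's signature can differ between versions):

import torch.nn.qat as nnqat

# Sketch only: a QAT Conv2d that fake-quantizes the bias as well as the weight.
# Reusing qconfig.weight() for the bias is an assumption for illustration; a real
# implementation would likely want its own bias observer/fake-quant settings.
class Conv2dBiasFakeQuant(nnqat.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bias_fake_quant = self.qconfig.weight()

    def forward(self, input):
        bias = self.bias_fake_quant(self.bias) if self.bias is not None else None
        return self._conv_forward(input, self.weight_fake_quant(self.weight), bias)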


I have been trying to follow this code snippet and adapt it to a ResNet-20 to reproduce results at different bitwidths, without any luck: I always get a validation accuracy equal to the bitwidth B I'm using (8% for 8 bits, 4% for 4 bits, and so on). It's not clear at all to me where and how many times to put the self.quant and self.dequant lines in the ResNet definition, and also how to correctly fuse the model.

These are the code changes I made to the BasicBlock (in bold):

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1, option='A'):
        super(BasicBlock, self).__init__()
        **self.quant = torch.quantization.QuantStub()**
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != planes:
            if option == 'A':
                self.shortcut = LambdaLayer(lambda x:
                                            F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, planes // 4, planes // 4), "constant",
                                                  0))
            elif option == 'B':
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1, stride=stride, bias=False),
                    nn.BatchNorm2d(self.expansion * planes)
                )
        **self.dequant = torch.quantization.DeQuantStub()**

    def forward(self, x):
        **out = self.quant(x)**
        out = F.relu(self.bn1(self.conv1(out)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        **out = self.dequant(out)**
        return out

and to the ResNet module (in bold):

class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.in_planes = 16

        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(16)
        self.layer1 = self._make_layer(block, 16, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 32, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 64, num_blocks[2], stride=2)
        self.linear = nn.Linear(64, num_classes)

        self.apply(_weights_init)
        **self.dequant = torch.quantization.DeQuantStub()**

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion

        return nn.Sequential(*layers)

    def forward(self, x):
        **out = self.quant(x)**
        out = F.relu(self.bn1(self.conv1(out)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = F.avg_pool2d(out, out.size()[3])
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        **out = self.dequant(out)**
        return out

And this is how I fuse the model:

model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [["conv1", "bn1"]], inplace=True)
for module_name, module in model_fp32_fused.named_children():
    if "layer" in module_name:
        for basic_block_name, basic_block in module.named_children():
            torch.quantization.fuse_modules(
                basic_block, [["conv1", "bn1"], ["conv2", "bn2"]],
                inplace=True)

What am I doing wrong? I'm sorry if it's trivial, but from the documentation it's very difficult to understand how to implement this for more complex models. Thank you!