Question on skipping quantization on unsupported modules

Let’s say I have a module block where part of it is not currently supported for quantization, so I added a QuantStub and DeQuantStub as shown below.

class ExampleBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = Supported(x)        # quantizable part
        x = self.dequant(x)     # leave the quantized domain
        x = NotSupported(x)     # layer without quantized support
        x = self.quant(x)       # re-enter the quantized domain
        return x

I have two questions. First, I am using this module as a building block for my network, so it is called repeatedly. I learned on this forum that a separate QuantStub instance is needed at each place it is used. Does that mean I have to unroll the entire network wherever this module is used?

Secondly, do I have to set m.ExampleBlock.qconfig = None, and do the same for every place the block is used, to skip quantization on the NotSupported layers and functions?
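
For concreteness, here is a minimal sketch of what I mean, assuming eager-mode post-training static quantization (SomeNetwork and the layer1/layer2 attribute names are just illustrative):

import torch

m = SomeNetwork()  # hypothetical network built from ExampleBlock instances
m.qconfig = torch.quantization.get_default_qconfig("fbgemm")
# what I am asking about: disabling quantization on every block instance
m.layer1.qconfig = None
m.layer2.qconfig = None
m_prepared = torch.quantization.prepare(m)
# ... run calibration data through m_prepared ...
m_quantized = torch.quantization.convert(m_prepared)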

Let me know if I explained my questions well enough.

Best,
Hua

Could you please clarify what you mean by “unrolling the entire network” here?

> Secondly, do I have to set m.ExampleBlock.qconfig = None, and do the same for every place the block is used, to skip quantization on the NotSupported layers and functions?

I don’t think this is necessary (I believe you’ll get an exception at runtime if you try to quantize something that isn’t quantizable), if I understand your question correctly. Are you observing errors without this explicit specification?

Hi David:
Thanks for your reply; what I meant is the following. If I call this module, which has layers unsupported for quantization, as below, do I have to write separate classes ExampleBlock1(), ExampleBlock2(), etc., or can I use the single ExampleBlock() and have the QuantStub and DeQuantStub behave correctly in each instance? I am asking because I am quantizing a model right now; some of the layers are not supported, so I used dequant()/quant() to bypass them, but the result is very poor.

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = ExampleBlock()
        self.layer2 = ExampleBlock()

Hi Hua. I don’t think you need to wrap unsupported layers with dequant and quant stubs. I’ll confirm with the team. Did you have issues when you didn’t use the stubs?

Edit: Sorry, I think I was wrong here. You’ll get an error if you try to pass the quantized output of a supported layer as an argument to a non-quantized layer. Is that what you’re asking?
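
To illustrate, here is a minimal sketch of that failure mode (softplus is just an example of an op that, as far as I know, has no quantized kernel):

import torch

qx = torch.quantize_per_tensor(torch.randn(1, 4), scale=0.1, zero_point=0,
                               dtype=torch.quint8)
try:
    # applying a float-only op to a quantized tensor
    torch.nn.functional.softplus(qx)
except RuntimeError as e:
    # e.g. "Could not run 'aten::softplus' with arguments from the
    # 'QuantizedCPU' backend ..."
    print(e)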

Hi Hua,

With Eager mode, inserting quant/dequant stubs works for selective quantization. Can you clarify what you mean by “the result is very poor”? There are a few different ways to diagnose “poor performance” when using quantized models (see PyTorch Numeric Suite Tutorial — PyTorch Tutorials 1.10.1+cu102 documentation).
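
For example, a minimal weight comparison with the Numeric Suite might look like this (float_model and quantized_model are placeholders for your own models; compute_error follows the tutorial):

import torch
import torch.quantization._numeric_suite as ns

def compute_error(x, y):
    # signal-to-quantization-noise ratio in dB; lower means more degradation
    Ps = torch.norm(x)
    Pn = torch.norm(x - y)
    return 20 * torch.log10(Ps / Pn)

wt_compare = ns.compare_weights(float_model.state_dict(),
                                quantized_model.state_dict())
for key in wt_compare:
    print(key, compute_error(wt_compare[key]["float"],
                             wt_compare[key]["quantized"].dequantize()))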

Re: using separate instances of ExampleBlock: I think separate instances are necessary if you want them to have different weights.

I find FX mode easier for selective quantization. In your example, I’d use it like this:

import copy
import torch
import torch.nn as nn
from torch.quantization import quantize_fx

# skip quantization on the NotSupported and Linear modules
qconfig_dict = {
    "": torch.quantization.get_default_qconfig("fbgemm"),  # global config
    "object_type": [(NotSupported, None), (torch.nn.Linear, None)],
}

prepared = quantize_fx.prepare_fx(ExampleBlock(), qconfig_dict)
# ... calibrate by running representative data through `prepared` ...
quantized_block = quantize_fx.convert_fx(prepared)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # convert_fx returns a module instance, not a class; use a deep
        # copy if the two layers should have independent weights
        self.l1 = quantized_block
        self.l2 = copy.deepcopy(quantized_block)

This might be a helpful reference: Practical Quantization in PyTorch | PyTorch

Hi David:
Yes, that’s right. I added dequant and quant stubs before and after the unsupported layer to bypass quantization. I was able to quantize and save the model. But when I load the quantized model with torch.jit.load, I encountered the “Could not run on Quantized CPU” error, which is very confusing.
By the way, my block configuration is “Conv + ReLU + BatchNorm”, as in the Fuse_modules more sequence support thread. Since this configuration is not supported for fusion, I fused “Conv + ReLU” and bypassed quantization for BatchNorm. I wonder if I did anything wrong here.
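
For reference, the fusion call I used looks roughly like this (the "conv"/"relu" submodule names are placeholders for my block’s attributes):

import torch

# fuse only Conv + ReLU; BatchNorm is left unfused and bypassed
# with dequant/quant in forward()
fused_block = torch.quantization.fuse_modules(float_block, [["conv", "relu"]])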

Best,
Hua

Hi Suraj:
Thanks for your reply; I will try the Numeric Suite out.

> Re: using separate instances of ExampleBlock: I think separate instances are necessary if you want them to have different weights.

Could you please clarify this a bit more? Do I need to write ExampleBlock1, ExampleBlock2, etc., since they have quant and dequant stubs inside, or do I just need to write one ExampleBlock and instantiate it as in your code?

Best
Hua

I think you can get away with initializing multiple instances of the same ExampleBlock, as long as that serves your purpose. Each instance will be architecturally identical, but will have its own parameters and its own quant/dequant stubs. I don’t think you need to create separate classes for each, based on the example you’ve provided!
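
A quick way to convince yourself, assuming ExampleBlock contains at least one parameterized layer (a sketch, not a definitive check):

import torch

b1, b2 = ExampleBlock(), ExampleBlock()
for (n1, p1), (n2, p2) in zip(b1.named_parameters(), b2.named_parameters()):
    assert n1 == n2                      # same structure
    print(n1, torch.equal(p1, p2))       # usually False: independent weights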