Can post-training quantization be applied to Conv1d, PReLU & LayerNorm layers?

I have built a PyTorch model based on Conv1d. I have gone through quantization and implemented some examples as well, but all of those use Conv2d, BatchNorm and ReLU, whereas my model is built on Conv1d and PReLU. Is quantization valid for these layer types? When I ran quantization, only the layers included in the default mapping were quantized. Let me show you the layers for which quantization is valid (i.e. which are included in the mapping).
Please find the list of modules it supports (according to the source code I went through):
(Actual layer : quantized layer)
nn.Linear: nnq.Linear,
nn.ReLU: nnq.ReLU,
nn.ReLU6: nnq.ReLU6,
nn.Conv2d: nnq.Conv2d,
nn.Conv3d: nnq.Conv3d,
nn.BatchNorm2d: nnq.BatchNorm2d,
nn.BatchNorm3d: nnq.BatchNorm3d,
QuantStub: nnq.Quantize,
DeQuantStub: nnq.DeQuantize,

Wrapper Modules:

nnq.FloatFunctional: nnq.QFunctional,

Intrinsic modules:

nni.ConvReLU2d: nniq.ConvReLU2d,
nni.ConvReLU3d: nniq.ConvReLU3d,
nni.LinearReLU: nniq.LinearReLU,
nniqat.ConvReLU2d: nniq.ConvReLU2d,
nniqat.LinearReLU: nniq.LinearReLU,
nniqat.ConvBn2d: nnq.Conv2d,
nniqat.ConvBnReLU2d: nniq.ConvReLU2d,

QAT modules:

nnqat.Linear: nnq.Linear,
nnqat.Conv2d: nnq.Conv2d,

Does this mean that quantization can't be done on Conv1d and PReLU?

We are in the process of implementing the Conv1d module and the ConvReLU1d fused module. The PR is here: https://github.com/pytorch/pytorch/pull/38438. Feel free to try out your model with the changes in this PR; the quantization flow should convert it.
We don't currently support fusion with PReLU and LayerNorm, so they will have to be executed separately.

Hi @supriyar

Fusing is optional in quantization, if I'm not wrong. We need all of our modules to be quantized, i.e. each layer we implemented, so that the quantized tensors can pass through them. Will the quantized model only work if all the layers are quantized? Or do we need to dequantize the tensors before they pass through a non-quantized layer?

Thank you @supriyar

Hi @supriyar

Hey, you gave me the PR link above in your last comment for quantized Conv1d support. I decided to give it a try, but now the torch module is not importing: it shows an error that 'torch.version' does not exist. I copied 'version.py' from my earlier version, but torch still fails to import, this time with the error that 'from torch._C import default_generators' failed.

May I know whether the changes in that PR are applicable to the CPU version of torch or not?

Thanks @supriyar,
Aravind

Will the quantized model only work if all the layers are quantized? Or do we need to dequantize the tensors before they pass through a non-quantized layer?

You can insert QuantStub and DeQuantStub blocks around the code that can be quantized. Please see the (beta) Static Quantization with Eager Mode in PyTorch tutorial for an example of this.
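
For reference, here is a minimal sketch of that pattern (the module and its layers are illustrative, not taken from this thread): supported layers sit between a QuantStub and a DeQuantStub, and an unsupported layer such as PReLU is placed after the DeQuantStub so it runs on float tensors.

    import torch
    import torch.nn as nn
    from torch.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

    class PartiallyQuantizedNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()        # float -> quantized at runtime
            self.conv = nn.Conv2d(3, 8, 3)  # supported, swapped to nnq.Conv2d on convert
            self.relu = nn.ReLU()
            self.dequant = DeQuantStub()    # quantized -> float
            self.prelu = nn.PReLU()         # not supported for quantization, stays float

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.conv(x))
            x = self.dequant(x)
            return self.prelu(x)            # runs on a dequantized (float) tensor

    model = PartiallyQuantizedNet().eval()
    model.qconfig = get_default_qconfig("fbgemm")
    model.prelu.qconfig = None              # make sure PReLU is left untouched
    prepare(model, inplace=True)
    model(torch.randn(4, 3, 32, 32))        # calibration pass
    convert(model, inplace=True)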

May I know whether the changes in that PR are applicable to the CPU version of torch or not?

The change is applicable to CPU. To get these changes you can either build PyTorch from source or install the PyTorch nightly.

Hi @supriyar,

Thank you. I'm now able to work with the changes in the new PR; as you suggested, I tried the nightly build.

Thanks @supriyar,
Aravind.

Hi @supriyar

I have finished quantizing my model. I then tried to save it with 'torch.jit.save(torch.jit.script(model))', but it raises the error "aten::slice.t(t[] l, int start, int end=9223372036854775807, int step=1) -> (t[]): could not match type Tensor to List[t] in argument 'l': cannot match List[t] to Tensor", along with two more very similar errors. I googled this error and in some discussions I found that it relates to the slicing operator (:, :, :), saying that torch.jit.script does not support scripting a directly used slicing operator. Is that what the error actually points to? I also use the slicing operator in the middle of my model, and the error is thrown at exactly that line.

That is one issue. I then tried another way to save the quantized model, i.e. with state_dict(). I am able to save the model this way, but when I want to run inference I have to initialize the model with the parameters I saved earlier via state_dict(). Those parameters are now quantized, but my model is defined for float, so the error "copying from quantized Tensor to non-quantized Tensor is not allowed, please use dequantize to get a float Tensor from a quantized Tensor" pops up. So I can't save my model with jit, and if I save it with state_dict() I can't initialize my model for inference.

Can you suggest any alternative?

Thanks @supriyar
Aravind.

Could you give a small repro for the error with aten::slice not working with jit?

Regarding loading the quantized state_dict: you will first have to convert the model to a quantized model before loading the state dict. You can call the prepare and convert APIs to do this (no need to calibrate). This way the model's state dict will match the saved quantized state dict and you should be able to load it.
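
A minimal sketch of that flow, assuming a hypothetical MyModel class and file name (only the ordering matters: prepare and convert are called on a fresh float model before load_state_dict):

    import torch
    from torch.quantization import get_default_qconfig, prepare, convert

    model = MyModel().eval()                 # hypothetical: same architecture as when saving
    model.qconfig = get_default_qconfig("fbgemm")
    prepare(model, inplace=True)             # insert observers; no calibration needed here
    convert(model, inplace=True)             # swap modules for their quantized counterparts

    # The module hierarchy now matches the saved quantized state_dict.
    state_dict = torch.load("quantized_state.pth")   # hypothetical file name
    model.load_state_dict(state_dict)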


Hi @supriyar

Thanks for the suggestion.

I tried this and it worked.

Regarding the aten::slice error:

It works now. JIT apparently accepts only a literal integer in the slice, as in "w[:,:,:-16]", whereas I had initially written it as "w = w[:,:,:-2*2**3]". I tried that change and it worked.

One more thing I would like to mention: JIT is not able to find the attribute ".new_tensor", whereas ".clone().detach()" is recognized.

Thanks @supriyar,
Aravind

Hi @supriyar,

I have a doubt about operations on quantized tensors. Here I'm quoting the relevant code:

    z = self.dequant_tcnet(z)
    w = self.dequant_tcnet(w)
    v = self.dequant_tcnet(v)
    x = self.dequant_tcnet(x)
    z = z + w
    x = x + v
    x = self.quant_tcnet2(x)

Here, in order to perform the two operations z = z + w and x = x + v, I need to dequantize the tensors involved and then perform the additions. Can we perform these operations without dequantizing, i.e. directly on quantized tensors? If I run this without dequantizing, the error "RuntimeError: Could not run 'aten::add.Tensor' with arguments from the 'QuantizedCPU' backend. 'aten::add.Tensor' is only available for these backends: [CPU, MkldnnCPU, SparseCPU, Autograd, Profile]." interrupts execution.

Is there any alternative way to perform addition on two quantized tensors without dequantizing them?

Thanks @supriyar,
Aravind.

Hi @supriyar,

z = self.dequant_tcnet(z)
w = self.dequant_tcnet(w)
v = self.dequant_tcnet(v)
x = self.dequant_tcnet(x)
z = z + w
x = x + v
x = self.quant_tcnet2(x)

Is there any alternative way to perform addition on two quantized tensors without dequantizing them?

For this I tried QFunctional / FloatFunctional, but the output is not up to the mark with that approach, whereas placing quant/dequant stubs works well.

I have another concern. My float model takes 0.2 - 0.3 seconds (~300 ms) to process a single input, whereas after quantizing it to int8 the time taken increased from 0.2 - 0.3 s (float precision) to 0.4 - 0.5 s (int8).

Here are the exact float model block and quantized model block:

Float block (round 1):
    y = self.conv1x12(x)
    y = self.prelu2(y)
    y = self.norm2(y)
    w = self.depthwise_conv12(y)
    #w = w[:,:,:-2*2**2]
    w = w[:,:,:-8]
    y = self.depthwise_conv2(y)
    y = y[:,:,:-8]
    y = self.prelu22(y)
    y = self.norm22(y)
    v = self.pointwise_conv2(y)
    z = z + w
    x = x + v

This is the float model block. This block/computation is repeated 13 more times (14 blocks in total), and it takes 0.2 - 0.3 seconds.

Quantized block (round 1):

    y = self.conv1x12(x)
    y = self.prelu2(y)
    y = self.norm2(y)
    w = self.depthwise_conv12(y)
    w = w[:,:,:-8]
    y = self.depthwise_conv2(y)
    y = y[:,:,:-8]
    y = self.prelu22(y)
    y = self.norm22(y)
    v = self.pointwise_conv2(y)
    w = self.dequant_tcnet(w)
    z = self.dequant_tcnet(z)
    v = self.dequant_tcnet(v)
    x = self.dequant_tcnet(x)
    z = z + w
    x = x + v
    x = self.quant_tcnet3(x)

This is the quantized model block, where I placed quant/dequant stubs around those arithmetic operations and all remaining layers are quantized. This quantized model takes 0.4 - 0.5 seconds.

So after quantizing my model, the model size is reduced but the computation time is not. Could you tell me whether there is any flaw? I cross-checked and the output is also good, but the computation time has not gone down.

Thanks @supriyar,
Aravind.

Hi @supriyar @raghuramank100,

Referring to the issue mentioned above, I want to make it clear which layers I have quantized:

  1. Conv1d (from nightly)
  2. LayerNorm (from nightly)
  3. ReLU
  4. Linear

additional layers:

  1. QuantStub / DeQuantStub
  2. QFunctional / FloatFunctional

All these layers are quantized, and I fused ReLU and Conv1d as well (from the beginning I have been referring to the Static Quantization with Eager Mode in PyTorch tutorial). A fusion sketch is shown below.
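
A minimal sketch of the kind of fusion I mean (module and attribute names are illustrative; this assumes a build that already contains the Conv1d/ConvReLU1d support mentioned earlier):

    import torch
    import torch.nn as nn

    class TinyBlock(nn.Module):
        # Illustrative block: a Conv1d immediately followed by a ReLU.
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv1d(16, 16, kernel_size=3)
            self.relu1 = nn.ReLU()

        def forward(self, x):
            return self.relu1(self.conv1(x))

    block = TinyBlock().eval()
    # Replaces the conv1/relu1 pair with a single fused ConvReLU1d module,
    # which convert() later maps to quantized::conv1d_relu.
    torch.quantization.fuse_modules(block, [["conv1", "relu1"]], inplace=True)
    print(block)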

If I use FloatFunctional, I am not using Quant/DeQuantStubs in my model where arithmetic operations occur between quantized layers.

After quantizing successfully, my model's CPU computation time is still not reduced; instead, it has increased after quantization!

Could you tell me in which cases this might happen?

Thanks,
Aravind

For add you could use torch.nn.quantized.FloatFunctional; the extra dequant and quant ops in the network could be slowing things down.
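
For reference, a minimal sketch of that pattern (attribute names are illustrative): FloatFunctional is declared as a submodule in __init__ and its add method is called in forward; after convert it is swapped for QFunctional, which adds the two quantized tensors directly.

    import torch
    import torch.nn as nn

    class AddBlock(nn.Module):
        def __init__(self):
            super().__init__()
            # One FloatFunctional per arithmetic site, registered as a submodule
            # so that convert() can replace it with nnq.QFunctional.
            self.skip_add = nn.quantized.FloatFunctional()

        def forward(self, x, y):
            # Works on float tensors before conversion and on quantized
            # tensors afterwards, without explicit quant/dequant stubs.
            return self.skip_add.add(x, y)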

Regarding performance, you can try running torch.autograd.profiler on your model for some iterations to see which ops take up the most time. It will give you an op-level breakdown with runtimes so you can compare the float vs. quantized model.
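
A minimal sketch of such a comparison, assuming float_model, quantized_model and example_input already exist with compatible shapes:

    import torch

    def profile_model(model, example_input, iters=10):
        # Op-level CPU breakdown using the autograd profiler.
        with torch.autograd.profiler.profile() as prof:
            with torch.no_grad():
                for _ in range(iters):
                    model(example_input)
        print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))

    profile_model(float_model, example_input)      # assumed to be defined elsewhere
    profile_model(quantized_model, example_input)  # assumed to be defined elsewhere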

Hi @supriyar,

Yes, I replaced the quant/dequant stubs with FloatFunctional. I'll drop the code here:

    y = self.conv1x11(x)
    y = self.prelu1(y)
    y = self.norm1(y)
    w = self.depthwise_conv11(y)
    #w = w[:,:,:-2*2**1]
    w = w[:,:,:-4]
    y = self.depthwise_conv1(y)
    #y = y[:,:,:-2*2**1]
    y = y[:,:,:-4]
    y = self.prelu11(y)
    y = self.norm11(y)
    v = self.pointwise_conv1(y)
    #w = self.pointwise_conv_skp1(y)
    #z = self.dequant_tcnet(z)
    #w = self.dequant_tcnet(w)
    #v = self.dequant_tcnet(v)
    #x = self.dequant_tcnet(x)
    #z = z + w
    #x = x + v
    z = self.Qf_s.add(z,w)
    x = self.Qf_s.add(x,v)
    #x = self.quant_tcnet2(x)
    #z = self.quant(z)

As you can see, I have now removed the stubs and am using FloatFunctional, but the CPU usage is still not reduced.

Thanks @supriyar,
Aravind

Hi @supriyar @raghuramank100,

Regarding CPU usage, I profiled the model with torch.autograd.profiler as suggested.

Here is the output:

Usage of the float model:

    Name                 Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls
    slow_conv_dilated2d  25.34%            127.183ms       98.72%       495.549ms   42.081us      11776
    size                 15.79%            79.242ms        15.79%       79.242ms    0.463us       171097
    _cat                 9.76%             49.007ms        11.61%       58.285ms    2.534ms       23
    mkldnn_convolution   9.67%             48.534ms        19.48%       97.764ms    1.397ms       70
    threshold            6.45%             32.364ms        6.52%        32.726ms    1.091ms       30
    slice                5.26%             26.391ms        9.76%        49.012ms    2.065us       23737
    native_layer_norm    3.79%             19.017ms        7.74%        38.849ms    669.817us     58
    convolution          3.71%             18.632ms        86.18%       432.610ms   7.459ms       58
    empty                2.81%             14.117ms        2.82%        14.143ms    1.179us       11991
    select               2.65%             13.278ms        9.45%        47.425ms    4.027us       11778
    as_strided           2.52%             12.638ms        2.52%        12.638ms    0.530us       23837
    fill_                2.37%             11.903ms        2.38%        11.933ms    2.026us       5889
    add                  2.23%             11.208ms        4.53%        22.760ms    421.489us     54

Total time: 502 ms

CPU usage of the quantized model without fusing:

    Name                   Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls
    quantized::conv1d      81.39%            454.615ms       81.39%       454.615ms   7.838ms       58
    quantized::add         9.25%             51.657ms        9.25%        51.657ms    1.987ms       26
    quantized::layer_norm  7.62%             42.582ms        7.62%        42.582ms    1.468ms       29
    relu                   0.63%             3.513ms         1.31%        7.296ms     121.595us     60
    quantized::mul         0.59%             3.322ms         0.59%        3.322ms     3.322ms       1
    quantized::linear      0.18%             1.019ms         0.18%        1.019ms     1.019ms       1

Total time: 558 ms

CPU usage of the quantized model after fusing:

    Name                    Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls
    quantized::conv1d       41.77%            239.086ms       41.77%       239.086ms   7.970ms       30
    quantized::conv1d_relu  39.94%            228.633ms       39.94%       228.633ms   8.165ms       28
    quantized::add          9.35%             53.523ms        9.35%        53.523ms    2.059ms       26
    quantized::layer_norm   7.67%             43.932ms        7.67%        43.932ms    1.515ms       29
    quantized::mul          0.62%             3.564ms         0.62%        3.564ms     3.564ms       1
    index_add_              0.27%             1.542ms         0.54%        3.100ms     1.550ms       2
    quantized::linear       0.18%             1.017ms         0.18%        1.017ms     1.017ms       1
    relu                    0.06%             370.631us       0.13%        762.027us   190.507us     4

Total time: 572 ms

Looking at these three profiles, conv1d takes the most time after quantization:

    quantized::conv1d        239.086 ms
    quantized::conv1d_relu   228.633 ms
    slow_conv_dilated2d      127.183 ms (float model)

Does anyone have any idea why this happened, i.e. why quantized Conv1d increased the CPU time?

Thanks @supriyar @raghuramank100,
Aravind.

Hi @raghuramank100 @supriyar

Do you have any idea why this quantized Conv1d takes more time?

Thanks @raghuramank100 @supriyar,
Aravind

We are currently implementing the operator using quantized::conv2d under the hood after unsqueezing the activation and weight tensors. That might be proving sub-optimal for certain input shapes.

Could you give us details about the input and weight dimensions and the parameters (kernel, stride, pad, dilation) to conv1d in your case?
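
To illustrate the reshaping involved (a float-level sketch of the idea, not the actual quantized kernel): a 1d convolution over an (N, C, L) input can be expressed as a 2d convolution over (N, C, 1, L) with the kernel unsqueezed to height 1.

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 16, 128)    # (N, C_in, L)
    w = torch.randn(32, 16, 3)     # (C_out, C_in, kernel_size)

    out_1d = F.conv1d(x, w, stride=1, padding=0, dilation=2)

    # Same computation routed through conv2d by adding a dummy height dimension.
    out_2d = F.conv2d(x.unsqueeze(2), w.unsqueeze(2),
                      stride=(1, 1), padding=(0, 0), dilation=(1, 2)).squeeze(2)

    print(torch.allclose(out_1d, out_2d, atol=1e-4))   # True (up to float rounding)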

Hi @supriyar,

Thanks for the reply.

Here I'm attaching some of the layers in my model.

These are the layers in the float model:

Thanks @supriyar,
Aravind.