Slower ops in quantized::mul, quantized::cat

Hello,

During quantization, I noticed that quantized operations such as quantized::mul and quantized::cat are about 10x slower than the corresponding fp32 ops.

Is the only workaround to wrap those functions with dequant() and quant()? Please refer to the profiling below.

FP32 Profiling (CPU)
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
        aten::mkldnn_convolution        36.28%      14.237ms        37.56%      14.741ms       1.340ms       7.88 Mb           0 b            11    
        aten::upsample_nearest2d         6.71%       2.632ms         8.38%       3.288ms     469.671us      15.33 Mb      15.32 Mb             7   
                       aten::mul         6.10%       2.395ms         6.10%       2.395ms     342.071us      19.41 Mb      19.41 Mb             7  
                      aten::_cat         5.54%       2.175ms         6.69%       2.626ms     375.100us      19.41 Mb           0 b             7
                      aten::_cat         5.28%       2.074ms         6.45%       2.533ms     361.814us      19.41 Mb           0 b             7
        aten::upsample_nearest2d         4.50%       1.766ms         7.26%       2.851ms     407.229us      15.33 Mb      15.32 Mb             7

Quantized Profiling (CPU)
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------   
                   quantized::mul        20.94%      21.950ms        21.79%      22.839ms       3.263ms       4.85 Mb           0 b             7    
                   quantized::cat        18.42%      19.306ms        18.84%      19.741ms       2.820ms       4.85 Mb           0 b             7   
                   quantized::cat        17.84%      18.692ms        18.27%      19.143ms       2.735ms       4.85 Mb           0 b             7   
                quantized::conv2d         8.18%       8.576ms         8.86%       9.285ms       1.326ms       1.02 Mb      -4.08 Mb             7   
          quantized::batch_norm2d         4.65%       4.878ms         5.34%       5.598ms     933.033us     980.00 Kb      -7.75 Kb             6    
                quantized::conv2d         4.35%       4.561ms         4.90%       5.134ms     733.500us     981.00 Kb      -3.83 Mb             7    
                   quantized::mul         4.34%       4.553ms         5.13%       5.380ms     768.571us       1.02 Mb           0 b             7    
            quantized::leaky_relu         1.73%       1.810ms         2.13%       2.234ms     372.417us     980.00 Kb           0 b             6

Yeah, I think a lot of them (quantized::cat, for example) use dequant/quant internally to simulate the quantized operation, but for quantized::mul I remember we do have more efficient implementations in fbgemm/qnnpack. Which quantized engine are you using right now, and which platform did you run this on?
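To be concrete, the dequant/quant workaround you mention would look roughly like the sketch below. It assumes per-tensor affine quantized inputs and simply reuses the first input's scale/zero_point for the output, which is not necessarily what a calibrated model would pick; for quantized::cat this is essentially what the op already does internally, so I would not expect a big win there.

import torch

def cat_via_fp32(qx, qy, dim=1):
    # Dequantize, run the fp32 op, then requantize.
    # Reusing qx's qparams for the output is an arbitrary choice for this sketch.
    out = torch.cat([qx.dequantize(), qy.dequantize()], dim=dim)
    return torch.quantize_per_tensor(out, qx.q_scale(), qx.q_zero_point(), qx.dtype)

def mul_via_fp32(qx, qy):
    out = qx.dequantize() * qy.dequantize()
    return torch.quantize_per_tensor(out, qx.q_scale(), qx.q_zero_point(), qx.dtype)

x = torch.quantize_per_tensor(torch.randn(1, 8, 4, 4), scale=0.1, zero_point=0, dtype=torch.quint8)
y = torch.quantize_per_tensor(torch.randn(1, 8, 4, 4), scale=0.1, zero_point=0, dtype=torch.quint8)
print(cat_via_fp32(x, y).shape)  # torch.Size([1, 16, 4, 4])
print(mul_via_fp32(x, y).shape)  # torch.Size([1, 8, 4, 4])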

You can print the quantized engine with: print(torch.backends.quantized.engine)
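For example (supported_engines lists the backends your build was compiled with):

import torch

print(torch.backends.quantized.supported_engines)  # backends available in this build
print(torch.backends.quantized.engine)             # currently active backend

# Switching the active backend (only works if it is in supported_engines):
# torch.backends.quantized.engine = 'qnnpack'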

I am currently running this model in an x64 Windows 10 environment. My final goal is to trace the model with TorchScript and save it to a .pt file, so that I can load it in my C++ application, which will run on an Android device (Galaxy S10). Given that plan, if I set the quantization configuration to qnnpack, will the model perform better when it is deployed to the target device?

It is just a rough plan, so please share your thoughts from experience. I would really appreciate it, since I am quite new to this kind of deployment.
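For reference, this is roughly the flow I have in mind (just a sketch; MyModel and the input shape are placeholders, and I have not verified it end to end):

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = MyModel().eval()  # placeholder for my actual model

# Quantize with the qnnpack qconfig since the target is an ARM/Android device
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
prepared = torch.quantization.prepare(model)
# ... run calibration data through `prepared` here ...
quantized = torch.quantization.convert(prepared)

# (not sure yet whether I also need torch.backends.quantized.engine = 'qnnpack' here)

example = torch.randn(1, 3, 224, 224)   # placeholder input shape
traced = torch.jit.trace(quantized, example)
traced = optimize_for_mobile(traced)    # optional mobile-specific passes
traced.save('model_quantized.pt')       # loaded later from the C++/Android side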

On the Windows 10 machine (x64):

print(torch.backends.quantized.engine)
-------------------------------------------
fbgemm

It depends on which qconfig you are using when you quantize the model: are you using get_default_qconfig("fbgemm") or get_default_qconfig("qnnpack")?
If you are using the qnnpack qconfig, then you should only run with the qnnpack backend, because the fbgemm backend would have overflows.
If you are using the fbgemm qconfig, then you can run on both backends.
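In code, the pairing looks roughly like this (a sketch using the eager-mode API; the small Sequential is just a stand-in for your model):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()  # stand-in for the real model

# qnnpack qconfig -> run on the qnnpack backend
# (only possible if your build lists 'qnnpack' in supported_engines)
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
if 'qnnpack' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'qnnpack'

# fbgemm qconfig -> can run on either backend, so it is the safer choice
# if the same quantized model also needs to run on the x86 side:
# model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
# torch.backends.quantized.engine = 'fbgemm'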