Qnnpack vs. fbgemm

Hi!
I am trying to implement quantization in my model.
While applying post-training static quantization, an interesting detail came up:

quantized_model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
# torch.backends.quantized.engine = 'qnnpack' # gives error

works nearly perfectly according to the performance numbers. However, qnnpack is not available as an engine on my machine.
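You can check which engines your PyTorch build actually supports before setting one; a minimal sketch (the engine choice here is just an example):

```python
import torch

# List the quantized engines this build of PyTorch supports.
# On x86 builds this typically includes 'fbgemm'; qnnpack's fast
# kernels target ARM, so it may be absent or slow on x86.
print(torch.backends.quantized.supported_engines)

# Setting the engine only succeeds for values in that list,
# which would explain the error when assigning 'qnnpack'.
if 'fbgemm' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'fbgemm'
```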

Trying to use

quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

led to much worse performance numbers.

Also, in my opinion this should not work, but it performs very well:

quantized_model.qconfig = torch.quantization.get_default_qconfig('qnnpack') 
torch.backends.quantized.engine = 'fbgemm'

Is this a bug? Shouldn’t fbgemm outperform qnnpack on an x86 system?
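For context, the full post-training static quantization flow keeps the qconfig and the engine consistent. A minimal sketch (the tiny model and calibration data below are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy model for illustration only. QuantStub/DeQuantStub mark where
# tensors are quantized/dequantized in static quantization.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(8, 4)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc(x))
        return self.dequant(x)

model = TinyModel().eval()

# Keep engine and qconfig consistent -- both 'fbgemm' here.
torch.backends.quantized.engine = 'fbgemm'
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

prepared = torch.quantization.prepare(model)
prepared(torch.randn(16, 8))   # calibration pass with made-up data
quantized = torch.quantization.convert(prepared)

out = quantized(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 4])
```

Mixing a 'qnnpack' qconfig with the 'fbgemm' engine, as in the snippet above it, can still run because both produce valid quantized models, but the observer/quantization settings then don't match the kernels being used.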

Yes, that would be expected. Does your system have AVX and AVX2 capabilities? Those are needed for the fast paths of the fbgemm kernels.

Yes, sounds like it could be a bug. Would you be able to share the per-op profiling results for the model you are seeing this for, using https://pytorch.org/docs/stable/autograd.html#profiler, on both fbgemm and qnnpack on your machine? Qnnpack only has fast kernels on ARM; on x86 it takes the slow fallback path.
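Collecting the requested per-op numbers can be sketched like this (the model and input here are placeholders, not the poster's actual model):

```python
import torch

# Placeholder model and input; substitute your own quantized model.
model = torch.nn.Linear(8, 4).eval()
x = torch.randn(32, 8)

# Record per-op timings for one forward pass.
with torch.autograd.profiler.profile() as prof:
    model(x)

# Sort by self CPU time to see which kernels dominate per backend.
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```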

Profile for fbgemm for evaluation:

------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mul                                   64.88%           3.584s           65.20%           3.602s           13.341ms         270              
sum                                   13.79%           761.666ms        15.01%           829.509ms        1.097ms          756              
quantized::linear                     12.68%           700.596ms        12.68%           700.596ms        19.461ms         36               
_cat                                  3.06%            168.962ms        3.14%            173.683ms        6.433ms          27               
relu                                  1.32%            73.125ms         1.34%            73.805ms         2.734ms          27               
fill_                                 1.17%            64.873ms         1.17%            64.876ms         82.855us         783              
index_select                          0.73%            40.152ms         1.22%            67.359ms         95.953us         702              
copy_                                 0.42%            23.189ms         0.42%            23.197ms         44.438us         522              
empty                                 0.39%            21.815ms         0.39%            21.815ms         9.696us          2250             
quantize_per_tensor                   0.38%            20.759ms         0.38%            20.771ms         2.308ms          9                
cat                                   0.16%            9.051ms          3.31%            182.734ms        6.768ms          27               
embedding                             0.15%            8.441ms          2.80%            154.721ms        110.200us        1404  
...

Metrics:

Size (MB): 3.466263
Loss: 1.093 (not good)
Acc: 0.622
Elapsed time (seconds): 7.084
Avg execution time per forward (ms): 0.00363

Profile for qnnpack for evaluation:

------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mul                                   66.18%           3.379s           66.49%           3.395s           12.573ms         270              
sum                                   12.98%           662.933ms        14.21%           725.287ms        959.374us        756              
quantized::linear                     12.45%           635.799ms        12.45%           635.799ms        17.661ms         36               
_cat                                  3.14%            160.059ms        3.23%            164.724ms        6.101ms          27               
relu                                  1.33%            67.692ms         1.34%            68.278ms         2.529ms          27               
fill_                                 1.17%            59.914ms         1.17%            59.917ms         76.522us         783              
index_select                          0.68%            34.661ms         1.11%            56.808ms         80.923us         702              
empty                                 0.38%            19.191ms         0.38%            19.191ms         8.529us          2250             
quantize_per_tensor                   0.37%            18.920ms         0.37%            18.930ms         2.103ms          9                
copy_                                 0.35%            17.947ms         0.35%            17.954ms         34.394us         522              
embedding                             0.14%            7.034ms          2.52%            128.492ms        91.519us         1404             
...

Metrics:

Size (MB): 3.443591
Loss: 0.580 (very good)
Acc: 0.720
Elapsed time (seconds): 6.978
Avg execution time per forward (ms): 0.00427

Hmm, one hypothesis that would fit this data is that fbgemm is not enabled, and both fbgemm and qnnpack are taking the fallback paths.

cc @dskhudia , any tips?

@Vasiliy_Kuznetsov What does such a fallback path look like? What happens in that case?

@pintonos By performance do you mean the loss? It should be the same (or close enough) with both. I see similar execution times for quantized::linear with both fbgemm and qnnpack.

@dskhudia Yes, I mean the loss. I tried running it on a bigger dataset and it seems to work now…
