QNNPACK way slower than FP32, especially the add operation

This has come up here before, but I feel it was not addressed properly.

I’m facing this issue with FX graph mode quantization on timm's efficientnet_b3. I get a 10x speedup with the regular backend (fbgemm) but a 4x slowdown with qnnpack (torch.backends.quantized.engine = 'qnnpack'). Here are the top 10 time eaters (a repro sketch follows the table):

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  model_inference         1.91%       9.592ms       100.00%     503.021ms     503.021ms             1  
                   quantized::add        41.99%     211.224ms        42.02%     211.352ms      11.124ms            19  
                quantized::conv2d        20.23%     101.774ms        20.45%     102.887ms     791.436us           130  
                   quantized::mul        11.50%      57.855ms        11.56%      58.131ms       2.236ms            26  
                    aten::sigmoid        11.39%      57.277ms        11.40%      57.321ms       2.205ms            26  
                 aten::dequantize        11.26%      56.655ms        11.29%      56.795ms     530.796us           107  
                      aten::silu_         0.05%     238.570us         0.72%       3.618ms      46.382us            78  
                       aten::silu         0.67%       3.379ms         0.67%       3.379ms      43.323us            78  
        aten::quantize_per_tensor         0.34%       1.694ms         0.34%       1.694ms      20.918us            81  
                       aten::mean         0.14%     719.232us         0.16%     803.762us      29.769us            27  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
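
For context, here is a minimal repro sketch of how numbers like these can be produced. It assumes a recent PyTorch (1.13+, where prepare_fx takes example_inputs) and timm; the 300x300 input and single-batch calibration are illustrative, not my exact script:

```python
import torch
import timm
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.profiler import profile, record_function, ProfilerActivity

torch.backends.quantized.engine = 'qnnpack'

model = timm.create_model('efficientnet_b3', pretrained=True).eval()
example_inputs = (torch.randn(1, 3, 300, 300),)

# FX graph mode post-training quantization with the qnnpack qconfig
qconfig_mapping = get_default_qconfig_mapping('qnnpack')
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)  # calibration; use real data in practice
quantized = convert_fx(prepared)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function('model_inference'):
        with torch.no_grad():
            quantized(*example_inputs)

print(prof.key_averages().table(sort_by='cpu_time_total', row_limit=10))
```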

It’s sad that so much time is taken by quantized::add and quantized::mul: the average time for a single add is 11 ms! The convs take about as long as in the FP32 version.

I also tried this with MobileNet V2 and got a 20x slowdown, from 41 ms to 806 ms.
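
A hedged sketch of that FP32-vs-int8 wall-clock comparison; the torchvision quantized MobileNet V2 and the warm-up/iteration counts are illustrative stand-ins, not my exact setup:

```python
import time
import torch
import torchvision

def bench_ms(model, x, iters=50):
    """Average per-call latency in milliseconds."""
    with torch.no_grad():
        for _ in range(5):  # warm-up so one-time init doesn't skew timings
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters * 1e3

x = torch.randn(1, 3, 224, 224)

fp32 = torchvision.models.mobilenet_v2(pretrained=True).eval()
print(f'fp32:    {bench_ms(fp32, x):.1f} ms')

torch.backends.quantized.engine = 'qnnpack'
int8 = torchvision.models.quantization.mobilenet_v2(
    pretrained=True, quantize=True).eval()
print(f'qnnpack: {bench_ms(int8, x):.1f} ms')
```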

Is this on mobile or on a server? I know there is some work being done on the speed of add and mul here: AVX512 and Vec512 · Issue #56187 · pytorch/pytorch · GitHub

If it's on mobile, it may be necessary to involve the mobile team as well.
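
Either way, it's worth confirming which engines the local build actually supports and which one is selected before profiling:

```python
import torch

# Engines compiled into this build, e.g. ['none', 'fbgemm', 'qnnpack']
print(torch.backends.quantized.supported_engines)
# The engine currently used for quantized ops
print(torch.backends.quantized.engine)
```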