This has already come up here, but I feel it was not addressed properly. I’m facing this issue with FX graph mode quantization of timm's `efficientnet_b3`: I get a 10x speedup with the default (fbgemm) backend but a 4x slowdown with qnnpack (`torch.backends.quantized.engine = 'qnnpack'`). Here are the top 10 time eaters, with a sketch of my setup right after the table:
```
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         1.91%       9.592ms       100.00%     503.021ms     503.021ms             1
                   quantized::add        41.99%     211.224ms        42.02%     211.352ms      11.124ms            19
                quantized::conv2d        20.23%     101.774ms        20.45%     102.887ms     791.436us           130
                   quantized::mul        11.50%      57.855ms        11.56%      58.131ms       2.236ms            26
                    aten::sigmoid        11.39%      57.277ms        11.40%      57.321ms       2.205ms            26
                 aten::dequantize        11.26%      56.655ms        11.29%      56.795ms     530.796us           107
                      aten::silu_         0.05%     238.570us         0.72%       3.618ms      46.382us            78
                       aten::silu         0.67%       3.379ms         0.67%       3.379ms      43.323us            78
        aten::quantize_per_tensor         0.34%       1.694ms         0.34%       1.694ms      20.918us            81
                       aten::mean         0.14%     719.232us         0.16%     803.762us      29.769us            27
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
```
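For reference, this is roughly the setup that produces numbers like these. It's a minimal sketch assuming the `torch.ao.quantization` FX graph mode API; the random calibration input and the 300x300 input size are placeholders, not my exact script:

```python
import timm
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
from torch.profiler import ProfilerActivity, profile, record_function

torch.backends.quantized.engine = 'qnnpack'

# Float model in eval mode; 300x300 is efficientnet_b3's default input size.
model = timm.create_model('efficientnet_b3', pretrained=True).eval()
example_inputs = (torch.randn(1, 3, 300, 300),)

# FX graph mode quantization: insert observers, run a calibration pass, convert.
qconfig_mapping = get_default_qconfig_mapping('qnnpack')
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
with torch.inference_mode():
    prepared(*example_inputs)  # calibration (use real data in practice)
quantized = convert_fx(prepared)

# Profile one inference and print a table like the one above.
with profile(activities=[ProfilerActivity.CPU]) as prof, \
        record_function("model_inference"), torch.inference_mode():
    quantized(*example_inputs)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```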
It’s sad that so much time is being eaten by `add` and `mul`: the average time for one `quantized::add` is 11 ms! Meanwhile the convs take about as long as in the FP32 version.
I also tried this with MobileNet V2 and got a 20x slowdown from 41 ms to 806 ms.
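The 41 ms vs 806 ms figures come from a plain wall-clock loop along these lines (a sketch: `fp32_model` and `quantized_model` stand in for the models before and after `convert_fx`, and the warm-up/iteration counts are arbitrary):

```python
import time
import torch

def benchmark(model, inputs, warmup=10, iters=50):
    """Average wall-clock latency per forward pass, in milliseconds."""
    with torch.inference_mode():
        for _ in range(warmup):  # let caches and thread pools settle
            model(*inputs)
        start = time.perf_counter()
        for _ in range(iters):
            model(*inputs)
    return (time.perf_counter() - start) / iters * 1e3

# fp32_model / quantized_model: placeholders, see the quantization sketch above.
inputs = (torch.randn(1, 3, 224, 224),)  # MobileNet V2's default input size
print(f"fp32:      {benchmark(fp32_model, inputs):.1f} ms")
print(f"quantized: {benchmark(quantized_model, inputs):.1f} ms")
```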