Quantized::cat running time is slower than fp32 model

Hey, I am working on quantizing my model, which runs on mobile devices. The issue is that the quantized::cat op runs much slower than the dequantized (FP32) one.
I printed a running-time comparison between the two ops.

Logs for the quantized model:

OP total_time :146462us
—RUNNING 2244 OP 588 # %852 : Tensor = prim::ListConstruct(%c2_ffm.1, %851, %843, %835)
—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];
—output TensorList;
OP total_time :15us
—RUNNING 2248 OP 589 # %input107.1 : Tensor = quantized::cat(%852, %8, %5, %6) # lib/python2.7/site-packages/torch/nn/quantized/modules/functional_modules.py:157:0
—input TensorList;Int;Double;Int;
—output Tensor:[1, 512, 240, 320];
OP total_time :3226438us

Logs for the dequantized (FP32) model:

—RUNNING 4103 OP 684 # %1264 : Tensor = prim::ListConstruct(%c2_ffm.1, %c3.1, %c4.1, %c50.1)
—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];
—output TensorList;
OP total_time :15us
—RUNNING 4105 OP 685 # %input189.1 : Tensor = aten::cat(%1264, %8)
—input TensorList;Int;
—output Tensor:[1, 512, 240, 320];
OP total_time :281129us

I followed the official quantization documentation: I use nn.quantized.FloatFunctional() and call FloatFunctional.cat to concatenate all my tensors into one (see the sketch below).
I wonder why quantized::cat is so much slower than the FP32 aten::cat.
I could dequantize all my tensors first and use torch.cat, which would save time on the concatenation itself. But since the tensors are so large, I cannot afford to dequantize them all first; that would make the overall running time even slower.
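Here is a minimal sketch of how the concat is wired up (module and tensor names are placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class ConcatBlock(nn.Module):
    # Toy module: the real model has more layers, this only shows the cat.
    def __init__(self):
        super(ConcatBlock, self).__init__()
        # FloatFunctional replaces torch.cat so that quantization
        # (observers, scale/zero_point) can be attached to this op.
        self.qcat = nn.quantized.FloatFunctional()

    def forward(self, c2, c3, c4, c5):
        # Four [1, 128, 240, 320] feature maps -> [1, 512, 240, 320].
        # After convert(), this call shows up as quantized::cat in the graph.
        return self.qcat.cat([c2, c3, c4, c5], dim=1)
```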

I’m using torch==1.3.1, torchvision==0.4.2

Thanks in advance.

cc @Zafar, can you take a look?

Did you find any workaround or fix for this problem? We are also facing a speed issue when SqueezeNet is quantized: the quantized SqueezeNet is slower than the FP32 model on an Android device, and SqueezeNet uses the ‘Concat’ operation in multiple places.

This is because the torch quantized concat op just piggybacks on FP32:

dequantize all inputs -> do the concat in FP32 -> quantize the concatenated tensor

So it will never be faster than the FP32 concat. See the implementation.
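In other words, the quantized op is roughly equivalent to this Python sketch (an illustration of the fallback path, not the actual C++ kernel; quantized_cat_fallback is a made-up name):

```python
import torch

def quantized_cat_fallback(qtensors, dim, scale, zero_point):
    # Illustration only: dequantize every input, concatenate in FP32,
    # then re-quantize the result with the output scale/zero_point.
    float_tensors = [t.dequantize() for t in qtensors]
    out = torch.cat(float_tensors, dim=dim)
    return torch.quantize_per_tensor(out, scale, zero_point, torch.quint8)
```

Those extra dequantize/quantize passes over the four [1, 128, 240, 320] inputs are what you pay for on top of the FP32 cat.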

There are other operators that follow the same FP32 fallback approach (and are hence slower than FP32), such as quantized element-wise add, mul, etc.