Hey, I am working on quantizing my model, which runs on mobile devices. The issue is that the quantized::cat op runs much slower than the dequantized (float) one.
I printed a running-time comparison between the two ops.
Logs for the quantized one:

```
OP total_time :146462us
—RUNNING 2244 OP 588 # %852 : Tensor = prim::ListConstruct(%c2_ffm.1, %851, %843, %835)
—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];
—output TensorList;
OP total_time :15us
—RUNNING 2248 OP 589 # %input107.1 : Tensor = quantized::cat(%852, %8, %5, %6) # lib/python2.7/site-packages/torch/nn/quantized/modules/functional_modules.py:157:0
—input TensorList;Int;Double;Int;
—output Tensor:[1, 512, 240, 320];
OP total_time :3226438us
```
Logs for the dequantized one:

```
—RUNNING 4103 OP 684 # %1264 : Tensor = prim::ListConstruct(%c2_ffm.1, %c3.1, %c4.1, %c50.1)
—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];
—output TensorList;
OP total_time :15us
—RUNNING 4105 OP 685 # %input189.1 : Tensor = aten::cat(%1264, %8)
—input TensorList;Int;
—output Tensor:[1, 512, 240, 320];
OP total_time :281129us
```
I followed the official quantization document: I used nn.quantized.FloatFunctional() and call FloatFunctional.cat to concatenate all my tensors into one.
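Roughly, my setup looks like this (a stripped-down sketch with tiny shapes and the default qconfig, not my actual model):

```python
import torch
import torch.nn as nn

class ConcatBlock(nn.Module):
    # Sketch only. FloatFunctional observes the cat output during calibration;
    # convert() swaps it for QFunctional, whose cat dispatches to quantized::cat.
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.ff = nn.quantized.FloatFunctional()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, xs):
        xs = [self.quant(x) for x in xs]          # quantize each input
        return self.dequant(self.ff.cat(xs, dim=1))

m = ConcatBlock()
m.qconfig = torch.quantization.default_qconfig
torch.quantization.prepare(m, inplace=True)
xs = [torch.randn(1, 8, 4, 4) for _ in range(4)]
m(xs)                                             # calibration pass
torch.quantization.convert(m, inplace=True)
out = m(xs)                                       # cat now runs quantized::cat
print(out.shape)  # torch.Size([1, 32, 4, 4])
```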
I wonder why quantized::cat is so much slower than the dequantized aten::cat.
I could dequantize all my tensors first and use torch.cat, which would save time on the concatenation itself. But since my tensors are so large, I cannot afford to dequantize them all first; that would make the overall running time even slower.
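In op terms, the two alternatives I am weighing look like this (tiny shapes and an arbitrary scale/zero_point, just to show which ops run):

```python
import torch

# Four quantized tensors (tiny shapes here; my real ones are [1, 128, 240, 320])
xs = [torch.quantize_per_tensor(torch.randn(1, 8, 4, 4),
                                scale=0.1, zero_point=0, dtype=torch.quint8)
      for _ in range(4)]

# Path 1: quantized::cat, the op FloatFunctional.cat lowers to after convert()
q_out = torch.ops.quantized.cat(xs, dim=1, scale=0.1, zero_point=0)

# Path 2: dequantize everything first, then the plain float aten::cat
f_out = torch.cat([x.dequantize() for x in xs], dim=1)

print(q_out.shape, f_out.shape)  # both torch.Size([1, 32, 4, 4])
```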
I’m using torch==1.3.1, torchvision==0.4.2
Thanks in advance.