Hey, I am working on quantizing my model, which runs on mobile devices. The issue is that the quantized::cat op runs much slower than the dequantized (float) one.

I have printed a running-time comparison between the two ops.

Logs for the quantized one:


OP total_time :146462us

—RUNNING 2244 OP 588 # %852 : Tensor[] = prim::ListConstruct(%c2_ffm.1, %851, %843, %835)

—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];

—output TensorList;

OP total_time :15us

—RUNNING 2248 OP 589 # %input107.1 : Tensor = quantized::cat(%852, %8, %5, %6) # lib/python2.7/site-packages/torch/nn/quantized/modules/functional_modules.py:157:0

—input TensorList;Int;Double;Int;

—output Tensor:[1, 512, 240, 320];

OP total_time :3226438us

Logs for the dequantized (float) one:


—RUNNING 4103 OP 684 # %1264 : Tensor[] = prim::ListConstruct(%c2_ffm.1, %c3.1, %c4.1, %c50.1)

—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];

—output TensorList;

OP total_time :15us

—RUNNING 4105 OP 685 # %input189.1 : Tensor = aten::cat(%1264, %8)

—input TensorList;Int;

—output Tensor:[1, 512, 240, 320];

OP total_time :281129us

I followed the official quantization documentation: I used nn.quantized.FloatFunctional() and called FloatFunctional.cat to concatenate all my tensors into one.
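Roughly, my usage looks like this (a minimal sketch with reduced shapes, not my actual model; before convert(), FloatFunctional.cat behaves like torch.cat, and after convert() it is replaced by quantized::cat):

```python
import torch

# FloatFunctional is the quantization-aware stand-in for torch.cat,
# as recommended by the quantization docs.
ff = torch.nn.quantized.FloatFunctional()

# Illustrative tensors; in my model each input is [1, 128, 240, 320].
a = torch.randn(1, 128, 4, 4)
b = torch.randn(1, 128, 4, 4)

# Concatenate along the channel dimension.
out = ff.cat([a, b], dim=1)
print(out.shape)  # torch.Size([1, 256, 4, 4])
```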

I wonder why quantized::cat takes so much longer than the dequantized one.

I could dequantize all my tensors first and use torch.cat, which would save time on the concatenation itself. But since my tensors are so large, I cannot afford to dequantize them all first; that would make the overall running time even slower.
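For clarity, this is the workaround I mean (again a reduced-shape sketch, not my real model): dequantize each input and fall back to the float aten::cat, at the cost of four large per-tensor dequantizations.

```python
import torch

# Illustrative quantized inputs; scale/zero_point are arbitrary here.
scale, zero_point = 0.1, 0
qa = torch.quantize_per_tensor(
    torch.randn(1, 128, 4, 4), scale, zero_point, torch.quint8)
qb = torch.quantize_per_tensor(
    torch.randn(1, 128, 4, 4), scale, zero_point, torch.quint8)

# Dequantize first, then use the ordinary float cat (aten::cat).
out = torch.cat([qa.dequantize(), qb.dequantize()], dim=1)
print(out.shape)  # torch.Size([1, 256, 4, 4])
```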

I'm using torch==1.3.1 and torchvision==0.4.2.

Thanks in advance.