Quantized::cat running time is slower than fp32 model

Hey, I am working on quantizing my model, which runs on mobile devices. The issue is that the quantized::cat op runs much slower than the dequantized (FP32) one.
I printed a running-time comparison between the two ops.

Logs for Quantized one:

OP total_time :146462us
—RUNNING 2244 OP 588 # %852 : Tensor[] = prim::ListConstruct(%c2_ffm.1, %851, %843, %835)
—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];
—output TensorList;
OP total_time :15us
—RUNNING 2248 OP 589 # %input107.1 : Tensor = quantized::cat(%852, %8, %5, %6) # lib/python2.7/site-packages/torch/nn/quantized/modules/functional_modules.py:157:0
—input TensorList;Int;Double;Int;
—output Tensor:[1, 512, 240, 320];
OP total_time :3226438us

Logs for dequantized one:

—RUNNING 4103 OP 684 # %1264 : Tensor[] = prim::ListConstruct(%c2_ffm.1, %c3.1, %c4.1, %c50.1)
—input Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];Tensor:[1, 128, 240, 320];
—output TensorList;
OP total_time :15us
—RUNNING 4105 OP 685 # %input189.1 : Tensor = aten::cat(%1264, %8)
—input TensorList;Int;
—output Tensor:[1, 512, 240, 320];
OP total_time :281129us

I followed the official quantization document: I used nn.quantized.FloatFunctional() and called FloatFunctional.cat to concatenate all my tensors into one.
I wonder why the quantized::cat running time is so much slower than the dequantized one.
I could dequantize all my tensors first and use torch.cat, which would avoid the slow quantized concatenation. But since my tensors are so large, I cannot afford to dequantize them all first; that would make the overall running time even slower.
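For reference, here is a minimal sketch of the pattern in question, run eagerly on already-quantized tensors. The module path (torch.nn.quantized.QFunctional, the converted counterpart of FloatFunctional) and the scale/zero_point values are illustrative; newer releases expose the same class under torch.ao.nn.quantized:

```python
import torch

# Quantize four small float tensors (scale/zero_point are illustrative values).
inputs = [
    torch.quantize_per_tensor(
        torch.randn(1, 2, 4, 4), scale=0.1, zero_point=0, dtype=torch.quint8
    )
    for _ in range(4)
]

# QFunctional is what convert() turns FloatFunctional into; its cat()
# dispatches to the quantized::cat op seen in the logs above.
qf = torch.nn.quantized.QFunctional()
out = qf.cat(inputs, dim=1)
print(out.shape, out.is_quantized)  # torch.Size([1, 8, 4, 4]) True
```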

I’m using torch==1.3.1, torchvision==0.4.2

Thanks in Advance.

cc @Zafar can you take a look?

Did you find any workaround or fix for this problem? We are also facing a speed issue when SqueezeNet is quantized: the quantized SqueezeNet is slower than the FP32 model on an Android device, and SqueezeNet uses the 'Concat' operation in multiple places.

This is because the torch quantized concat op just piggybacks on FP32:

dequantize all inputs -> do the concat in FP32 -> quantize the concatenated tensor

So it will never be faster than the FP32 concat. See the implementation
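The fallback described above can be sketched functionally like this (an illustration of the behavior, not the actual ATen source; the helper name is hypothetical):

```python
import torch

def quantized_cat_fallback(qtensors, dim, scale, zero_point):
    # Mirrors the fallback: dequantize every input, concatenate in FP32,
    # then re-quantize the result. This round trip is why quantized::cat
    # cannot beat aten::cat on the same shapes.
    floats = [t.dequantize() for t in qtensors]
    out = torch.cat(floats, dim=dim)
    return torch.quantize_per_tensor(out, scale, zero_point, torch.quint8)

qs = [
    torch.quantize_per_tensor(torch.randn(1, 2, 3, 3), 0.05, 0, torch.quint8)
    for _ in range(2)
]
result = quantized_cat_fallback(qs, dim=1, scale=0.05, zero_point=0)
print(result.shape)  # torch.Size([1, 4, 3, 3])
```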

Other operators follow the same FP32 fallback approach (and are therefore slower than FP32), such as quantized element-wise add, mul, etc.