Why is a quantized_conv2d layer much slower than fp32?

Hello.

When I applied quantization to a convolution-based model, I found that the inference latency increased.
As a specific example, the latency of a single conv2d layer (input channels: 4, output channels: 16) is about 1.249 ms as measured by the PyTorch Profiler. After quantization, the latency of the same layer is about 4.716 ms.
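The comparison looks roughly like the sketch below (the kernel size, input resolution, and eager-mode static quantization flow are illustrative assumptions, not my exact setup):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Minimal module wrapping a single conv layer with the channel counts from the post.
# Kernel size and input resolution are assumptions for illustration.
class ConvBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = nn.Conv2d(4, 16, kernel_size=3, padding=1)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

x = torch.randn(1, 4, 224, 224)

# fp32 baseline, profiled per-operator on CPU
m = ConvBlock().eval()
with profile(activities=[ProfilerActivity.CPU]) as prof_fp32:
    m(x)
print(prof_fp32.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Eager-mode post-training static quantization of the same module
# (backend availability depends on the PyTorch build)
m.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(m)
prepared(x)  # calibration pass to collect activation statistics
quantized = torch.ao.quantization.convert(prepared)

with profile(activities=[ProfilerActivity.CPU]) as prof_int8:
    quantized(x)
print(prof_int8.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```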

I think the reason might be that the output channel count is larger than the input channel count. Is this right?
I would also like to know the specific purpose of the operators that appear under the quantized conv layer (e.g. aten::contiguous, aten::empty_like, etc.). Where should I look to learn more about these operators?

Thanks in advance for your help.

Hi Je,

What qengine and hardware are you running this on? Unfortunately, individual quantized ops are not heavily optimized at the moment. To speed up the overall model, however, you can try fusing adjacent layers together (see the sketch below). For the other ops, you can find more information in the PyTorch API docs (e.g. empty_like and contiguous); the "aten" prefix just refers to the native C++ implementation of these ops.
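As a rough sketch of what I mean (the module stack, names, and shapes here are just for illustration), you can check the active engine and fuse a Conv-BN-ReLU block before quantizing so the backend runs a single fused kernel instead of three separate ops:

```python
import torch
import torch.nn as nn

# Which quantized backend is active: "fbgemm" targets x86 servers,
# "qnnpack" targets ARM/mobile (availability depends on the build).
print(torch.backends.quantized.supported_engines)
print(torch.backends.quantized.engine)

# Illustrative Conv-BN-ReLU stack; fusion must be done in eval mode
# before prepare/convert.
model = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
).eval()

fused = torch.ao.quantization.fuse_modules(model, [["0", "1", "2"]])
print(fused)
```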

Best,
-Andrew