Why is a quantized conv2d layer much slower than fp32?

When I applied quantization to a convolution-based model, I found that the inference latency increased.
As a specific example, the latency of a single conv2d layer (input channels: 4, output channels: 16) is about 1.249 ms, as measured by the PyTorch Profiler. After quantization, the latency of the same layer is about 4.716 ms.
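
In case it helps, this is roughly how I set up the comparison (a minimal sketch, not my exact benchmark: the 1x4x224x224 input shape, the quantization parameters, and the fbgemm engine are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.ao.nn.quantized as nnq  # torch.nn.quantized on older releases
from torch.profiler import profile, ProfilerActivity

# fp32 conv layer matching the channel counts quoted above;
# the 1x4x224x224 input shape is an assumption for illustration.
x = torch.randn(1, 4, 224, 224)
conv_fp32 = nn.Conv2d(4, 16, kernel_size=3)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    conv_fp32(x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Quantized counterpart: quantized weights plus a quantized input tensor.
torch.backends.quantized.engine = "fbgemm"  # assumes an x86 CPU
conv_q = nnq.Conv2d(4, 16, kernel_size=3)
x_q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    conv_q(x_q)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```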

I think the reason might be that the number of output channels is larger than the number of input channels. Is this right?
I would also like to know the specific purpose of the operators that show up under the quantized conv layer (e.g. aten::contiguous, aten::empty_like, etc.). Where should I look to learn more about these operators?

Thanks in advance for your help.

Hi Je,

What qengine and hardware are you running this on? Unfortunately, individual quantized ops are not heavily optimized at the moment. To speed up the overall model, however, you can try fusing adjacent layers together. For the other ops, you can find more information in the PyTorch API docs (e.g. empty_like and contiguous), and the "aten" prefix just refers to the native C++ implementation of these ops.
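
For layer fusion with eager-mode quantization, something along these lines should work (a minimal sketch: the ConvBlock module is hypothetical, its channel counts just mirror your example, and the fbgemm qconfig assumes an x86 CPU):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, fuse_modules, get_default_qconfig, prepare,
)

# Hypothetical conv block for illustration only;
# channel counts match the example in the question.
class ConvBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(4, 16, kernel_size=3)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.bn(self.conv(self.quant(x)))))

model = ConvBlock().eval()

# Fuse conv + bn + relu so the quantized kernel runs them in one pass
# instead of dispatching three separate ops.
fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)

model.qconfig = get_default_qconfig("fbgemm")  # assumes an x86 CPU
prepare(model, inplace=True)
model(torch.randn(1, 4, 224, 224))  # one calibration pass with sample data
convert(model, inplace=True)
```

The fusion step is where most of the speedup tends to come from, since the fused quantized op avoids dequantizing and requantizing the intermediate activations between layers.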