QuantizedConv2d with stride=(2, 2) is extremely slow

I discovered this issue while trying to understand why my quantized version of EfficientNet-b4 runs slower than the float one.

I use torch 1.6.0 with the default fbgemm qconfig (torch.quantization.get_default_qconfig('fbgemm')).

The torchprof library shows that _depthwise_conv in blocks 2, 6, 10, and 22 is the key problem: these layers are 3-10 times slower than in the float model!
They differ from the hassle-free layers only in their stride, (2, 2) instead of (1, 1); the kernel sizes are the usual (3, 3) and (5, 5). I do not include the full list of hassle-free layers in the screenshot, but I have checked them.
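For reference, the stride and padding of every block's depthwise conv can be dumped with something like the sketch below (this assumes the lukemelas efficientnet_pytorch implementation, where each MBConv block exposes a _depthwise_conv attribute):

from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_name('efficientnet-b4')
for i, block in enumerate(model._blocks):
    dw = block._depthwise_conv
    # In this implementation the "same" padding is applied by a separate
    # static_padding module, so dw.padding itself is (0, 0).
    print(i, 'kernel:', dw.kernel_size, 'stride:', dw.stride,
          'pad:', getattr(dw, 'static_padding', dw.padding))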

Is this the expected behavior? What could be the reason?

How do you do the quantization: manually or through graph quantization?
Does graph quantization give you the same issues?

cc @dskhudia for whether this is expected in fbgemm

It works much better if you use equal padding for your 2, 6, 10, and 22 depthwise convs. With equal padding, the depthwise conv goes through a fast path.

Equal padding for a (3, 3) kernel would be (1, 1), and for a (5, 5) kernel it would be (2, 2) (similar to the padding you already have for the (3, 3) and (5, 5) kernel sizes in the other convolutions).
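A minimal sketch of the difference (the channel count 112 is arbitrary; the ZeroPad2d form mimics the asymmetric "same" padding some EfficientNet implementations use, while the second conv uses equal padding and so can hit the fbgemm depthwise fast path):

import torch.nn as nn

# Asymmetric "same" padding applied outside the conv (the conv itself has padding=0):
# after quantization this depthwise conv does not take the fast path.
slow_dw = nn.Sequential(
    nn.ZeroPad2d((0, 1, 0, 1)),  # left, right, top, bottom
    nn.Conv2d(112, 112, kernel_size=3, stride=2, groups=112, bias=False),
)

# Equal padding inside the conv: padding=1 for a 3x3 kernel, padding=2 for 5x5.
# For even input sizes the output resolution matches; only border behaviour differs slightly.
fast_dw = nn.Conv2d(112, 112, kernel_size=3, stride=2, padding=1,
                    groups=112, bias=False)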


Unfortunately, I do not understand what you mean by “manually or through graph quantization”.

If you are asking whether I profile the TorchScript model or the original Python one: I have done both, and both show bad performance. Layer-by-layer profiling via torchprof does not work with TorchScript models, but I expect the “slow” layers to stay slow in both the original model and the TorchScript one, even if timing the original model is less precise.
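The per-layer numbers come from torchprof, used roughly like this (a sketch; model and the input shape are placeholders):

import torch
import torchprof

x = torch.randn(1, 3, 380, 380)
with torchprof.Profile(model, use_cuda=False) as prof:
    with torch.no_grad():
        model(x)
print(prof.display(show_events=False))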

Quantization is done the same way as in the MobileNet tutorial.
Here is the code snippet:

import multiprocessing

import torch
from torch.quantization import QuantStub


def try_config(qconfig, one_thread_inference=True, calibration_max_batches=None, metrics_max_batches=None):
    # Fuse modules (my own implementation for EfficientNet; do you need it?)
    q_model = fuse_modules()

    # Apply the config. This looks complex because I do not quantize _conv_stem (the 1st convolution).
    for block in q_model.feature_extractor._blocks:
        block.qconfig = qconfig
    q_model.quant = QuantStub(qconfig)
    print(qconfig)

    torch.quantization.prepare(q_model, inplace=True,
                               white_list=(
                                   torch.nn.Conv2d,
                                   torch.nn.BatchNorm2d,
                                   torch.quantization.stubs.QuantStub,
                                   torch.nn.quantized.modules.functional_modules.FloatFunctional
                               ))
    q_model.eval()

    print('Post Training Quantization Prepare: Inserting Observers')
    if one_thread_inference:
        torch.set_num_threads(1)
    inference(q_model, dev_loader, max_batches=calibration_max_batches)  # custom func that just runs the model

    print('Post Training Quantization: Calibration done')

    # Convert to quantized model
    torch.quantization.convert(q_model, inplace=True)
    print('Post Training Quantization: Convert done')
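And a rough sketch of how the float vs quantized comparison is timed (float_model and q_model are placeholders for the two models; single-threaded, matching the one_thread_inference setting above):

import time

import torch

torch.set_num_threads(1)
x = torch.randn(1, 3, 380, 380)  # EfficientNet-b4 input resolution

with torch.no_grad():
    for name, model in [('float', float_model), ('quantized', q_model)]:
        start = time.time()
        for _ in range(10):
            model(x)
        print(name, 'avg sec/forward:', (time.time() - start) / 10)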