How to skip quantization of certain layers in static PTQ?

  • PyTorch version: 1.8.1

I’ve applied static PTQ by changing the source code of Swin Transformer from mmclassification (inserting de/quant stubs). I’ve had to change the source code of a lot of other files too, because the model is not exactly standard or simple.

I used this function to perform the quantization of a loaded model:

import torch

def static_quantize(m, data_loader):
    backend = 'qnnpack'
    torch.backends.quantized.engine = backend
    m.eval()

    # Attach the default qconfig to the root module and insert observers.
    m.qconfig = torch.quantization.get_default_qconfig(backend)
    torch.quantization.prepare(m, inplace=True)

    # Calibrate on roughly the first 100 batches.
    with torch.no_grad():
        for i, data in enumerate(data_loader):
            m(return_loss=False, **data)
            if i > 100:
                break

    # Swap the observed modules for their quantized counterparts.
    torch.quantization.convert(m, inplace=True)

    return m # I realize this is unnecessary

However, I’ve noticed a significant drop in accuracy (around 30%). To combat this, I would like to quantize layers selectively, i.e. skip quantization for the layers that are problematic. As far as I can tell, prepare and convert recursively quantize everything they can in the model.

For me the simplest way of doing this would be to comment out the de/quant ops in the model source code. Of course this doesn’t actually work because these stubs aren’t used to detect which layers should be quantized.

So how can I tell prepare which layers to skip? Furthermore how can I tell it to skip one Linear layer, but quantize some other Linear layer (if type-level granularity is not enough)?
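For anyone landing here with the same question, here is a minimal sketch of one approach, under stated assumptions. In eager mode, prepare only observes modules whose qconfig is not None, so setting qconfig = None on a specific submodule instance should leave that one layer in float while other layers of the same type are still quantized. The model below (TinyNet, fc1, fc2) is a hypothetical stand-in, not the Swin Transformer from the question. The skipped layer needs a float input, so the DeQuantStub is placed before it.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical two-layer model; fc2 is the layer kept in float."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(8, 8)
        self.dequant = torch.quantization.DeQuantStub()
        self.fc2 = nn.Linear(8, 4)

    def forward(self, x):
        x = self.quant(x)
        x = self.fc1(x)      # runs quantized after convert
        x = self.dequant(x)  # back to float before the skipped layer
        return self.fc2(x)   # stays a regular float nn.Linear


# Fall back to fbgemm if this build does not ship qnnpack.
engines = torch.backends.quantized.supported_engines
backend = "qnnpack" if "qnnpack" in engines else "fbgemm"
torch.backends.quantized.engine = backend

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig(backend)
model.fc2.qconfig = None  # per-instance opt-out; fc1 keeps the root qconfig

torch.quantization.prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(16, 8))  # calibration pass on dummy data
torch.quantization.convert(model, inplace=True)

print(type(model.fc1))  # a quantized Linear
print(type(model.fc2))  # still torch.nn.Linear
```

Because the opt-out is set on the module instance rather than on the type, this gives finer control than type-level granularity: one Linear can be skipped while another is quantized.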

It turns out that the accuracy drop was not due to excessive quantization, but to reusing the same QuantStub instances in different places of the same class. These objects seem to serve a dual purpose: collecting quantization statistics and marking quantization boundaries (which does make sense). They’re stateful, so each place where a tensor enters the quantized region needs its own instance.
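To illustrate the point about stateful stubs, here is a small sketch with a hypothetical module (TwoInputs) that uses one QuantStub per input. The two inputs have very different ranges, and after calibration each stub ends up with its own scale. The names and the dummy data are assumptions made for the example only.

```python
import torch
import torch.nn as nn


class TwoInputs(nn.Module):
    """Hypothetical module with one QuantStub per quantization boundary."""

    def __init__(self):
        super().__init__()
        self.quant_a = torch.quantization.QuantStub()
        self.quant_b = torch.quantization.QuantStub()
        # As far as I can tell, DeQuantStub gets no observer, so it is shared here.
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, a, b):
        return self.dequant(self.quant_a(a)), self.dequant(self.quant_b(b))


# Fall back to fbgemm if this build does not ship qnnpack.
engines = torch.backends.quantized.supported_engines
backend = "qnnpack" if "qnnpack" in engines else "fbgemm"
torch.backends.quantized.engine = backend

model = TwoInputs().eval()
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.quantization.prepare(model, inplace=True)

a = torch.linspace(0, 1, 64).reshape(8, 8)  # small-range input
b = a * 100                                 # large-range input
with torch.no_grad():
    model(a, b)  # each stub's observer sees only its own input
torch.quantization.convert(model, inplace=True)

# After convert, each stub is a Quantize module with its own scale buffer.
print(float(model.quant_a.scale), float(model.quant_b.scale))
```

If a single stub were shared between the two inputs, one observer would see both ranges, and the small-range input would be quantized with the coarse scale chosen for the large-range one.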