Dynamic quantization of timm models

I am trying to quantize a pretrained model from the timm library, but it does not work. Why does it fail, and how can timm models be quantized?

import timm
import torch

model = timm.create_model('mobilenetv2_120d', pretrained=True)

model_int8 = torch.quantization.quantize_dynamic(
    model,  # the original model
    {torch.nn.Linear, torch.nn.Conv2d, torch.nn.ReLU, torch.nn.BatchNorm2d},  # module types to quantize
    dtype=torch.qint8)

print_model_size(model)
print_model_size(model_int8)
23.74 MB
23.74 MB
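
The `print_model_size` helper is not shown in the post. A minimal sketch of what such a helper might look like, assuming it serializes the `state_dict` and reports the byte count in megabytes (the body and the returned value are assumptions, not the poster's actual code):

```python
import io

import torch
import torch.nn as nn


def print_model_size(model):
    """Hypothetical helper: print the serialized size of a model's state_dict in MB."""
    buffer = io.BytesIO()
    # Serialize weights into memory instead of writing a temporary file.
    torch.save(model.state_dict(), buffer)
    size_mb = buffer.getbuffer().nbytes / 1e6
    print(f"{size_mb:.2f} MB")
    return size_mb


# Usage on a small stand-in module (not the timm model from the post).
size = print_model_size(nn.Linear(10, 10))
```

With a helper like this, identical sizes before and after quantization suggest that almost none of the weights were actually converted to int8.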

Of the module types you've specified, dynamic quantization currently supports only Linear. Can you print the quantized model to check how many layers were actually quantized?

Yes, it really only works for Linear.

    (conv_head): Conv2d(384, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn2): BatchNorm2d(1280, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (act2): ReLU6(inplace=True)
    (global_pool): SelectAdaptivePool2d (pool_type=avg, flatten=True)
    (classifier): DynamicQuantizedLinear(in_features=1280, out_features=5, dtype=torch.qint8, qscheme=torch.per_tensor_affine)

But my goal is to measure the evaluation time of the quantized model and compare it with that of the float32 model. For that, I tried static quantization:

model_sigmoid.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# insert observers
torch.quantization.prepare(model_sigmoid, inplace=True)
# Calibrate the model and collect statistics

# convert to quantized version
torch.quantization.convert(model_sigmoid, inplace=True)

This code quantizes all the layers, but I can't run the quantized model because of this error:

import time

start_time = time.time()

with torch.no_grad():
    # with torch.autograd.set_detect_anomaly(True):
    pred = model_sigmoid(torch_img)

print('Time = ', time.time() - start_time)
RuntimeError: Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend. 'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].

I understand that this backend does not support CPU and CUDA, so my question is: is it possible to run this statically quantized model on Windows 10 (x64)? It would also be great if you could map each backend listed in the RuntimeError to the device on which it can run.

For static quantization, in addition to setting the qconfig, you also need to add QuantStub/DeQuantStub modules around the modules you want quantized.

The tutorial "(beta) Static Quantization with Eager Mode in PyTorch" (PyTorch Tutorials 1.8.1+cu102 documentation) has more details on how to do so.
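
The stub placement described above can be sketched roughly as follows. The error in the question is what typically appears when a plain float tensor reaches a quantized conv: the `QuantStub` converts the input to a quantized tensor first, and the `DeQuantStub` converts the output back to float. The wrapper class `QuantWrapped`, the toy network, the random calibration data, and the engine selection are all assumptions made for illustration; a real timm model may also need module fusion and other changes covered in the tutorial.

```python
import torch
import torch.nn as nn


class QuantWrapped(nn.Module):
    """Hypothetical wrapper: quantize the input, run the model, dequantize the output."""

    def __init__(self, float_model):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> quantized tensor
        self.model = float_model
        self.dequant = torch.quantization.DeQuantStub()  # quantized -> float tensor

    def forward(self, x):
        x = self.quant(x)
        x = self.model(x)
        return self.dequant(x)


# Toy stand-in for the pretrained network (assumption).
float_model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU()).eval()

# Pick an engine this build supports; fbgemm is the usual choice on x86 CPUs.
engine = "fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines else "qnnpack"
torch.backends.quantized.engine = engine

wrapped = QuantWrapped(float_model).eval()
wrapped.qconfig = torch.quantization.get_default_qconfig(engine)

# Insert observers, then calibrate (random data here; use real samples in practice).
torch.quantization.prepare(wrapped, inplace=True)
with torch.no_grad():
    for _ in range(4):
        wrapped(torch.randn(2, 3, 16, 16))

# Swap observed modules for their quantized counterparts.
torch.quantization.convert(wrapped, inplace=True)

with torch.no_grad():
    out = wrapped(torch.randn(1, 3, 16, 16))  # ordinary float CPU tensor in, float tensor out
```

Because the stubs handle the conversion at both ends, the caller keeps passing ordinary float CPU tensors, so timing this model against the float32 one needs no changes to the calling code.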