Quantization causing reduced performance on PyTorch Android

I have the following model, which I want to run on Android.

import torch
import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub

class depthwise_separable_conv(nn.Module):
    def __init__(self, nin, nout, kernel_size, kernels_per_layer=1):
        super(depthwise_separable_conv, self).__init__()
        self.depthwise = nn.Conv2d(nin, nin * kernels_per_layer, kernel_size=kernel_size, padding=1, groups=nin)
        self.pointwise = nn.Conv2d(nin * kernels_per_layer, nout, kernel_size=1)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        out = self.relu(out)
        return out
    
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = depthwise_separable_conv(1, 6, 5)
        self.conv2 = depthwise_separable_conv(6, 16, 5)
        self.conv3 = depthwise_separable_conv(16, 32, 5)
        self.pool = nn.AvgPool2d(2, 2)
        self.lrn = nn.LocalResponseNorm(2)
        self.fc1 = nn.Linear(32 * 6 * 13, 250)
        self.relu1 = nn.ReLU(inplace=False)
        self.fc2 = nn.Linear(250, 84)
        self.relu2 = nn.ReLU(inplace=False)
        self.fc3 = nn.Linear(84, 2)
        self.soft = nn.Softmax(dim=1)
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.pool((self.conv1(x)))
        x = self.pool((self.conv2(x)))
        x = self.pool((self.conv3(x)))
        x = self.dequant(x)
        x = self.lrn(x)
        x = self.quant(x)
        x = x.reshape(-1, 32 * 6 * 13)
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        x = self.fc3(x)
        x = self.dequant(x)
        x = self.soft(x)
        return x

I created two versions, with and without quantization (the non-quantized model doesn’t have the quant() and dequant() parts).

I performed quantization using the following code

backend = "qnnpack"
qconfig = torch.quantization.get_default_qconfig(backend)
net.qconfig = qconfig
torch.backends.quantized.engine = backend

qconfig_dict = {"": qconfig}
quant_net = net
quant_net = prepare_fx(quant_net, qconfig_dict)
quant_net(torch.Tensor(batch))  #calibrate 
quant_net = convert_fx(quant_net)

and scripted both models using

from torch.utils.mobile_optimizer import optimize_for_mobile

traced_script_module = torch.jit.script(quant_net)
traced_script_module_optimized = optimize_for_mobile(traced_script_module)
traced_script_module_optimized._save_for_lite_interpreter(MODEL_DIR + "stQuant_lite.ptl")

In Python, I can see an inference time reduction from quantization. But the reverse happens on Android. Moreover, on Android, the RAM usage is also higher for the quantized model.
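(For reference, the kind of Python-side timing comparison I mean is sketched below; bench is just an illustrative helper, and the input shape is merely one that is consistent with fc1 = 32 * 6 * 13, not necessarily the real one.)

import time
import torch

def bench(model, x, iters=100):
    # Average wall-clock time per forward pass over `iters` runs.
    with torch.no_grad():
        model(x)  # warm-up
        start = time.time()
        for _ in range(iters):
            model(x)
    return (time.time() - start) / iters

x = torch.randn(1, 1, 62, 118)  # placeholder shape that matches fc1 = 32 * 6 * 13
print("fp32:     ", bench(net, x))
print("quantized:", bench(quant_net, x))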

Also, the weirdest thing happens on Android. If I rename the scripted model “stQuant_lite.ptl” to “stQuant_lite_11.ptl”, keeping everything else the same (I literally just use refactor -> rename), I get the following error:

Could not run ‘quantized::conv2d.new’ with arguments from the ‘CPU’ backend.

Similarly, when I used a model named ‘lite.ptl’, I even got the wrong output shape. However, when I renamed it to “_lite.plt”, it gave the expected output.

I’m using torch 1.9.0 in Python on Linux and pytorch_lite:1.9.0 on Android. My app is almost identical to the HelloWorld app, except that I feed an empty/zero-initialized FloatBuffer to the model instead of an image.

Any help is greatly appreciated.

Regarding the filename differences, it sounds like spooky action at a distance. Super weird. But generally, when you get this error “Could not run ‘quantized::conv2d.new’ with arguments from the ‘CPU’ backend.”, it means that your quantized::conv2d op is getting a float tensor as input.
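For illustration, that failure mode can be reproduced with a standalone quantized conv (just a sketch, not the op from your model):

import torch

qconv = torch.nn.quantized.Conv2d(1, 1, 3)  # standalone quantized conv, only to show the error

x = torch.randn(1, 1, 8, 8)
# qconv(x)  # float input -> RuntimeError: Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend

# Quantizing the input first makes the call succeed.
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
out = qconv(xq)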

Regarding runtime on Android: are you comparing fp32 model runtime on Android vs. quantized model runtime on Android?

Regarding the memory footprint: are you talking about peak memory, or about average memory utilization? And what is the difference compared to the fp32 model?

Yeah, I saw the same comment in multiple places, but the code runs using uint8; I’ve verified that in Python.

Yes, a lite version of both

I run a loop of 100 iterations to get a better measurement of the time consumption, so peak and average are almost the same (I guess the garbage collector can’t run fast enough between iterations). I’m seeing around 100 MB more memory usage for the quantized model.

Is it possible for you to do print(model.graph) in a Python file/shell and paste the output here?
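Something like the snippet below (assuming quant_net is the converted FX module and traced_script_module the scripted one from your earlier code):

# The converted FX model is a torch.fx.GraphModule, so its graph can be printed directly.
print(quant_net.graph)

# The scripted model exposes a TorchScript graph in the same way.
print(traced_script_module.graph)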

I noticed that quantization of the MobileNet-V2 model was working fine for me, just like in the PyTorch tutorial https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html?highlight=static%20quantization.
So, I started debugging by comparing against MobileNet-V2. I noticed two changes that helped my model also achieve better performance:

  1. using the _make_divisible(v, divisor, min_value=None) function (from the MobileNet-V2 code; see the sketch after this list) or manually setting the number of channels to be divisible by 8.
  2. using padding = (kernel_size - 1) // 2 in the convolution layers, instead of padding=0.
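For reference, _make_divisible is the small helper from the MobileNet-V2 reference code (reproduced here from memory of the torchvision source, so treat it as a sketch); it rounds a channel count to the nearest multiple of divisor without dropping more than 10% below the original value:

def _make_divisible(v, divisor, min_value=None):
    # Round `v` to the nearest multiple of `divisor`, never going below `min_value`
    # and never dropping more than 10% below the original value.
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

_make_divisible(6, 8)  # -> 8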

I changed the model to the following:

import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub, fuse_modules

class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        #padding = 0
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=True),
            #nn.BatchNorm2d(out_planes, momentum=0.1),
            # Replace with ReLU
            nn.ReLU(inplace=False)
        )
    def fuse(self):
        fuse_modules(self, ['0', '1'], inplace=True)
    
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        block = [
            ConvBNReLU(1,8,3,2),
            ConvBNReLU(8,16,3,1,8),
            nn.Conv2d(16,32,1,1),
            ConvBNReLU(32,8,1,1),
        ]
        self.features = nn.Sequential(*block)
        self.classifier = nn.Linear(8,2)
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.features(x)
        x = x.mean([2, 3])
        x = self.classifier(x)
        x = self.dequant(x)
        return x

and achieved a ~38MB reduction in memory on Android with quantization, by using the above 2 points.
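In case it helps, here is a sketch of how the fuse() method above fits into the eager-mode static quantization flow from the tutorial (calibration_batch is a placeholder for real calibration data; with the FX prepare_fx/convert_fx flow from the first snippet, the fusion is handled automatically instead):

import torch

net = Net().eval()

# Fuse Conv + ReLU inside each ConvBNReLU block before quantization.
for m in net.features:
    if isinstance(m, ConvBNReLU):
        m.fuse()

torch.backends.quantized.engine = "qnnpack"
net.qconfig = torch.quantization.get_default_qconfig("qnnpack")

prepared = torch.quantization.prepare(net)
prepared(calibration_batch)  # calibrate with representative data (placeholder name)
quant_net = torch.quantization.convert(prepared)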