Quantization causing reduced performance on Android

I have the following model, which I want to run on Android.


import torch
import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub

class depthwise_separable_conv(nn.Module):
    def __init__(self, nin, nout, kernel_size, kernels_per_layer=1):
        super(depthwise_separable_conv, self).__init__()
        self.depthwise = nn.Conv2d(nin, nin * kernels_per_layer, kernel_size=kernel_size, padding=1, groups=nin)
        self.pointwise = nn.Conv2d(nin * kernels_per_layer, nout, kernel_size=1)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        out = self.relu(out)
        return out
    
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = depthwise_separable_conv(1, 6, 5)
        self.conv2 = depthwise_separable_conv(6, 16, 5)
        self.conv3 = depthwise_separable_conv(16, 32, 5)
        self.pool = nn.AvgPool2d(2, 2)
        self.lrn = nn.LocalResponseNorm(2)
        self.fc1 = nn.Linear(32 * 6 * 13, 250)
        self.relu1 = nn.ReLU(inplace=False)
        self.fc2 = nn.Linear(250, 84)
        self.relu2 = nn.ReLU(inplace=False)
        self.fc3 = nn.Linear(84, 2)
        self.soft = nn.Softmax(dim=1)
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.pool((self.conv1(x)))
        x = self.pool((self.conv2(x)))
        x = self.pool((self.conv3(x)))
        x = self.dequant(x)
        # LocalResponseNorm has no quantized implementation, so it runs in float
        x = self.lrn(x)
        x = self.quant(x)
        x = x.reshape(-1, 32 * 6 * 13)
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        x = self.fc3(x)
        x = self.dequant(x)
        x = self.soft(x)
        return x
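
For completeness, a quick eager-mode sanity check of the model (the input size (62, 118) is just one that works out to the 32 * 6 * 13 flatten after the three conv/pool stages; the real data may have a different resolution):

# (62, 118) is inferred from the 32 * 6 * 13 flatten and may not match the real input size
net = Net()
net.eval()
dummy = torch.randn(1, 1, 62, 118)
with torch.no_grad():
    out = net(dummy)
print(out.shape)  # torch.Size([1, 2]), probabilities from the softmax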

I created two versions, with and without quantization (the non-quantized model simply doesn't have the quant() and dequant() parts).

I performed quantization using the following code:

from torch.quantization.quantize_fx import prepare_fx, convert_fx

backend = "qnnpack"
qconfig = torch.quantization.get_default_qconfig(backend)
net.qconfig = qconfig
torch.backends.quantized.engine = backend

qconfig_dict = {"": qconfig}
quant_net = net
quant_net.eval()  # post-training quantization expects the model in eval mode
quant_net = prepare_fx(quant_net, qconfig_dict)
quant_net(torch.Tensor(batch))  # calibrate the observers on a representative batch
quant_net = convert_fx(quant_net)
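
As a sanity check after convert_fx, the quantized model can be compared against the float one on the same calibration batch (net is assumed to still be the un-converted float model here, which it is above since prepare_fx and convert_fx return new graph modules):

with torch.no_grad():
    float_out = net(torch.Tensor(batch))        # original float model
    quant_out = quant_net(torch.Tensor(batch))  # converted quantized model
# both models end in a softmax, so the outputs are probabilities; the gap should be small
print("max abs diff:", (float_out - quant_out).abs().max().item())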

and scripted both models using

from torch.utils.mobile_optimizer import optimize_for_mobile

traced_script_module = torch.jit.script(quant_net)
traced_script_module_optimized = optimize_for_mobile(traced_script_module)
traced_script_module_optimized._save_for_lite_interpreter(MODEL_DIR + "stQuant_lite.ptl")
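
As a first check that quantization actually shrank the weights, the on-disk sizes of the two .ptl files can be compared ("stFloat_lite.ptl" below is just a placeholder for whatever the non-quantized model was saved as):

import os

for name in ["stQuant_lite.ptl", "stFloat_lite.ptl"]:  # second name is a placeholder
    print(name, os.path.getsize(MODEL_DIR + name) / 1e6, "MB")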

In Python, I see an inference-time reduction with quantization, but the reverse happens on Android. Moreover, on Android the RAM usage is also higher for the quantized model.
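
For reference, the Python-side comparison can be done with a simple wall-clock loop along these lines (the input shape and iteration counts are placeholders, not the actual benchmark used):

import time

def bench(model, x, warmup=10, iters=100):
    # average wall-clock time per forward pass on CPU
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        start = time.time()
        for _ in range(iters):
            model(x)
    return (time.time() - start) / iters

x = torch.randn(1, 1, 62, 118)  # placeholder shape, same as the sanity check above
print("float    :", bench(net, x))
print("quantized:", bench(quant_net, x))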

Also, the weirdest thing happens on Android: if I rename the scripted model "stQuant_lite.ptl" to "stQuant_lite_11.ptl", keeping everything else the same (I literally just use Refactor -> Rename), I get the following error:

Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend.

I'm using torch 1.9.0 in Python on Linux and pytorch_lite:1.9.0 on Android. My app is almost identical to the HelloWorldApp, except that I feed an empty FloatBuffer to the model instead of an image.

Any help is greatly appreciated.

Both questions sound like they are actually more related to PyTorch Mobile. Could you add a PyTorch Mobile tag?

Sorry, I'm new and I don't see any option to edit the topic and change its category to mobile.

Could you make a post in the PyTorch Mobile category (Mobile - PyTorch Forums)? We can close this one.

Created another topic: https://discuss.pytorch.org/t/quantization-causing-reduced-performance-on-pytorch-android/129434
