ConvBnReLU quantized performance

I am trying to quantize an FBNet model in PyTorch.
The quantized version has several times higher latency than fp32.
On a Raspberry Pi quantization does give some latency gain, but the model is still slow.

Here is the result of a small benchmark (just one Conv+BN+ReLU).

Is this expected behavior?

import torch
import torch.nn as nn
import torch.nn.quantized as nnq
from torch.quantization import QuantStub, DeQuantStub

class ConvBNRelu(nn.Module):
    def __init__(self, cfg):
        super(ConvBNRelu, self).__init__()
        self.conv = nn.Conv2d(in_channels=cfg['in_channels'],
                              out_channels=cfg['out_channels'],
                              kernel_size=cfg['kernel_size'],
                              stride=cfg['stride'],
                              padding=cfg['padding'])
        self.bn = nn.BatchNorm2d(num_features=cfg['out_channels'])
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

class Backbone(nn.Module):
    def __init__(self):
        super(Backbone, self).__init__()
        cfg = {
            'stem' : {
                'in_channels' : 3,
                'out_channels' : 32,
                'kernel_size' : (3, 3),
                'stride' : (2, 2),
                'padding' : (1, 1),
            },
        }
        self.stem = ConvBNRelu(cfg['stem'])
        self.quant = QuantStub()
        self.dequant = DeQuantStub()
    def forward(self, x):
        x = self.quant(x)
        x = self.stem(x)
        x = self.dequant(x)
        return x
    def fuse_model(self):
        torch.quantization.fuse_modules(self.stem, ['conv', 'bn', 'relu'], inplace=True)

model = Backbone()
x = torch.randn([1, 3, 34, 320], dtype=torch.float)*10

%timeit model(x) 

model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.backends.quantized.engine = 'qnnpack'
torch.quantization.prepare(model, inplace=True);
with torch.no_grad():
    for i in range(5):
        model(x)  # calibration pass; the loop body was lost in the post, so this line is assumed
torch.quantization.convert(model, inplace=True);

%timeit model(x) 
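
Note that `%timeit` only works inside IPython/Jupyter. For anyone reproducing this from a plain script (for example over SSH on the Raspberry Pi), here is a minimal sketch of an equivalent measurement using the standard-library `timeit` module. The helper name `bench` and its default `repeat`/`number` values are made up for illustration and are not part of the original benchmark.

```python
import timeit


def bench(fn, *args, repeat=5, number=20):
    """Return the best mean time per call of fn(*args), in milliseconds."""
    fn(*args)  # warm-up call, not timed
    timer = timeit.Timer(lambda: fn(*args))
    # repeat() returns one total time in seconds per round of `number` calls
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number * 1e3


# Hypothetical usage with the snippet above (not executed here):
#   fp32_ms = bench(model, x)   # before prepare/convert
#   int8_ms = bench(model, x)   # after convert
```

Taking the minimum over several rounds reduces noise from other processes, which tends to matter on small boards.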

Thanks for the post. It seems to be a perf issue in the int8 kernel implementation for this specific shape. We will investigate it.

Are you just trying this shape (in_channels=3, out_channels=32, kernel_size=3, stride=2 and padding=1), or is it actually used in a network? I ask because the first conv is usually in_channels=3, out_channels=64, kernel_size=7, stride=2 and padding=3, and we do have an optimized fast path for that in fbgemm.
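
To put the two layer shapes side by side, here is a rough sketch that computes the output size and multiply-accumulate (MAC) count for each, using the standard convolution output-size formula with dilation 1. The helper names `conv_out_size` and `conv_macs` are made up for illustration, and the 34x320 input is the one from the posted snippet. MAC counts only indicate how much arithmetic a layer needs; they say nothing about which kernel path a backend selects.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length along one spatial dim (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(in_hw, in_ch, out_ch, kernel, stride, padding):
    """Multiply-accumulates of one square-kernel conv over a single image."""
    out_h = conv_out_size(in_hw[0], kernel, stride, padding)
    out_w = conv_out_size(in_hw[1], kernel, stride, padding)
    return out_h * out_w * out_ch * in_ch * kernel * kernel


# Stem from the post: 3 -> 32 channels, 3x3, stride 2, padding 1
stem = conv_macs((34, 320), in_ch=3, out_ch=32, kernel=3, stride=2, padding=1)
# First conv described in the reply: 3 -> 64 channels, 7x7, stride 2, padding 3
usual = conv_macs((34, 320), in_ch=3, out_ch=64, kernel=7, stride=2, padding=3)
print(stem, usual)
```

Both layers produce a 17x160 output for this input. The 3x3 stem needs about 2.35M MACs and the 7x7 conv about 25.6M, so the stem is the much smaller layer in terms of raw arithmetic.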

These Conv2d parameters are actually used in FBNet-like architectures from maskrcnn_benchmark, in the “first” block (to be more precise, a ChamNet-Mobile-like architecture).

Oh, this is a typo:

x = torch.randn([1, 3, 34, 320], dtype=torch.float)*10

It should be [1, 3, 320, 320], but this doesn’t change the ratios of the time measurements.