Maintaining sparsity when quantizing

Hello! I have a pre-trained pruned network with 75% sparsity.
I would like to apply quantization to this network such that its sparsity is maintained during inference. I’ve opted for symmetric quantization, and my understanding is that the zero point should therefore be 0. However, I get zero_point=128. Below is a snippet of my code:

import torch

model.eval()
model.to('cpu')

# Symmetric min/max observers for both activations and weights
quantization_config = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_symmetric),
    weight=torch.quantization.MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)
model.qconfig = quantization_config

quant_model = torch.quantization.prepare(model)
calibrate(quant_model, train_loader, batches_per_epoch)  # my calibration loop
quant_model = torch.quantization.convert(quant_model)
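
For reference, the same zero point already shows up with a single observer configured like the activation observer above (a minimal check with random data, included just to isolate the behaviour):

import torch

obs = torch.quantization.MinMaxObserver(dtype=torch.quint8,
                                        qscheme=torch.per_tensor_symmetric)
obs(torch.randn(4, 8))          # observe some data
print(obs.calculate_qparams())  # the zero point comes out as 128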

When I print quant_model, this is the output:

VGGQuant(
  (features): Sequential(
    (0): QuantizedConv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.07883524149656296, zero_point=128, padding=(1, 1))
    (1): QuantizedBatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): QuantizedConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.05492561683058739, zero_point=128, padding=(1, 1))
    (4): QuantizedBatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): QuantizedConv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.05388055741786957, zero_point=128, padding=(1, 1))
    (8): QuantizedBatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU(inplace=True)
    (10): QuantizedConv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.03040805645287037, zero_point=128, padding=(1, 1))
    (11): QuantizedBatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU(inplace=True)
    (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (14): QuantizedConv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.023659387603402138, zero_point=128, padding=(1, 1))
    (15): QuantizedBatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU(inplace=True)
    (17): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.01725710742175579, zero_point=128, padding=(1, 1))
    (18): QuantizedBatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU(inplace=True)
    (20): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.013385827653110027, zero_point=128, padding=(1, 1))
    (21): QuantizedBatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU(inplace=True)
    (23): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.011628611013293266, zero_point=128, padding=(1, 1))
    (24): QuantizedBatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (25): ReLU(inplace=True)
    (26): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (27): QuantizedConv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.00966070219874382, zero_point=128, padding=(1, 1))
    (28): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (29): ReLU(inplace=True)
    (30): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.006910551339387894, zero_point=128, padding=(1, 1))
    (31): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (32): ReLU(inplace=True)
    (33): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.002619387349113822, zero_point=128, padding=(1, 1))
    (34): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (35): ReLU(inplace=True)
    (36): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.002502179006114602, zero_point=128, padding=(1, 1))
    (37): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (38): ReLU(inplace=True)
    (39): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (40): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.00118942407425493, zero_point=128, padding=(1, 1))
    (41): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (42): ReLU(inplace=True)
    (43): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.0017956980736926198, zero_point=128, padding=(1, 1))
    (44): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (45): ReLU(inplace=True)
    (46): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.0021184098441153765, zero_point=128, padding=(1, 1))
    (47): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (48): ReLU(inplace=True)
    (49): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.0019303301814943552, zero_point=128, padding=(1, 1))
    (50): QuantizedBatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (51): ReLU(inplace=True)
    (52): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (53): AvgPool2d(kernel_size=1, stride=1, padding=0)
  )
  (classifier): QuantizedLinear(in_features=512, out_features=10, scale=0.0953117236495018, zero_point=128, qscheme=torch.per_tensor_affine)
  (quant): Quantize(scale=tensor([0.0216]), zero_point=tensor([128]), dtype=torch.quint8)
  (dequant): DeQuantize()
)

Should I use a different quantization scheme? Is there something I’m missing? I’d like the zero point to be 0 for all layers.

Well, if you want 0 to map to 0, I think you want signed integers rather than unsigned ones (which are the default for activations): with quint8 and a symmetric qscheme, the observer centers the 0–255 range on 128, so real 0 maps to 128, whereas with qint8 it maps to 0. So use qint8 instead of quint8 for the activation observer as well (your weights already use it). Note that this may impact the operator coverage.
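
As a quick, untested sketch of what that would look like (same observers as yours, just with qint8 for the activations too):

import torch

qconfig = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
    weight=torch.quantization.MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)

# The activation observer now pins the zero point to 0:
obs = qconfig.activation()
obs(torch.randn(4, 8))          # feed it some data
print(obs.calculate_qparams())  # second element (zero_point) is 0
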
You probably know this, but just in case: for the sparsity to lead to less computation, you need special “structured sparse kernels”. I think they are being worked on, but it’s not what you get today.
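
If you want to sanity-check that the zeros actually survive once the zero point is 0, something along these lines should do (just a sketch, with a random tensor standing in for one of your pruned weights):

import torch

w = torch.randn(64, 64)
w[torch.rand_like(w) < 0.75] = 0.0          # fake ~75% sparsity

scale = w.abs().max().item() / 127          # symmetric qint8 scale
wq = torch.quantize_per_tensor(w, scale, 0, torch.qint8)

print((w == 0).float().mean())              # ~0.75
print((wq.int_repr() == 0).float().mean())  # at least as large (small values also round to 0)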

Best regards

Thomas