Eager Quantization: How to pass int to a quantized model?


I want to quantize a model so that I can pass int8 values directly into it after quantization. However, the tutorials all seem to assume that I still pass fp32 input, which is then converted by a QuantStub, so I am not sure where to look for a better approach.

My code looks like this:

import torch
import torch.nn as nn

class ConvModel(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, num_classes=0):
        super().__init__()  # must run before submodules are registered
        self.c1 = nn.Conv2d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        x = self.c1(x)
        return x

# Quantize the model
# nn.Conv2d expects NCHW input: (batch, channels, height, width)
input_fp = torch.rand(1, input_channels, input_height, input_width)
model = ConvModel(input_channels, output_channels, kernel_size)
model.eval()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
model.qconfig = torch.ao.quantization.default_qconfig
torch.ao.quantization.prepare(model, inplace=True)

# Pseudo-calibration: run one fp32 batch so the observers record activation ranges
model(input_fp)

# Convert to quantized model
torch.ao.quantization.convert(model, inplace=True)

# Save the model.
torch.jit.save(torch.jit.script(model), "conv2d_model_scripted_quantized.pth")

# Generate expected output data
input_matrix = torch.randint(0, 128, (1, input_channels, input_height, input_width), dtype=torch.int8)
expected_output = model(input_matrix)

With this, I get the following error message:

NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, QuantizedCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

I want to perform full-integer quantization, because the hardware I want to deploy my model on only supports integer arithmetic.
If I add the QuantStub layer I can at least run the model, but in that case the model will assume that input, weight and bias are fp32. For my use case, I need all of these to be available as int8.

Hi @Necrotos

This is because you are passing in a regular Tensor where the converted model expects a quantized Tensor.

Can you try quantizing your input tensor ahead of time using one of these functions? I believe that should work.

* torch.quantize_per_tensor(x, scale, zero_point, dtype)
* torch.quantize_per_channel(x, scales, zero_points, axis, dtype)
* torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)
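
As a rough sketch of the first option, the snippet below rebuilds a small version of the model from the question and feeds it a tensor quantized with torch.quantize_per_tensor. The layer sizes, the 16x16 input, and the input scale of 1/127 with zero_point 0 are placeholder assumptions, not values from the original post; in practice they should describe the real input range.

```python
import torch
import torch.nn as nn

class ConvModel(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.c1 = nn.Conv2d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        return self.c1(x)

# Placeholder sizes, chosen only for illustration.
model = ConvModel(3, 8, 3).eval()
model.qconfig = torch.ao.quantization.default_qconfig
torch.ao.quantization.prepare(model, inplace=True)
model(torch.rand(1, 3, 16, 16))  # one calibration pass with fp32 data
torch.ao.quantization.convert(model, inplace=True)

# Quantize the input ahead of time instead of relying on a QuantStub.
# Assumed input range [0, 1]: scale 1/127 and zero_point 0 map it to 0..127.
x_fp = torch.rand(1, 3, 16, 16)
x_q = torch.quantize_per_tensor(x_fp, 1.0 / 127, 0, torch.quint8)

out_q = model(x_q)  # the quantized conv now receives a quantized Tensor
print(out_q.dtype, tuple(out_q.shape))
```

The activation dtype here is torch.quint8 because that is what the default qconfig uses for activations; the weights are stored as torch.qint8.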

Thanks! With that, I can get rid of the QuantStub layers.
I have two follow-up questions about this approach:

  1. How do I get the correct scale and zero_point values?
  2. Is there a way to get my model to emit torch.uint8 or torch.int8 instead of the quint8/qint8 variants?
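
One possible direction for both questions, offered as a sketch rather than a confirmed answer from this thread: an observer such as MinMaxObserver can derive scale and zero_point from representative data, and int_repr() on a quantized Tensor returns the underlying plain integer tensor. The calibration values below are made up for illustration.

```python
import torch
from torch.ao.quantization import MinMaxObserver

# Made-up fp32 values standing in for real calibration inputs.
calib = torch.tensor([0.0, 0.25, 0.5, 1.0, 2.0])

# 1. Let an observer derive scale and zero_point from the observed min/max.
observer = MinMaxObserver(dtype=torch.quint8)
observer(calib)
scale, zero_point = observer.calculate_qparams()
scale, zero_point = float(scale), int(zero_point)

# A quantized Tensor also carries its own parameters.
x_q = torch.quantize_per_tensor(calib, scale, zero_point, torch.quint8)
print(x_q.q_scale(), x_q.q_zero_point())

# 2. int_repr() exposes the stored integers as a plain tensor:
#    torch.uint8 for quint8, torch.int8 for qint8.
raw = x_q.int_repr()
print(raw.dtype)

# The real values are recovered as (raw - zero_point) * scale.
recovered = (raw.float() - zero_point) * scale
```

Calling int_repr() on the model output gives the same kind of plain integer tensor, so it can be applied as a final step outside the model.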