Is there an alternative way to do batched matrix multiplication on quantized tensors?

Hi,

I am trying to do post-training static quantization; however, I am running into issues because certain operations are not defined for the QuantizedCPUTensorId backend.

Minimal reproducible example:

>>> import torch
>>> 
>>> A = torch.Tensor([[2,2], [3,3]]).unsqueeze(0)
>>> B = torch.Tensor([[2,3], [2,3]]).unsqueeze(0)
>>> scale, zero_point, dtype = 1.0, 2, torch.qint8
>>> qA = torch.quantize_per_tensor(A, scale, zero_point, dtype)
>>> qB = torch.quantize_per_tensor(B, scale, zero_point, dtype)
>>> torch.matmul(A,B)
tensor([[[ 8., 12.],
         [12., 18.]]])
>>> torch.matmul(qA,qB)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Could not run 'aten::bmm' with arguments from the 'QuantizedCPUTensorId' backend. 'aten::bmm' is only available for these backends: [CPUTensorId, VariableTensorId].

Are there alternatives to accomplish the same thing?
I know certain operations are defined here: https://pytorch.org/docs/stable/quantization.html#floatfunctional, but what would be the optimal approach?


If possible, try using nn.Linear instead of aten::bmm.

Currently the only way is to implement the quantized operator for aten::bmm yourself.
One easy approach is to call the quantized::linear operator in a loop over the batch dimension. We will look into implementing this operator in the future.
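For illustration, here is a rough sketch of that loop-over-batch idea (not the actual planned operator). It assumes the activation operand is quint8 and the "weight" operand is qint8, as fbgemm's quantized linear expects, and that the second operand is already laid out as (batch, N, K) so that linear's implicit transpose gives the matmul you want. The function name and the explicit output scale/zero_point are just placeholders.

import torch
import torch.nn.quantized.functional as qF

def batched_quantized_matmul(qx, qw, out_scale, out_zero_point):
    # qx: (B, M, K) quint8 activations; qw: (B, N, K) qint8 "weights"
    outs = []
    for b in range(qx.shape[0]):
        # quantized linear computes x @ W.T, so each qw[b] acts as an (N, K) weight
        outs.append(qF.linear(qx[b], qw[b],
                              scale=out_scale, zero_point=out_zero_point))
    # stack the per-batch results; dequantize them first if your version
    # of torch.stack does not accept quantized tensors
    return torch.stack(outs)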

Hi @supriyar, thanks for the response.

Yes, I had thought about that, but wouldn't that be suboptimal? If there is no alternative, though, I guess it will have to do for now.

It seems https://pytorch.org/docs/stable/quantization.html#torch.nn.quantized.functional.linear is not a viable option either: it requires the input tensor to be unsigned (quint8), whereas this operation is explicitly between two tensors that are qint8.

>>> torch.nn.quantized.functional.linear(qA[0,], qB[0,])
RuntimeError: expected scalar type QUInt8 but found QInt8

Do you need both of the inputs to be qint8? If you change qA to quint8, it should work.
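For example, a minimal sketch along those lines, reusing A, B, scale, zero_point, and qB from the first post. Note that linear multiplies by the transpose of its weight, so this computes A[0] @ B[0].T; quantize B[0].t() instead if you need A[0] @ B[0].

>>> qA_u8 = torch.quantize_per_tensor(A, scale, zero_point, torch.quint8)  # activation as quint8
>>> torch.nn.quantized.functional.linear(qA_u8[0], qB[0], scale=scale, zero_point=zero_point)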

Any update on this one? Are you going to support it in the near future?

I’ve come up with the following code. As suggested, I perform the matrix multiplication with nn.Linear.

What do you think?

import torch
import torch.nn as nn

class BatchedMatMul(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.linear = nn.Linear(3, 3, bias=False)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, input1, input2):
        y = []
        for b in range(input1.shape[0]):
            print(f"Linear's type: {type(self.linear)}")
            print(f"Linear's weight type: {type(self.linear.weight)}")
            if isinstance(self.linear.weight, nn.Parameter):
                # FP32 / prepared model: overwrite the fp32 weight with input1[b]
                self.linear.weight.requires_grad = False
                self.linear.weight.copy_(self.quant(input1[b]))
                y.append(self.linear(self.quant(input2[b])))
            else:
                # Converted model: re-quantize input1[b] with the weight's
                # per-channel params and swap it in as the new weight
                scale = self.linear.weight().q_per_channel_scales()
                zero_point = self.linear.weight().q_per_channel_zero_points()
                w = torch.quantize_per_channel(input1[b], scale, zero_point, 1, torch.qint8)
                self.linear.set_weight_bias(w, b=None)
                y.append(self.linear(self.quant(input2[b])))

        return self.dequant(torch.stack(y))

print("Cronstruct model...")
matmul = BatchedMatMul()
print("Cronstruct model... [OK]")

matmul.eval()
print("Running FP32 inference...")
inp = torch.ones(3, 3).repeat(2,1,1)
y = matmul(2*inp, inp)
print("FP32 output...")
print(y)
print("Running FP32 inference... [OK]")

print("Quantizing...")
matmul.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
matmul_prepared = torch.quantization.prepare(matmul)
matmul_prepared(2*inp, inp)
model_int8 = torch.quantization.convert(matmul_prepared)
print("Quantizing... [OK]")
print("Running INT8 inference...")
y = model_int8.forward(2*inp, inp)
print("Int8 Output")
print(y)
print("Running INT8 inference..[OK]")

Output:

Construct model...
Construct model... [OK]
Running FP32 inference...
Linear's type: <class 'torch.nn.modules.linear.Linear'>
Linear's weight type: <class 'torch.nn.parameter.Parameter'>
Linear's type: <class 'torch.nn.modules.linear.Linear'>
Linear's weight type: <class 'torch.nn.parameter.Parameter'>
FP32 output...
tensor([[[6., 6., 6.],
         [6., 6., 6.],
         [6., 6., 6.]],

        [[6., 6., 6.],
         [6., 6., 6.],
         [6., 6., 6.]]])
Running FP32 inference... [OK]
Quantizing...
Linear's type: <class 'torch.nn.modules.linear.Linear'>
Linear's weight type: <class 'torch.nn.parameter.Parameter'>
Linear's type: <class 'torch.nn.modules.linear.Linear'>
Linear's weight type: <class 'torch.nn.parameter.Parameter'>
Quantizing... [OK]
Running INT8 inference...
Linear's type: <class 'torch.nn.quantized.modules.linear.Linear'>
Linear's weight type: <class 'method'>
Linear's type: <class 'torch.nn.quantized.modules.linear.Linear'>
Linear's weight type: <class 'method'>
Int8 Output
tensor([[[5.9695, 5.9695, 5.9695],
         [5.9695, 5.9695, 5.9695],
         [5.9695, 5.9695, 5.9695]],

        [[5.9695, 5.9695, 5.9695],
         [5.9695, 5.9695, 5.9695],
         [5.9695, 5.9695, 5.9695]]])
Running INT8 inference..[OK]
/usr/local/lib/python3.6/dist-packages/torch/quantization/observer.py:121: UserWarning: Please use quant_min and quant_max to specify the range for observers. reduce_range will be deprecated in a future release of PyTorch.

Currently we only support quint8 for activations and qint8 for weights, I think.

Currently we do not have plans to support bmm. One workaround is to put a DeQuantStub and QuantStub around the bmm op so it is skipped during quantization.
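A minimal sketch of that workaround (the module and names below are only illustrative): after prepare/convert, the stubs become real Quantize/DeQuantize modules, and torch.bmm itself runs on fp32 tensors in between.

import torch
import torch.nn as nn

class BmmSkipQuant(nn.Module):
    """Illustrative wrapper: run torch.bmm in fp32 inside a quantized model."""
    def __init__(self):
        super().__init__()
        self.dequant_a = torch.quantization.DeQuantStub()
        self.dequant_b = torch.quantization.DeQuantStub()
        self.quant_out = torch.quantization.QuantStub()

    def forward(self, qa, qb):
        # dequantize just before bmm so the regular fp32 kernel is used,
        # then re-quantize the result for any downstream quantized ops
        a = self.dequant_a(qa)
        b = self.dequant_b(qb)
        return self.quant_out(torch.bmm(a, b))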