How to approach quantizing torch.matmul/aten::bmm?

I have two quantized tensors:

In [14]: q.size()
Out[14]: torch.Size([64, 3, 49, 32])

In [15]: k.size()
Out[15]: torch.Size([64, 3, 49, 32])

I’m trying to run the following operation:

torch.matmul(q, k.transpose(-2, -1))

This yields the usual error:

RuntimeError: Could not run 'aten::bmm' with arguments from the 'QuantizedCPU' backend.
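
For reference, a self-contained repro (the scale and zero_point values here are made up; only the shapes match my tensors):

    import torch

    # made-up quantization parameters, only the shapes match the real tensors
    q = torch.quantize_per_tensor(torch.randn(64, 3, 49, 32), scale=0.1, zero_point=0, dtype=torch.quint8)
    k = torch.quantize_per_tensor(torch.randn(64, 3, 49, 32), scale=0.1, zero_point=0, dtype=torch.quint8)

    torch.matmul(q, k.transpose(-2, -1))  # raises: Could not run 'aten::bmm' ... 'QuantizedCPU'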

I’m aware that matmul apparently isn’t supported for quantized tensors. But is there a transformation, or another operation, that would let these activations/tensors stay quantized while achieving the same result as matmul? Ideally while retaining the overall performance benefit of quantization.

Also, is there a timetable for when matmul ops will be supported?

If you use FX graph mode quantization, quantizing torch.matmul is supported. It’s not supported in Eager mode quantization.
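
For illustration, a minimal sketch of the FX flow (untested; it assumes a recent PyTorch where get_default_qconfig_mapping and prepare_fx/convert_fx are available, and the toy Attention module just stands in for the real model):

    import torch
    from torch.ao.quantization import get_default_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

    class Attention(torch.nn.Module):
        # toy module containing the matmul that Eager mode cannot quantize
        def forward(self, q, k):
            return torch.matmul(q, k.transpose(-2, -1))

    model = Attention().eval()
    example_inputs = (torch.randn(64, 3, 49, 32), torch.randn(64, 3, 49, 32))

    prepared = prepare_fx(model, get_default_qconfig_mapping("fbgemm"), example_inputs)
    prepared(*example_inputs)         # calibration pass
    quantized = convert_fx(prepared)  # the matmul is handled as part of the traced graph

The exact prepare_fx/convert_fx signatures have changed between releases, so check the version you are on.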


Thanks for the info! FX graph mode quantization seemed really tough to set up for this model. Is there a reason why matmul isn’t supported in Eager mode? I’m currently researching how to add the support myself, but if it’s a big technical issue I’d consider FX again.

It’s possible to do this in Eager mode; it’s just not something we plan to land in the codebase, since the Eager mode workflow isn’t designed for handling functions and is better at handling modules. We have handling for some very common functions, but matmul is not one of them. If someone is interested in adding matmul to these workarounds, we’d be happy to accept a PR.

If you want to get it working for your model, you can follow what is done for functions such as torch.add. Specifically (a sketch of these steps is included after the list):

  1. Start with the FloatFunctional and QFunctional classes (pytorch/functional_modules.py at 2f04ba2c7c8920418ad77ebb1ab09d93374e6578 · pytorch/pytorch · GitHub).
  2. Extend the classes for your operation (in your case, add a method for matmul).
  3. Rewrite your model: instead of calling torch.matmul, create an instance of your new class and call float_functional.matmul(a, b) on it. Each matmul call site should have its own instance of the class, so that each one can collect statistics separately.
  4. Add your new classes to pytorch/quantization_mappings.py at 2f04ba2c7c8920418ad77ebb1ab09d93374e6578 · pytorch/pytorch · GitHub. This will tell the prepare and convert functions that your new classes exist.
  5. Run the quantization flow.
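
A rough, untested sketch of steps 1–4 (the class names MatmulFunctional and QMatmulFunctional are made up for illustration, the import path may be torch.ao.nn.quantized on newer releases, and the quantized version here simply dequantizes, runs the float matmul, and requantizes with the observed qparams):

    import torch
    from torch.nn.quantized import FloatFunctional, QFunctional

    class MatmulFunctional(FloatFunctional):
        # float/observed version: run matmul and feed the result through the observer
        def matmul(self, a, b):
            r = torch.matmul(a, b)
            r = self.activation_post_process(r)
            return r

    class QMatmulFunctional(QFunctional):
        # quantized version: dequantize -> matmul -> requantize with the observed qparams
        def matmul(self, a, b):
            r = torch.matmul(a.dequantize(), b.dequantize())
            return torch.quantize_per_tensor(r, self.scale, self.zero_point, torch.quint8)

        @classmethod
        def from_float(cls, mod):
            # read scale/zero_point from the observer attached during prepare
            scale, zero_point = mod.activation_post_process.calculate_qparams()
            new_mod = cls()
            new_mod.scale = float(scale)
            new_mod.zero_point = int(zero_point)
            return new_mod

In the model, each call site then creates self.matmul_op = MatmulFunctional() in __init__ and calls self.matmul_op.matmul(q, k.transpose(-2, -1)) in forward.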

Here is a test case which uses this logic, in case that is helpful: pytorch/test_quantize_eager_qat.py at 7088a98fba3a5031a2afc293cbf25cec09f248a5 · pytorch/pytorch · GitHub


To clarify, you don’t need to change the PyTorch source code for this; you just need to create some custom classes, tell prepare/convert about them, and rewrite your model to call them.
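
Concretely, the prepare/convert part could look roughly like this (untested sketch, reusing the illustrative MatmulFunctional/QMatmulFunctional from above; model and calibration_batch are placeholders):

    import torch
    from torch.ao.quantization import prepare, convert, get_default_qconfig
    from torch.ao.quantization.quantization_mappings import get_default_static_quant_module_mappings

    model.eval()                                  # model: your rewritten float model
    model.qconfig = get_default_qconfig("fbgemm")

    prepared = prepare(model)
    prepared(calibration_batch)                   # run representative data to collect statistics

    # extend the default module mapping so convert knows about the custom class
    mapping = get_default_static_quant_module_mappings()
    mapping[MatmulFunctional] = QMatmulFunctional

    quantized = convert(prepared, mapping=mapping)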


Wow, thanks! I kind of went through that workflow to add support for a quantized softmax. But I am confused: the bindings for quantized softmax were already accessible as torch.ops.quantized.softmax(x, self.dim, self.scale, self.zero_point), so I just had to instruct PyTorch to convert nn.Softmax into my extension of FloatFunctional.
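
On the converted side that ends up as something roughly like this (illustrative sketch; only the torch.ops.quantized.softmax call quoted above is the actual binding):

    import torch

    class QuantSoftmax(torch.nn.Module):
        # illustrative converted module wrapping the existing quantized softmax kernel
        def __init__(self, dim, scale, zero_point):
            super().__init__()
            self.dim = dim
            self.scale = scale
            self.zero_point = zero_point

        def forward(self, x):
            return torch.ops.quantized.softmax(x, self.dim, self.scale, self.zero_point)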

But as I understand it, no such bindings for matmul are exposed by ATen. I would actually have to add them to ATen myself and then rebuild PyTorch.

Am I mistaken?

The quantized version of torch.matmul was added in [PyTorch Edge] Add Quantized Matmul Op (Naive Implementation) by salilsdesai · Pull Request #71783 · pytorch/pytorch · GitHub; if you are on the latest version of PyTorch (1.12), you should be able to use it.
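
A hedged usage sketch, assuming the schema described in that PR (two quantized input tensors plus an output scale and zero point); verify the op is registered in your build before relying on it:

    import torch

    qa = torch.quantize_per_tensor(torch.randn(64, 3, 49, 32), 0.1, 0, torch.quint8)
    qb = torch.quantize_per_tensor(torch.randn(64, 3, 32, 49), 0.1, 0, torch.quint8)

    # assumed schema: quantized::matmul(qa, qb, output_scale, output_zero_point)
    out = torch.ops.quantized.matmul(qa, qb, 0.2, 0)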


Ah I missed that. But doesn’t it say:

Summary: Adds Quantized Matmul op which just naively performs dequantize → matmul → quantize. (To be optimized in the future)

So basically it’s kind of doing what I’m already doing? And if I want a truly quantized op, I’d still have to write it myself?
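
For reference, the manual fallback being compared here is essentially the following (scale and zero_point would come from an observer in practice):

    # dequantize -> float matmul -> requantize
    out = torch.quantize_per_tensor(
        torch.matmul(q.dequantize(), k.dequantize().transpose(-2, -1)),
        scale, zero_point, torch.quint8)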

I just linked the PR that added the original function; some improvements have landed since then. You can check the current state here: pytorch/qmatmul.cpp at master · pytorch/pytorch · GitHub. Whether the improvements are useful for you will depend on whether your backend supports the ruy library.


In the meantime I’ve discovered that it does, but now I’m trying to compile PyTorch with that flag enabled and can’t get the development Docker image built. I made a new thread about that: Can't build Pytorch using the Dockerfile from the repo - #3 by yannbane