Custom Quantization using PT2 -> q/dq representation

I am trying to run a post-training quantized linear layer using the PT2 quantization flow. However, the final quantized model produced by the convert_pt2e API leaves the linear matmul op in fp32 precision, with quant/dequant ops placed in succession in the graph.

E.g. the layer I am testing is nn.Linear(5, 10) -> nn.ReLU().
Quantization using the XNNPACK quantizer results in the following quantized model:

The pattern here is q->dq->linear_op(fp32)
How can we specify quantization to be q->linear_op(lower precision)->dq?
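For reference, here is a minimal sketch of the flow described above (assuming the PyTorch 2.1 PT2E APIs; the toy model and the single calibration batch are placeholders, not my exact code):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# toy model: Linear(5, 10) followed by ReLU
model = torch.nn.Sequential(torch.nn.Linear(5, 10), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 5),)

# capture the model as an FX graph for quantization
exported = capture_pre_autograd_graph(model, example_inputs)

# annotate with the XNNPACK quantizer and insert observers
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(exported, quantizer)

# calibrate with representative data
prepared(*example_inputs)

# convert to the q/dq representation
quantized = convert_pt2e(prepared)
print(quantized)  # linear still shows up as an fp32 op surrounded by q/dq nodes
```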

This is expected, see (prototype) PyTorch 2 Export Post Training Quantization — PyTorch Tutorials 2.1.0+cu121 documentation. You'll need to lower the model to the target backend in order to get a quantized linear op. Please stay tuned for the PyTorch Conference happening next week (PyTorch Conference | Linux Foundation Events); we have some announcements in PyTorch Edge that will show how to do this for XNNPACK.
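For readers landing here later, that lowering step might look roughly like the sketch below using the ExecuTorch XNNPACK delegate. This is only a sketch under the assumption of a recent ExecuTorch release; the module paths and entry points have changed across versions, so treat the imports and names as assumptions rather than the official recipe.

```python
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# `quantized` is the q/dq model produced by convert_pt2e in the question above
exported = torch.export.export(quantized, example_inputs)

# lower to the edge dialect and delegate the quantized linear pattern to XNNPACK
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())

# serialize an ExecuTorch program in which the delegated linear runs in int8
exec_prog = edge.to_executorch()
with open("linear_xnnpack.pte", "wb") as f:
    f.write(exec_prog.buffer)
```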
