Hello. I have a question about convert in torch.quantization.
For a model like this,
(l1): Linear(in_features=784, out_features=10, bias=True)
After QAT and convert, I got
(l1): QuantizedLinear(in_features=784, out_features=10, scale=0.5196203589439392, zero_point=78, qscheme=torch.per_channel_affine)
But I'm looking for a way to run evaluation on CUDA, so I need to convert the model back to the pre-QAT architecture, only with "quantized FP32" weights (i.e., FP32 values that already carry the quantization error), plus a custom forward hook to perform activation quantization. Can someone advise on the best way to achieve this? My understanding of the steps is below, but I'd like to make sure I'm not reinventing the wheel here:
- Write a new converter that rebuilds the pre-QAT model architecture and loads the quantized weights (dequantized back to FP32).
- Add a forward pre-hook that quantizes activations using the scale/zero_point from activation_post_process.
(should it be a forward pre-hook or a forward (post-)hook??)
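To make the question concrete, here is a minimal sketch of what I have in mind (not a ready-made PyTorch utility, and the scale/zero_point values are placeholders for those recorded in activation_post_process). `fp32_from_quantized` is a hypothetical helper name; the pre-hook uses `torch.fake_quantize_per_tensor_affine` to round activations through the quint8 grid while everything still computes in FP32 and can run on CUDA:

```python
import torch
import torch.nn as nn

def fp32_from_quantized(qlinear):
    # qlinear.weight() returns the qint8 tensor; .dequantize() gives
    # FP32 values that already carry the quantization error.
    m = nn.Linear(qlinear.in_features, qlinear.out_features)
    m.weight.data = qlinear.weight().dequantize()
    m.bias.data = qlinear.bias()
    return m

def activation_quant_prehook(scale, zero_point):
    # Runs before forward(): fake-quantizes the input through the
    # quint8 grid (0..255) so the Linear itself stays in FP32.
    def hook(module, inputs):
        return (torch.fake_quantize_per_tensor_affine(
            inputs[0], scale, zero_point, 0, 255),)
    return hook

# Stand-in for the rebuilt FP32 module; scale/zero_point are made up.
lin = nn.Linear(784, 10)
lin.register_forward_pre_hook(activation_quant_prehook(0.52, 78))
out = lin(torch.randn(2, 784))
print(out.shape)  # torch.Size([2, 10])
```

The sketch uses a pre-hook on the assumption that the activation observer's statistics should be applied to the module's input, but that assumption is exactly what I'd like confirmed.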
Any suggestions would be appreciated!