Performance of a Quantization-Aware Trained model without converting

Hello, this is my first post.

I wanted to know: if I perform quantization-aware training (QAT) and build the model with torch.nn.intrinsic modules, will the performance of the model drop even if I don't convert it to a quantized model?
Basically, I want to build a model that can achieve its best possible performance on a GPU and can be converted into a quantized model whenever needed.
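For context, here is roughly the workflow I have in mind (a minimal sketch using the eager-mode quantization APIs; the layer shapes and the `fbgemm` backend are just illustrative, not my actual model):

```python
import torch
import torch.ao.quantization as tq  # torch.quantization on older PyTorch versions


class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # marks where tensors enter the quantized region
        self.conv = torch.nn.Conv2d(3, 16, 3)
        self.bn = torch.nn.BatchNorm2d(16)
        self.relu = torch.nn.ReLU()
        self.dequant = tq.DeQuantStub()  # marks where tensors leave it

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)


model = Net().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")

# Fusing conv + bn + relu replaces them with a torch.nn.intrinsic module
# (ConvBnReLU2d), which prepare_qat then swaps for its QAT counterpart.
tq.fuse_modules_qat(model, [["conv", "bn", "relu"]], inplace=True)
tq.prepare_qat(model, inplace=True)

# ... train as usual (GPU is fine; everything still runs in FP32) ...

# Only when an int8 model is actually needed:
# int8_model = tq.convert(model.cpu().eval())
```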

Thanks in advance!

Hi @pritom-kun
Can you clarify what you mean by "performance" of the model here? Are you referring to the model numerics, or to the training-time overhead from enabling QAT on the model?

Regarding the numerics: the QAT step inserts fake-quantize modules that simulate int8 rounding during training, so the weights are learned under the assumption that the model will be quantized at a later stage. There may therefore be differences if you compare the result with a model trained purely in FP32.
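If it helps, here is a small sketch of what that means in practice (assuming the eager-mode API and a QAT-prepared model as in your sketch above; `images` is a placeholder input tensor):

```python
import copy
import torch.ao.quantization as tq

# `model` is the QAT-prepared model after training.
model.eval()

# Without conversion: the model still runs in FP32 (GPU-capable), but the
# inserted FakeQuantize modules round weights/activations to simulate int8,
# so the outputs already reflect quantized numerics rather than pure FP32.
# qat_fp32_out = model(images)

# Optionally, fake quantization can be switched off to get plain FP32 behavior:
# model.apply(tq.disable_fake_quant)

# Conversion swaps in real int8 kernels (CPU-only for the fbgemm backend);
# its numerics should closely match the fake-quantized FP32 run above.
int8_model = tq.convert(copy.deepcopy(model).cpu())
# int8_out = int8_model(images.cpu())
```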

That’s what I wanted to know; by performance I was referring to the prediction accuracy. Thanks for your answer!