I wanted to know: if I perform quantization aware training (QAT) and build the model with the torch.nn.intrinsic modules, will the performance of the model drop even if I don't convert it to a quantized model?
Basically, I want to build a model that can achieve its best possible performance on a GPU and can be converted into a quantized model whenever needed.
Hi @pritom-kun
Can you clarify what you mean by "performance of the model" here? Are you referring to model numerics, or to the training time overhead from enabling QAT on the model?
Regarding the numerics: the QAT step does alter the model weights, on the assumption that the model will be quantized at a later stage, so there may be differences if we compare the result with an FP32-trained model — even if you never run the final convert step.
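To make this concrete, here is a minimal sketch of the eager-mode QAT flow with a toy conv/bn/relu model (the model and shapes are made up for illustration). After `prepare_qat`, the model still runs in floating point (so it can train on a GPU), but fake-quantize modules are inserted, so its numerics already differ from a plain FP32 model; the conversion to a real int8 model is a separate, optional step:

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

# Toy model; QuantStub/DeQuantStub mark where tensors enter/leave the
# quantized region (required for eager-mode quantization).
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)

model = TinyNet().train()

# Fuse conv+bn+relu into a single torch.nn.intrinsic QAT module.
model = tq.fuse_modules_qat(model, [["conv", "bn", "relu"]])
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)

# ... the normal (GPU) training loop would run here, in floating point,
# but with fake-quantize ops simulating int8 rounding ...

x = torch.randn(1, 3, 8, 8)
out_float = model(x)  # still a float model, QAT numerics

# Converting to a true int8 model is optional and can happen later:
model.eval()
quantized = tq.convert(model)
out_int8 = quantized(x)
```

So the prepared-but-not-converted model is what you would deploy on GPU: it trains and runs in float, but its weights have adapted to the quantization constraints, which is where any gap versus a pure FP32 baseline comes from.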