I am working on QAT (quantization-aware training) for my model.
I compared two training cases: training an FP32 model vs. QAT starting from the FP32 model.
What I observed is that the time per epoch is similar in both cases.
But when I compare how fast the loss decreases, QAT is extremely slow.
For example, to get from 5.0 down to 1.5 (just an example), FP32 training needs only 50 epochs. QAT has already taken 6k epochs to get from 5.0 to 3.5, and the loss seems to be decreasing more and more slowly.
BTW, the learning rate and optimizer are the same for both training runs.
This might be expected, because the gradient is computed for a “quantized approximation” of the current parameters rather than for the parameters themselves.
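To make that point concrete, here is a minimal pure-Python sketch, not PyTorch's actual QAT implementation. It assumes a single scalar weight, a squared-error loss, a uniform grid with a hypothetical `scale=0.1`, and a straight-through estimator (the rounding is treated as identity in the backward pass). The names `fake_quantize`, `loss_grad_fp32`, and `loss_grad_qat` are made up for illustration.

```python
def fake_quantize(w, scale=0.1, qmin=-128, qmax=127):
    """Snap w to the nearest point of a uniform grid, then map back to float."""
    q = round(w / scale)            # integer grid index
    q = max(qmin, min(qmax, q))     # clamp to the representable range
    return q * scale


def loss_grad_fp32(w, x, y):
    """Gradient of (w*x - y)**2 with respect to w, evaluated at w itself."""
    return 2 * (w * x - y) * x


def loss_grad_qat(w, x, y, scale=0.1):
    """Same loss, but the forward pass sees the quantized weight.

    Assumption: straight-through estimator, so d(fake_quantize)/dw is
    treated as 1 and the gradient is simply evaluated at the quantized w.
    """
    w_q = fake_quantize(w, scale)
    return 2 * (w_q * x - y) * x


w, x, y = 0.53, 1.0, 0.5
print(loss_grad_fp32(w, x, y))  # about 0.06: still pulls w toward 0.5
print(loss_grad_qat(w, x, y))   # 0.0: the quantized weight already sits at 0.5
```

In this toy case the FP32 gradient still carries a signal, while the QAT gradient does not, because every `w` that rounds to the same grid point produces the same gradient. That is one plausible reason the loss moves more slowly, though it is only an illustration and not a diagnosis of the model in the question.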
One thing you could do is first train for a bit in FP32 and then switch to QAT.
You should not really lose much by searching in FP32 for where to start the QAT.
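A rough sketch of that two-phase schedule, again as a pure-Python toy rather than a real QAT setup. The one-weight linear model, the data, the epoch counts, the learning rate, and the helper names `fake_quantize` and `train` are all assumptions chosen for illustration; only the starting loss region (a weight far from the optimum) mirrors the situation described above.

```python
def fake_quantize(w, scale=0.1):
    """Snap w to the nearest multiple of scale (hypothetical grid)."""
    return round(w / scale) * scale


def train(w, data, epochs, lr, quantize):
    """Plain gradient descent on mean squared error for y ~ w * x.

    With quantize=True the forward pass uses the fake-quantized weight and
    the gradient is passed straight through to the float weight.
    """
    for _ in range(epochs):
        grad = 0.0
        for x, y in data:
            w_fwd = fake_quantize(w) if quantize else w
            grad += 2 * (w_fwd * x - y) * x
        w -= lr * grad / len(data)
    return w


# Toy data generated by a weight of 0.73, which is not on the 0.1 grid.
data = [(x, 0.73 * x) for x in (1.0, 2.0, 3.0)]

# Phase 1: FP32 warm-up from a poor starting point.
w_fp32 = train(5.0, data, epochs=50, lr=0.05, quantize=False)

# Phase 2: short QAT fine-tune starting from the FP32 result.
w_qat = train(w_fp32, data, epochs=20, lr=0.05, quantize=True)

print(w_fp32, fake_quantize(w_qat))
```

The idea is that the FP32 phase does the long-distance search, and the QAT phase only has to adjust the weight to the grid. In a real framework the second phase would start from the FP32 checkpoint, typically with the fake-quantization modules inserted at that point.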