QAT training loss decreases very slowly

bigtree · September 23, 2021, 5:34pm

I am working on quantization-aware training (QAT) for my model.
I compared two training cases: training an FP32 model vs. QAT based on the FP32 model.
What I observed is that the time per epoch is similar in both cases.
But when comparing how fast the loss decreases, I found that QAT is extremely slow.
For example, to get from 5.0 down to 1.5 (just an example), FP32 training needs only 50 epochs. With QAT, going from 5.0 to 3.5 has already taken 6k epochs, and the loss seems to be decreasing more and more slowly.

BTW, the learning rate and optimizer are the same for both training runs.

Is this expected?
Thanks.

tom · September 23, 2021, 7:29pm

This might be expected, because the gradient is computed for a “quantized approximation” of the current parameters rather than for the parameters themselves.
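
To make the point about the gradient concrete, here is a minimal pure-Python sketch. It is a toy, not PyTorch's actual fake-quantize implementation: the single weight, the step size `SCALE`, the data `XS`, and `TARGET_W` are all made-up assumptions. The forward pass uses the fake-quantized weight, and the gradient is passed straight through to the latent FP32 weight (a straight-through estimator), so two different weights that round to the same grid point receive exactly the same gradient.

```python
# Toy illustration (not PyTorch internals): fake quantization with a
# straight-through estimator (STE) on a single weight.

SCALE = 0.25          # assumed quantization step, deliberately coarse
XS = [1.0, 2.0, 3.0]  # toy inputs
TARGET_W = 0.73       # toy "true" weight, so the targets are TARGET_W * x


def fake_quant(w, scale=SCALE):
    """Snap w to the nearest multiple of scale (quantize, then dequantize)."""
    return round(w / scale) * scale


def loss_and_grad(w_used):
    """Mean squared error and its gradient, both evaluated at w_used."""
    n = len(XS)
    loss = sum((w_used * x - TARGET_W * x) ** 2 for x in XS) / n
    grad = sum(2.0 * (w_used * x - TARGET_W * x) * x for x in XS) / n
    return loss, grad


def fp32_grad(w):
    """Gradient evaluated at the real-valued weight."""
    return loss_and_grad(w)[1]


def qat_grad(w):
    """Gradient evaluated at the quantized weight, then handed unchanged
    to the latent FP32 weight (straight-through estimator)."""
    return loss_and_grad(fake_quant(w))[1]


for w in (0.20, 0.30):
    print(f"w={w:.2f}  fp32 grad={fp32_grad(w):+.3f}  qat grad={qat_grad(w):+.3f}")
```

Both 0.20 and 0.30 quantize to 0.25 here, so the QAT gradient cannot tell them apart, while the FP32 gradient can. That loss of information is one plausible reason for slower progress.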
One thing you could do is first train for a bit in FP32 and then do some QAT.
You should not really lose much by looking in FP32 for where to start the QAT.
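
The suggestion above can be sketched on a toy problem. Everything here is an assumption for illustration (one weight, a made-up step size `SCALE`, toy data, plain gradient descent); it is not the original poster's model. In PyTorch eager-mode quantization, the second phase would roughly correspond to calling `torch.ao.quantization.prepare_qat` on the already trained FP32 model and continuing training.

```python
# Toy comparison (assumed setup): QAT from scratch versus QAT started
# from an FP32-trained weight.

SCALE = 0.25          # assumed quantization step
XS = [1.0, 2.0, 3.0]  # toy inputs
TARGET_W = 0.73       # toy "true" weight
LR = 0.01             # same learning rate for both phases


def fake_quant(w):
    """Snap w to the nearest multiple of SCALE."""
    return round(w / SCALE) * SCALE


def loss(w_used):
    """Mean squared error of the prediction w_used * x against TARGET_W * x."""
    return sum((w_used * x - TARGET_W * x) ** 2 for x in XS) / len(XS)


def grad(w_used):
    """Gradient of the loss, evaluated at w_used."""
    return sum(2.0 * (w_used - TARGET_W) * x * x for x in XS) / len(XS)


def train(w, steps, qat):
    """Plain gradient descent on the latent weight w.

    With qat=True the loss gradient is evaluated at fake_quant(w) and
    applied to w unchanged (straight-through estimator).
    """
    for _ in range(steps):
        w_used = fake_quant(w) if qat else w
        w -= LR * grad(w_used)
    return w


w_scratch = 0.0                            # QAT would start here without pretraining
w_warm = train(0.0, steps=200, qat=False)  # phase 1: FP32 training
w_qat = train(w_warm, steps=20, qat=True)  # phase 2: short QAT fine-tune

print("QAT starting loss, from scratch:   ", loss(fake_quant(w_scratch)))
print("QAT starting loss, FP32 warm start:", loss(fake_quant(w_warm)))
print("quantized weight after fine-tune:  ", fake_quant(w_qat))
```

In this toy, the FP32 phase already lands next to a good grid point, so the QAT phase starts from a much lower loss than it would from scratch.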

Best regards

Thomas

bigtree · September 23, 2021, 9:18pm

Thanks, @tom.
Yes. If there is nothing else to try, I think that will be the only way to speed up my QAT training.

yyd199948 · May 15, 2023, 11:23am

How did you solve it? I have the same problem: after 1 epoch of training, the float32 model's loss gets down to -9, but after 20 epochs the QAT model's loss is only down to -4.