Loss suddenly increases using Adam optimizer

zhangboknight · December 19, 2017, 12:02pm

Hi, I came across a problem when using the Adam optimizer. At the start of training, the loss decreases as expected. But after 3300 iterations, the loss suddenly explodes to a very large number (~1e3). I tried several times, but the same problem occurs each time. How can I solve this issue? Thanks!

[Attached image: Capture]

munkiti · December 19, 2017, 12:46pm

Did you restart at the 3300th iteration, or did you run it straight through? I think you need to give more info. What problem are you working on? What's the step size?

In any case, the problem with Adam is that it uses a moving average of squared gradients in the denominator. If the gradients stay very small for a while, the whole denominator becomes small. When a larger gradient then arrives, dividing by that small denominator blows up the step, pushing you very far away, hence the huge loss. You may have a look at https://openreview.net/forum?id=ryQu7f-RZ . I think there are several recent methods that avert this problem, including AMSGrad (from the paper just mentioned), hypergradient descent (for Adam), etc. Also look at the comments in the OpenReview forum; there seems to be further discussion of this issue. Unfortunately I do not know of any PyTorch implementation of these algorithms.
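To make the denominator argument concrete, here is a minimal pure-Python sketch for a single scalar parameter. It is a simplified illustration, not the PyTorch implementation: the gradient sequence is made up, the helper name `update_sizes` is hypothetical, and the AMSGrad branch takes the running maximum of the bias-corrected second moment, which may differ in detail from what a given library does.

```python
import math

def update_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """Return |step| at every iteration for one scalar parameter (illustrative only)."""
    m = v = v_max = 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g      # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        if amsgrad:
            v_max = max(v_max, v_hat)            # the denominator is never allowed to shrink
            denom = math.sqrt(v_max) + eps
        else:
            denom = math.sqrt(v_hat) + eps       # the denominator decays when gradients are small
        sizes.append(abs(lr * m_hat / denom))
    return sizes

# Made-up gradient history: large at first, tiny for a long time, then one large gradient.
grads = [1.0] * 100 + [1e-4] * 5000 + [1.0]

adam_last = update_sizes(grads, amsgrad=False)[-1]
ams_last = update_sizes(grads, amsgrad=True)[-1]
print(f"Adam step on the spike:    {adam_last:.2e}")
print(f"AMSGrad step on the spike: {ams_last:.2e}")
```

In this toy run the plain Adam step on the final gradient comes out larger than the learning rate, while the AMSGrad step stays well below it, because AMSGrad still remembers the large second moment from the early iterations.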

zhangboknight · December 19, 2017, 2:22pm

Thanks a lot for your detailed reply, Munkiti. I ran my training straight through without any restart. The learning rate for Adam is 1e-3. The network is a typical ResNet structure.

I will check whether the problem comes from the small denominator with Adam. I will post it when I find a solution.

zhangboknight · December 20, 2017, 6:50am

As suggested, I replaced the Adam optimizer with AMSGrad. The problem is solved ^^ It indeed comes from a stability issue in Adam itself.

To implement it, I reinstalled PyTorch from source; in version 0.4.0, I can simply use AMSGrad with:
optimizer = optim.Adam(model.parameters(), lr=0.001, eps=1e-3, amsgrad=True)

Thanks very much for your help!

Rakshit_Kothari · November 26, 2018, 10:28pm

AMSGrad really does improve the training loss curve, and training seems to progress for more epochs, but after a certain number of epochs even AMSGrad tends to increase the training loss.

yj_z · January 13, 2019, 7:10am

Did you solve this problem?

Rakshit_Kothari · January 15, 2019, 5:26pm

Well, kind of. As the training performance improves, I linearly reduce the learning rate (the learning rate at perfect performance is 1/10th of the original LR). This significantly reduces the tendency to overshoot.
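A rough sketch of one way such a schedule could look, assuming "performance" is a score in [0, 1] (for example accuracy), which the post does not specify. The helper names `lr_for_performance` and `set_lr` are hypothetical; `set_lr` relies only on the `param_groups` list that PyTorch optimizers expose.

```python
def lr_for_performance(base_lr, performance, final_fraction=0.1):
    """Linearly scale the LR from base_lr (performance 0) down to
    final_fraction * base_lr (performance 1). The [0, 1] score is an assumption."""
    p = min(max(performance, 0.0), 1.0)  # clamp to [0, 1]
    return base_lr * (1.0 - (1.0 - final_fraction) * p)

def set_lr(optimizer, lr):
    """Write the new learning rate into every parameter group of the optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr

# Example values: the LR shrinks as the score improves.
for score in (0.0, 0.5, 1.0):
    print(score, lr_for_performance(1e-3, score))
```

In a training loop, one would call `set_lr(optimizer, lr_for_performance(base_lr, score))` after each evaluation.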

ryo · January 27, 2019, 11:08am

Have you tried just simply clipping the gradient?

Rakshit_Kothari · April 23, 2019, 8:54pm

Yes, I use a combination of gradient clipping and batch normalization, which has pretty much ensured that this never occurs again.
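For anyone landing here later, a minimal sketch of what that combination might look like in PyTorch. The tiny model, the random data, and `max_norm = 1.0` are placeholders chosen for illustration, not the setup used in this thread; clipping here uses `torch.nn.utils.clip_grad_norm_`, which rescales the gradients so that their total norm does not exceed `max_norm`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model with a batch normalization layer (not the poster's network).
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
max_norm = 1.0  # assumed clipping threshold; tune for your problem

# Random data with large targets so that the raw gradients are large.
x = torch.randn(32, 8)
y = 100.0 * torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip after backward() and before step(); returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(f"grad norm before clipping: {float(norm_before):.3f}")
print(f"grad norm after clipping:  {float(norm_after):.3f}")
```

The key point is the ordering: clipping has to happen between `loss.backward()` and `optimizer.step()`.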