Strange loss increase when restarting training

Hello,

I have quite an odd problem with my loss during training. I have a function train(n_epochs, batch_size) that performs training for a given number of epochs and a given batch_size. The problem is that when I restart training, the loss suddenly increases.

For example, in the following figure we can see that the loss suddenly increases around 30,000 iterations, which corresponds to the second training run I started. At first I thought this was due to the running means of the batch normalization layers, but resetting them does not seem to change anything.
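
For reference, resetting the running statistics looked roughly like this (a minimal sketch, assuming a standard nn.Module model; the actual model definition is omitted):

```python
import torch.nn as nn

def reset_bn_running_stats(model: nn.Module) -> None:
    # Walk every submodule and reset the running mean/var of each batch norm layer.
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
```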

Thanks in advance,
Gabriel

Could it be your learning rate? If it decays based on the number of iterations, make sure that the iteration count is correct when training restarts.
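
For example, with a hypothetical iteration-based annealing schedule (the function and numbers are purely illustrative):

```python
# Hypothetical inverse-decay annealing schedule, just to illustrate the point.
def annealed_lr(base_lr, iteration, decay=1e-4):
    return base_lr / (1.0 + decay * iteration)

# If train() restarts its own iteration counter at 0, the learning rate jumps back up:
print(annealed_lr(0.025, 0))       # 0.025    -> what a fresh restart would use
print(annealed_lr(0.025, 30000))   # 0.00625  -> what it should be after 30k iterations
```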


Yes, it could; I'm using annealing to decrease it, but in any case I'm restarting with the same learning rate that the last training ended with. Thanks anyway!

Make sure that you are saving your optimizer before interrupting training and load it before restarting training.
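
Something along these lines (a minimal sketch; the model, hyperparameters, and file name are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)                                 # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Before interrupting training: persist the optimizer state together with the model.
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}, 'checkpoint.pt')

# Before restarting training: rebuild model/optimizer, then load both state dicts.
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```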


I’ve just tried saving the state of the optimizer and reloading it, and I’m still getting the same thing…

Do you shuffle your training data? It might be bad samples at the beginning of your dataset that affect the restart. Do you observe a similar pattern if you restart multiple times?


Yes, I do shuffle the data, but with replacement… I'm doing a word2vec kind of thing, so I'm drawing words with probabilities that vary with how often each word appears in my dataset. So the distribution should always be the same.
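
To make that concrete, the sampling is roughly like this (a simplified sketch; the counts and the 3/4-power smoothing are the usual word2vec-style illustration, not my exact code):

```python
import torch

word_counts = torch.tensor([523., 120., 87., 45., 12.])  # hypothetical corpus frequencies

# word2vec-style smoothed unigram distribution (3/4 power, then normalize).
probs = word_counts.pow(0.75)
probs = probs / probs.sum()

# Sample a batch of word indices with replacement; the underlying distribution
# is fixed by the corpus counts, so it is identical for every (re)started run.
batch = torch.multinomial(probs, num_samples=32, replacement=True)
```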

I’ve just run two trainings, saving all the optimizer parameters (the state dict and the learning rate), and I got this:

[Figure: loss curves from the two runs with the optimizer state saved and reloaded]

Without knowing exactly what your code does, I can think of one other possible reason:

  • If you are using the Adam optimizer, the learning rate you set from your script might mess things up, because Adam effectively scales its step size by the square root of its running second-moment estimates. So at the restart, even though reloading the optimizer state dict resumes training, I am guessing you set your annealed learning rate again. That might cause your problem (see the sketch below).
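
Roughly what I mean, as a sketch with placeholder values (the model and annealed_lr are hypothetical):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)                               # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

saved_state = optimizer.state_dict()                     # stands in for a checkpoint on disk

# On restart, the loaded state already carries the learning rate and Adam's
# running moment estimates, so nothing else needs to be set:
optimizer.load_state_dict(saved_state)

# The step being warned about: re-applying an externally computed annealed
# learning rate on top of the restored state changes the effective step size
# relative to the restored moments and can produce a jump in the loss.
annealed_lr = 2.5e-4                                     # hypothetical value from an annealing schedule
for group in optimizer.param_groups:
    group['lr'] = annealed_lr
```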

I thought about that (that the problem was coming from Adam's adaptation of the learning rate) and changed the optimizer to SGD, but I still got the same result.

I see. I still suspect a problem with the data. Maybe you can try an official example and see if it's a problem with PyTorch itself. If not, then you can probably debug your data.

Yes, I think that’s the most likely explanation. I’ll keep trying! Thanks!


Hello,

did you ever solve this issue?

Thank you
