I have been stuck on this weird phenomenon for a really long time and have read through many discussions, but I haven't found a solution yet.
So I have trained several models on Colab Pro with a Tesla V100 GPU. The loss generally decreased in the first few epochs, but after that it sometimes started to go up. Naturally, I decided to restart the training by resuming from the checkpoint every time I saw the loss increasing (and I'm confident all parameters from the models and optimizers were properly saved and loaded), and then, magically, the loss went down after I did this. This is almost always the case.
I am working on multiple object tracking, so it is a multi-task model, and I used an uncertainty-weighted loss; both s_det and s_id are trained, saved, and loaded. I also used Adam as my optimizer. Here are the relevant parts of my code:
# In __init__: learnable uncertainty weights for the two tasks
self.s_det = nn.Parameter(-1.85 * torch.ones(1))
self.s_id = nn.Parameter(-1.05 * torch.ones(1))

# In forward: uncertainty-weighted combination of detection and re-ID losses
if opt.multi_loss == 'uncertainty':
    loss = torch.exp(-self.s_det) * det_loss + torch.exp(-self.s_id) * id_loss + (self.s_det + self.s_id)
    loss *= 0.5
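For reference, my checkpointing follows roughly the pattern below. This is only a minimal sketch of what I mean by "saved and loaded"; the names save_checkpoint, load_checkpoint, loss_fn, and path are placeholders rather than my exact code. The point is that s_det and s_id live inside the loss module, so they are stored and restored together with the model weights and the Adam state.

import torch

# Minimal sketch of the save/resume logic (names are placeholders).
def save_checkpoint(path, epoch, model, loss_fn, optimizer):
    torch.save({
        'epoch': epoch,
        'model': model.state_dict(),
        'loss_fn': loss_fn.state_dict(),      # includes s_det and s_id
        'optimizer': optimizer.state_dict(),  # includes Adam's moment estimates
    }, path)

def load_checkpoint(path, model, loss_fn, optimizer):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model'])
    loss_fn.load_state_dict(ckpt['loss_fn'])
    optimizer.load_state_dict(ckpt['optimizer'])
    return ckpt['epoch']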
I am just wondering if anyone knows what is happening, and why resuming the training can somehow magically 'force' the loss to decrease even though it was initially increasing? Many thanks!