Training loss changes after resuming

I have been stuck on this weird phenomenon for a really long time and have read through many discussions, but I have not yet found a solution.

I have trained several models on Colab Pro with a Tesla V100 GPU. The loss generally decreased in the first few epochs, but after that it sometimes started to go up. Naturally, I decided to restart the training by resuming from the checkpoint every time I saw the loss increasing (and I’m confident all parameters from the models and optimizers were properly saved and loaded), and then, magically, the loss went down after I did so. This is almost always the case.

I am working on multi-object tracking, so it is a multi-task model, and I used an uncertainty loss; both s_det and s_id are trained, saved, and loaded. I also used Adam as my optimizer.

        # In __init__: learnable uncertainty (log-variance) terms for the two tasks
        self.s_det = nn.Parameter(-1.85 * torch.ones(1))
        self.s_id = nn.Parameter(-1.05 * torch.ones(1))

        # In the loss computation: uncertainty-weighted sum of the detection and ID losses
        if opt.multi_loss == 'uncertainty':
            loss = torch.exp(-self.s_det) * det_loss + torch.exp(-self.s_id) * id_loss + (self.s_det + self.s_id)
            loss *= 0.5
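
Roughly, the save/resume logic looks like the sketch below (names such as `save_checkpoint`, `load_checkpoint`, and `loss_module` are placeholders rather than my exact code, but the idea is that the model weights, s_det/s_id, and the Adam state are all stored and restored together):

    import torch

    # Minimal checkpointing sketch (function and key names are placeholders)
    def save_checkpoint(path, epoch, model, loss_module, optimizer):
        torch.save({
            'epoch': epoch,
            'model': model.state_dict(),
            'loss': loss_module.state_dict(),      # contains s_det and s_id
            'optimizer': optimizer.state_dict(),   # Adam running moments
        }, path)

    def load_checkpoint(path, model, loss_module, optimizer):
        ckpt = torch.load(path, map_location='cpu')
        model.load_state_dict(ckpt['model'])
        loss_module.load_state_dict(ckpt['loss'])
        optimizer.load_state_dict(ckpt['optimizer'])
        return ckpt['epoch'] + 1  # epoch to resume from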

I am just wondering if anyone knows what happened, and why resuming the training can somehow magically ‘force’ the loss to decrease even if it was initially increasing. Many thanks!

I don’t think resuming the training from a specific epoch forces the loss to decrease; it sounds more as if your model diverges non-deterministically. I.e., I assume you don’t expect to see bitwise-identical outputs after resuming the training compared to the initial training run.
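
If you want to narrow down the randomness between runs, a rough seeding sketch (standard PyTorch/NumPy calls) would look like the following; note that some CUDA kernels stay non-deterministic even with these flags, so bitwise reproducibility on a V100 is not guaranteed:

    import random
    import numpy as np
    import torch

    def seed_everything(seed: int = 0) -> None:
        # Seed the Python, NumPy, and PyTorch (CPU + all GPUs) RNGs
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some speed for deterministic cuDNN kernels
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False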

I do agree with you that simply resuming from the checkpoint should not be the reason the loss was decreasing. It didn’t make sense to me either :rofl:. However, the fact that the loss went down after resuming tells me the model somehow started to perform better, at least on the training set, right? Are you suggesting the model would only perform better on the training set, but not on the test set, if I kept pausing and resuming the training?