When I train my network, the training broke at epoch 156, so I used --resume to continue training (i.e. to load the last checkpoint).
But I noticed one phenomenon: the training loss rose from 0.0870 to 0.1321, and it took many epochs to decrease to 0.08 again.
I also ran into this problem in another training run (after using --resume to load the last checkpoint, the training loss rose from 0.1557 to 0.1999, the validation loss rose from 0.2488 to 0.2657, and the IoU dropped from 0.6419 to 0.6009):
It depends on how you've implemented your 'resume' logic. From your description I assume you are just loading the state_dict and starting the training with a new optimizer.
Using an 'adaptive' optimizer might worsen your accuracy, since the 'old' optimizer had some internal states, momentum etc., while the new one will have a cold start.
You could try to save the optimizer's state_dict as well.
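For example, a minimal save/restore sketch (variable names such as savedir, epoch, and best_acc are assumptions here) could look like this:

checkpoint = {
    'epoch': epoch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),  # keeps momentum buffers / Adam running statistics
    'best_acc': best_acc,
}
torch.save(checkpoint, savedir + '/checkpoint.pth.tar')

# When resuming, restore both the model and the optimizer,
# so the optimizer does not start from a cold state:
checkpoint = torch.load(savedir + '/checkpoint.pth.tar')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])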
However, why did the training break in the first place? Could you post the error message and some information about your setup?
if args.resume:
    # Pick the checkpoint file depending on whether only the encoder is being trained
    if enc:
        filenameCheckpoint = savedir + '/checkpoint_enc.pth.tar'
    else:
        filenameCheckpoint = savedir + '/checkpoint.pth.tar'

    assert os.path.exists(filenameCheckpoint), "Error: resume option was used but checkpoint was not found in folder"
    checkpoint = torch.load(filenameCheckpoint)
    start_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])  # restores momentum buffers etc.
    best_acc = checkpoint['best_acc']
    print("=> Loaded checkpoint at epoch {}".format(checkpoint['epoch']))
As the code shows, I already save and load the optimizer's state_dict, so I still don't know how to solve this problem.
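One quick way to verify that the optimizer state really ended up in the saved file (a sketch using the filename from the snippet above) is to load the checkpoint and inspect its contents:

checkpoint = torch.load(savedir + '/checkpoint.pth.tar')
print(checkpoint.keys())                      # expect 'epoch', 'state_dict', 'optimizer', 'best_acc'
print(len(checkpoint['optimizer']['state']))  # > 0 if the momentum/Adam buffers were actually stored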
About the training break: it is normal for me, because I use a free machine from my school and it gets interrupted every 10 hours. I know it hurts, but I can't do anything about that.
I created a small dummy example and cannot recreate the issue.
I wanted to test whether layers like Dropout and BatchNorm could possibly change something, but at least in the same terminal with a fixed seed the model returns the same loss values.
Still, it could be an issue with the random number generator, although I cannot explain why the loss would spike that much.
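If you want to rule that out, one option (a sketch, not something your current code does) is to store and restore the RNG states in the checkpoint as well:

import random
import numpy as np
import torch

# When saving, additionally store the RNG states (the 'rng_states' key is made up here):
checkpoint['rng_states'] = {
    'python': random.getstate(),
    'numpy': np.random.get_state(),
    'torch': torch.get_rng_state(),
    'cuda': torch.cuda.get_rng_state_all(),
}

# When resuming, restore them before continuing the training loop:
rng = checkpoint['rng_states']
random.setstate(rng['python'])
np.random.set_state(rng['numpy'])
torch.set_rng_state(rng['torch'])
torch.cuda.set_rng_state_all(rng['cuda'])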
Could you try to compare the predictions after calling model.eval()?
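A rough way to do that (a sketch; fixed_batch is an assumed, fixed input tensor reused for both runs) would be to store the outputs once before the checkpoint is written and compare them after resuming:

# Before saving the checkpoint:
model.eval()  # disables Dropout and uses the BatchNorm running statistics
with torch.no_grad():
    out_before = model(fixed_batch)
torch.save(out_before, 'out_before.pt')

# After resuming and loading the checkpoint:
model.eval()
with torch.no_grad():
    out_after = model(fixed_batch)
print(torch.allclose(torch.load('out_before.pt'), out_after))  # True if the state was fully restored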