Are there any problems with using resume in PyTorch?

When I train my network, the training broke at the 156th epoch, so I used --resume to continue training (loading the last checkpoint).
But I noticed one phenomenon: the training loss rises from 0.0870 to 0.1321, and it takes many epochs to decrease to 0.08 again.

I also met this problem in another training run (after using --resume to load the last checkpoint, the training loss rose from 0.1557 to 0.1999, the val loss rose from 0.2488 to 0.2657, and the IoU decreased from 0.6419 to 0.6009):


So does using --resume to continue training reduce accuracy? Is this normal?
Thank you.

It depends on how you've implemented your "resume" logic. From your description I assume you are just loading the state_dict and starting the training with a new optimizer.
Using an "adaptive" optimizer might worsen your accuracy, since the "old" optimizer had some internal state, momentum etc., while the new one will have a cold start.
You could try to save the optimizer's state_dict as well.
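
A minimal sketch of what saving and restoring both state_dicts could look like (toy model and a placeholder filename, not your actual training code):

import torch
import torch.nn as nn
from torch.optim import Adam

# toy model and optimizer just to keep the sketch self-contained
model = nn.Linear(10, 2)
optimizer = Adam(model.parameters(), lr=5e-4)
epoch = 156

# save the optimizer state next to the model weights
torch.save({
    'epoch': epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}, 'checkpoint.pth.tar')

# resume: restore both, so Adam's running averages survive the restart
checkpoint = torch.load('checkpoint.pth.tar')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch']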

However, why did the training break in the first place? Could you post the error and some information?

Hi, here is the code I use for resuming:

save_checkpoint({
    'epoch': epoch + 1,
    'arch': str(model),
    'state_dict': model.state_dict(),
    'best_acc': best_acc,
    'optimizer': optimizer.state_dict(),
}, is_best, filenameCheckpoint, filenameBest)

if args.resume:
    if enc:
        filenameCheckpoint = savedir + '/checkpoint_enc.pth.tar'
    else:
        filenameCheckpoint = savedir + '/checkpoint.pth.tar'

    assert os.path.exists(filenameCheckpoint), "Error: resume option was used but checkpoint was not found in folder"
    checkpoint = torch.load(filenameCheckpoint)
    start_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    best_acc = checkpoint['best_acc']
    print("=> Loaded checkpoint at epoch {})".format(checkpoint['epoch']))

As you can see in the code, I do save the optimizer's state_dict, so I still don't know how to solve this problem.

About the training break: it is normal for me, because I use a free machine from my school and it shuts down every 10 hours. I know it hurts, but I can't do anything about that.

Thank you.

Looks perfectly fine to me. Just like the ImageNet example.

Which optimizer are you using? I would like to check it with a small dummy example.

Hi ptrblck,
my optimizer is:
optimizer = Adam(model.parameters(), 5e-4, (0.9, 0.999), eps=1e-08, weight_decay=1e-4)
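
As a side note, the internal state mentioned above can be inspected directly on such an optimizer; a toy sketch with a dummy linear model (not the real network):

import torch
import torch.nn as nn
from torch.optim import Adam

model = nn.Linear(10, 2)
optimizer = Adam(model.parameters(), 5e-4, (0.9, 0.999), eps=1e-08, weight_decay=1e-4)

# one dummy step so the optimizer builds its per-parameter buffers
model(torch.randn(4, 10)).sum().backward()
optimizer.step()

# each entry holds 'step', 'exp_avg' and 'exp_avg_sq'; a freshly created
# optimizer starts without them, which is the "cold start" mentioned above
for state in optimizer.state_dict()['state'].values():
    print(state.keys())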

Thank you.

I created a small dummy example and cannot reproduce the issue.
I wanted to check if layers like Dropout and BatchNorm could possibly change something, but at least in the same terminal with a fixed seed the model returns the same loss values.

Still, it could be an issue with the random number generator, although I cannot explain why the loss spikes that much.
Could you try to compare the predictions after calling model.eval()?
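
A rough sketch of such a check, using a dummy model with BatchNorm and Dropout, random input, and the 'state_dict' key from your checkpoint code above:

import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.Dropout(), nn.Linear(10, 2))

model = make_model()
x = torch.randn(8, 10)

# predictions right before saving the checkpoint
model.eval()
with torch.no_grad():
    ref = model(x)

torch.save({'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

# "resumed" run: fresh model instance, load the checkpoint, compare
resumed = make_model()
resumed.load_state_dict(torch.load('checkpoint.pth.tar')['state_dict'])
resumed.eval()
with torch.no_grad():
    out = resumed(x)

# should print True if the state was restored correctly
print(torch.allclose(ref, out))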

Thank you, I will try to find another method to solve this problem.
Thanks for your kind response. :+1:

Hi, I met the same issue. Have you figured it out?

Best

I have the same issue, too. After I resume the model, the loss is much higher than before.

Hello, @ptrblck. If I use PyTorch Lightning's Trainer.fit to train the model, how can I resume training?
My approach is:

model = FaultNetPL(batch_size = 5).cuda()

filenameCheckpoint = 'experiments/FaultNet/epoch=187-avg_valid_iou=0.6822.ckpt'
checkpoint = torch.load(filenameCheckpoint)

trainer = Trainer(checkpoint_callback=checkpoint_callback,
                  resume_from_checkpoint=checkpoint,
                  max_epochs=400,
                  gpus=1,
                  logger=logger)

trainer.fit()

Is this the correct approach?

I don't know how Lightning resumes the training, so this topic might be helpful. CC @williamFalcon
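
For completeness, a rough sketch of what resuming with this Trainer API could look like, assuming resume_from_checkpoint expects the checkpoint path rather than an already-loaded dict; FaultNetPL, checkpoint_callback and logger are the objects from the snippet above:

from pytorch_lightning import Trainer

# FaultNetPL, checkpoint_callback and logger come from the snippet above
model = FaultNetPL(batch_size=5)

trainer = Trainer(checkpoint_callback=checkpoint_callback,
                  resume_from_checkpoint='experiments/FaultNet/epoch=187-avg_valid_iou=0.6822.ckpt',
                  max_epochs=400,
                  gpus=1,
                  logger=logger)

# the LightningModule is passed to fit(); the Trainer restores the checkpoint itself
trainer.fit(model)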