Resume Training Does Not Work

Hi ptrblck,

I have been off the grid for the last few weeks and have not looked at this once. I was wondering whether you have had a chance to look at it. If not, no big deal. I have pretty much decided that, for whatever reason, resuming training simply does not work in this case; I assume there is a bug, but I cannot figure out what to do about it.

Regards
David

Thanks for reminding me. I've pulled the latest changes and can give it a try tomorrow.

Thanks. Really, there's no rush.

Thanks, good to know. Anyway, this issue bugs me, so I guess I'll have a look in the next couple of days. :wink:


Yes. That's how I feel about it too.

Hello @skiddles and @ptrblck,

Has there been a follow-up or a solution to this issue? I am experiencing similar trouble saving and resuming state on PyTorch 0.4.1.

I am willing to share code (a private gitlab repo) and the data (public dataset) in order to help reproduce the issue. Please advise.

Thanks and regards,
Bhargava

Same here, but I'm using PyTorch 1.0.0. When I load the model and optimizer and resume training, the loss bounces back to the initial loss. Any updates?

I am having the same issue with PyTorch 0.4.1. If I save the model the same way as explained in the ImageNet training example on the PyTorch GitHub and load it afterwards, I can see that it yields different results than it was yielding previously.

I am using a training loop like this:

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=config.lr_step_size, gamma=config.lr_gamma)

for epoch in range(epochs):
    scheduler.step()

    train(model)  # In train() I set model.train() first
    test(model)   # In test() I set model.eval() first
    # Then I save the model checkpoint (optimizer, epoch, best_acc, model)

After it is saved, I want to load the model. First I create the exact same model structure and optimizer, and after that I call load_state_dict() on each of them. Is there a difference between saving the model in .eval() mode and .train() mode? IMHO this shouldn't be a problem, right?
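
For reference, this is roughly the save/resume pattern I mean (a minimal sketch; MyModel, the optimizer settings, and the file name checkpoint.pth are placeholders for my actual setup):

import torch

# --- saving, at the end of an epoch ---
torch.save({
    'epoch': epoch,
    'best_acc': best_acc,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
}, 'checkpoint.pth')

# --- resuming ---
model = MyModel()                                        # same architecture as before
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # same optimizer settings as before
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state'])
optimizer.load_state_dict(checkpoint['optimizer_state'])
start_epoch = checkpoint['epoch'] + 1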

Or should I set the model to .eval() before going into the training loop after resuming?

Could the culprit be the lr_scheduler?
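
If it is, the scheduler state could be checkpointed too. If I remember correctly, recent PyTorch versions (1.0+) give the scheduler its own state_dict()/load_state_dict(); a sketch, where checkpoint_dict and the key name 'scheduler_state' are just placeholders:

# When saving, also persist the scheduler state:
checkpoint_dict['scheduler_state'] = scheduler.state_dict()

# When resuming, recreate the scheduler and restore its state:
scheduler = StepLR(optimizer, step_size=config.lr_step_size, gamma=config.lr_gamma)
scheduler.load_state_dict(checkpoint_dict['scheduler_state'])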

Is there someone who also had these issues? How did you fix that?

Thank you very much!


@skiddles Were you able to figure out a solution to this problem?
I have been facing the same problem and I have actually found a solution to it, at least in my case.
There is no problem in your training step, but there might be a problem in your dataloader. Make sure that the dataloader you create has the same word_to_ix dictionary every time. I believe this dictionary is built from the set of words vocab, and that set is created every time you create the dataloader. The set differs (in terms of the order of the words) every time you create a new dataloader instance, so the dictionary word_to_ix is built differently. This change causes problems every time you try to resume training: you are essentially starting with a random assignment of weights, because the embedding that used to correspond to Word1 might now correspond to Word7 due to the change in order.
A simple fix for this is to use vocab = sorted(vocab) once you are done parsing through your corpus.
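
To illustrate the point (a minimal sketch; corpus and the whitespace tokenization are placeholders for the actual dataloader code): iterating over a Python set of strings can give a different order on every run because of hash randomization, so word_to_ix is assembled differently each time unless the vocab is sorted first.

# Sketch: building word_to_ix from an unordered set assigns indices
# differently on each run, so resumed embeddings no longer line up.
vocab = set()
for sentence in corpus:                 # 'corpus' stands in for the parsed dataset
    vocab.update(sentence.split())

vocab = sorted(vocab)                   # fix: deterministic ordering across runs
word_to_ix = {word: ix for ix, word in enumerate(vocab)}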

I have a similar problem. After saving and loading the model state_dict and the optimizer, I get very poor results, as if it were not the network I saved.