Resume Training Does Not Work


(Skiddles) #21

Hi ptrblck,

I have been off the grid for the last few weeks and have not looked at this once. I was wondering if you had a chance to look at it? If not, no big deal. I have pretty much decided that, for whatever reason, resuming training does not work in this case, and I assume there is a bug, but I cannot figure out what to do about it.

Regards
David


#22

Thanks for reminding me. I’ve pulled the latest changes and can give it a try tomorrow.


(Skiddles) #23

Thanks. Really, there’s no rush.


#24

Thanks, good to know. Anyway, this issue bugs me, so I guess I’ll have a look in the next couple of days. :wink:


(Skiddles) #25

Yes. That’s how I feel about it too.


(bhargavaurala) #26

Hello @skiddles and @ptrblck,

Has there been a follow-up or a solution to this issue? I am experiencing similar trouble with saving and resuming states on PyTorch 0.4.1.

I am willing to share code (a private gitlab repo) and the data (public dataset) in order to help reproduce the issue. Please advise.

Thanks and regards,
Bhargava


(Iraquitan Cordeiro Filho) #27

Same here, but I’m using PyTorch 1.0.0. When I load the model and optimizer and resume training, the loss bounces back to its initial value. Any updates?
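For reference, a minimal save/resume sketch (the model, shapes, and file name here are made up purely for illustration). A common cause of the loss "bouncing back" is restoring only the model's `state_dict` and not the optimizer's, which throws away momentum buffers:

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical tiny model, just for illustration
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one training step so the optimizer has momentum buffers to save
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# Save the model *and* optimizer state together
path = os.path.join(tempfile.gettempdir(), "checkpoint_demo.pth")
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}, path)

# Resume: rebuild the exact same model/optimizer, then restore both state dicts
model2 = nn.Linear(4, 2)
optimizer2 = optim.SGD(model2.parameters(), lr=0.1, momentum=0.9)
ckpt = torch.load(path)
model2.load_state_dict(ckpt["model"])
optimizer2.load_state_dict(ckpt["optimizer"])
```

If the checkpoint was saved on GPU and is loaded on a different device, `torch.load(path, map_location=...)` may also be needed so the restored optimizer state lands on the right device.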


(Jan Stratil) #28

I am having the same issues with PyTorch 0.4.1. If I save the model the same way as explained in the ImageNet training example on the PyTorch GitHub and load the model afterwards, I can see that it yields different results than it was yielding previously.

I am using a train loop like this:

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=config.lr_step_size, gamma=config.lr_gamma)

for epoch in range(epochs):
    scheduler.step()

    train(model)  # In train() I set model.train() first
    test(model)   # In test() I set model.eval() first
    # Then I save the model checkpoint (optimizer, epoch, best_acc, model)

After it is saved, I want to load the model. First I create the exact same model “structure” and optimizer, and after that I call load_state_dict() on each of them. Is there a difference between saving the model in .eval() mode and .train() mode? IMHO this shouldn’t be a problem, right?

Or should I set the model to .eval() before entering the train loop after resuming?

Could the culprit be the lr_scheduler?

Has anyone else had these issues? How did you fix them?

Thank you very much!
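On the lr_scheduler question: the scheduler keeps its own epoch counter, so if it is rebuilt from scratch on resume, the learning rate restarts at its initial value. A hedged sketch of a resume flow that also checkpoints the scheduler (this assumes a PyTorch version where schedulers expose `state_dict()`, i.e. 1.0+; on 0.4.1 you would have to restore `scheduler.last_epoch` by hand, e.g. via the `last_epoch` constructor argument):

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Hypothetical tiny setup, just for illustration
model = nn.Linear(4, 2)
optimizer = SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=2, gamma=0.5)

# Simulate a few epochs; the scheduler decays the lr every 2 steps
for epoch in range(4):
    optimizer.step()
    scheduler.step()

# Save model, optimizer, AND scheduler state
path = os.path.join(tempfile.gettempdir(), "ckpt_sched_demo.pth")
torch.save({
    "epoch": 4,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, path)

# Resume: rebuild identical objects, then restore all three state dicts
model2 = nn.Linear(4, 2)
optimizer2 = SGD(model2.parameters(), lr=0.1)
scheduler2 = StepLR(optimizer2, step_size=2, gamma=0.5)
ckpt = torch.load(path)
model2.load_state_dict(ckpt["model"])
optimizer2.load_state_dict(ckpt["optimizer"])
scheduler2.load_state_dict(ckpt["scheduler"])
# Without the scheduler restore, the lr would reset to 0.1 on resume
```

Note the sketch calls `scheduler.step()` after `optimizer.step()`, which is the ordering newer PyTorch versions expect; the loop above with `scheduler.step()` at the top of the epoch follows the older 0.4.x convention.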