I have been off the grid for the last few weeks and have not looked at this once. I was wondering if you have had a chance to look at it? If not, no big deal. I have pretty much concluded that, for whatever reason, restarting training does not work in this case, and I assume there is a bug, but I cannot figure out what to do about it.
I am having the same issue with PyTorch 0.4.1. If I save the model the same way as explained in the ImageNet training example on the PyTorch GitHub and load the model afterwards, I can see that it yields different results than it was yielding previously.
I am using a training loop like this:
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=config.lr_step_size, gamma=config.lr_gamma)
for epoch in range(epochs):
    scheduler.step()
    train(model)  # train() calls model.train() first
    test(model)   # test() calls model.eval() first
    # Then I save the model checkpoint (optimizer, epoch, best_acc, model)
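Concretely, the saving looks roughly like this (just a sketch; the filename and the dictionary keys are only what I happen to use):

import torch

torch.save({
    'epoch': epoch,
    'best_acc': best_acc,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')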
After it is saved, I want to load the model. First I create the exact same model structure and optimizer, and after that I call load_state_dict() for each of them. Is there a difference between saving a model in .eval() mode and in .train() mode? Imho this shouldn't be a problem, right?
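The resume step then looks roughly like this (again a sketch, in a fresh run, matching the keys above):

import torch

checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1
best_acc = checkpoint['best_acc']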
Or should I set the model to .eval() before entering the train loop after resuming?
Could the culprit be in the lr_scheduler?
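For example, I do not know whether I should recreate the scheduler as above and then pass the absolute epoch when stepping, something like this (a sketch; start_epoch is restored from the checkpoint):

scheduler = StepLR(optimizer, step_size=config.lr_step_size, gamma=config.lr_gamma)
for epoch in range(start_epoch, epochs):
    scheduler.step(epoch)  # on 0.4.x, step() accepts an explicit epoch index
    train(model)
    test(model)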
Has anyone else had these issues? How did you fix them?
@skiddles Were you able to figure out a solution to this problem?
I have been facing the same problem and I have actually found a solution to it, at least in my case.
There is no problem in your training step, but there might be a problem in your dataloader. Make sure that the dataloader being created has the same word_to_ix dictionary every time. I believe this dictionary is built from the set of words vocab, and that set is created every time you create the dataloader. This set differs (in terms of the order of words) every time you create a new instance of the dataloader, so the dictionary word_to_ix is built differently each time. This change causes problems every time you try to resume training. (It essentially means that every time you resume training you are starting with a random assignment of weights, because the embedding that corresponded to Word1 might now correspond to Word7 due to the change in order.)
A simple fix for this is to apply vocab = sorted(vocab) once you are done parsing your corpus.
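For example (a sketch; corpus and the whitespace tokenization are just placeholders for however you build your vocabulary):

vocab = set()
for sentence in corpus:
    vocab.update(sentence.split())  # a set has no stable order across runs
vocab = sorted(vocab)               # fix the order so indices are reproducible
word_to_ix = {word: i for i, word in enumerate(vocab)}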