@Rinku_Jadhav2014 unfortunately that tutorial is not sufficient for resuming training. It only saves the model; it does not save the optimizer, epoch, score, etc.
@Bixqu You can check the ImageNet Example line 139
save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'best_prec1': best_prec1,
    'optimizer': optimizer.state_dict(),
}, is_best)
With
def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')
Loading/resuming from the dictionary is in the same example:
if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))
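For anyone who wants to try the save/resume round trip without the full ImageNet script, here is a minimal sketch. The tiny nn.Linear model, the temporary path, and the placeholder values for the epoch and best_prec1 are my own assumptions for illustration; only the dictionary keys follow the example above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical tiny model and optimizer, standing in for the real ones.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so the optimizer holds momentum buffers worth saving.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# Save everything needed to resume, using the same keys as the example above.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth.tar")
torch.save({
    "epoch": 3,          # placeholder: the epoch to resume from
    "state_dict": model.state_dict(),
    "best_prec1": 0.5,   # placeholder score
    "optimizer": optimizer.state_dict(),
}, ckpt_path)

# Resume: rebuild the objects first, then load their saved states.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)

checkpoint = torch.load(ckpt_path)
start_epoch = checkpoint["epoch"]
best_prec1 = checkpoint["best_prec1"]
resumed_model.load_state_dict(checkpoint["state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])

print("resuming from epoch", start_epoch)
```

Note that the model and optimizer have to be constructed before their states are loaded; the checkpoint only stores state, not the objects themselves.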
Hi! The best and safest way to save your model parameters is something like this:
model = MyModel()
# ... after training, save your model
torch.save(model.state_dict(), 'mytraining.pt')
# ... to load your previously trained model:
model.load_state_dict(torch.load('mytraining.pt'))
@diegslva Unfortunately this has the same issue as the tutorial: it won’t save the epoch or the optimizer state, so you can’t resume training, which is what the OP needed.
@mratsim, you’re right! I misunderstood the question.
I don’t usually do this, but maybe something quick and dirty like the following, which saves the entire object:
import copy
import pickle
# model stuff
model = mymodel()
train = trainer.train(model...)
# copy the entire object and save it
saved_trainer = copy.deepcopy(train)
with open(r"my_trainer_object.pkl", "wb") as output_file:
    pickle.dump(saved_trainer, output_file)
@mratsim & @diegslva, when I save the trained (i.e., fine-tuned) ResNet and DenseNet models with torch.save(MyModel.state_dict(), './model.pth'), it doesn’t work correctly, but when I use torch.save(MyModel, './model.pth') the models are saved correctly. In other words, when I load models saved with the first approach they don’t give me correct results, whereas with the second approach the results are good. Am I correct? Could you please explain why this happens?
When you load the model back via the state_dict method, remember to call MyModel.eval(); otherwise the results will differ.
Why will the results differ without calling MyModel.eval()?
Because your BatchNorm or Dropout layers are in train mode by default on construction.
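A small sketch of this effect, in case it helps. The make_model helper and the layer sizes are made up for illustration, and the "trained" model here is just a freshly initialized one standing in for a fine-tuned network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def make_model():
    # Hypothetical architecture containing both BatchNorm and Dropout.
    return nn.Sequential(
        nn.Linear(4, 8),
        nn.BatchNorm1d(8),
        nn.Dropout(p=0.5),
        nn.Linear(8, 2),
    )


# Stand-in for the fine-tuned model, evaluated in eval mode.
trained = make_model()
trained.eval()
x = torch.randn(16, 4)
with torch.no_grad():
    reference = trained(x)

# Rebuild the model and restore the weights through the state_dict.
restored = make_model()
restored.load_state_dict(trained.state_dict())
was_training = restored.training  # a freshly constructed module is in train mode

# In eval mode the restored model reproduces the reference outputs.
restored.eval()
with torch.no_grad():
    eval_out = restored(x)

# Back in train mode, Dropout is active and BatchNorm uses batch statistics.
restored.train()
with torch.no_grad():
    train_out = restored(x)

matches_in_eval = torch.allclose(reference, eval_out)
matches_in_train = torch.allclose(reference, train_out)
print(was_training, matches_in_eval, matches_in_train)
```

The state_dict itself is restored correctly in both cases; only the mode of the module differs. As far as I understand, torch.save(MyModel, ...) pickles the whole module, including its training flag, which would explain why that approach appeared to work if the model was saved while in eval mode.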
If my model doesn’t use layers like dropout or batchnorm, then it makes no difference whether I use model or model.eval(), right?
You’re right. It matters only when you use those layers, as described in the documentation. BN/Dropout behave differently at evaluation time, so you need to toggle the setting manually. You could alternatively use model.train(False). Also, make sure to call eval() at validation time.
I use .eval() and the results are still incorrect.
Hi guys,
I have a question about the behaviour of the dropout layer during training and evaluation. I remember reading in a paper that, because dropout leaves out some units during training, the outgoing weights of the dropout layer need to be scaled down during evaluation by an amount corresponding to the dropout rate. For instance, if the dropout rate is 0.5, the outgoing weights need to be halved, because during evaluation we effectively have twice the number of active units.
So my question is: is this kind of weight scaling mechanism included in the dropout layer in PyTorch as well?
Cheers and thanks a lot for your help.
Shuokai
model.eval() takes care of this. However, I think it scales the activations and not the weights.
Ok I understand. Thanks for the help.
Cheers
It is true that model.eval() takes care of this. However, the scaling is applied during training, not during evaluation.
Furthermore, the outputs are scaled by a factor of 1/(1-p) during training. This means that during evaluation the module simply computes an identity function.
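This is easy to check directly. A quick sketch, assuming p = 0.5 and an all-ones input chosen only to make the scaling visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p=p)
x = torch.ones(1000)

# Training mode: each unit is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p), which is 2 for p = 0.5.
dropout.train()
train_out = dropout(x)
kept_values = train_out[train_out != 0].unique()

# Evaluation mode: the module is an identity function, with no scaling.
dropout.eval()
eval_out = dropout(x)

print("surviving values in train mode:", kept_values.tolist())
print("mean activation in train mode:", train_out.mean().item())
print("identity in eval mode:", torch.equal(eval_out, x))
```

Because the survivors are scaled up during training, the expected activation stays the same in both modes, so nothing needs to be rescaled at evaluation time.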
Newbie question…
Any conventions for filename extensions for saving model and model weights with the following commands?
torch.save(the_model, PATH)
torch.save(the_model.state_dict(), PATH)
We’ve been using .pth, but it’s pretty arbitrary.
Hi, I’m trying to implement training with checkpoints using the above ideas, so that I can resume training from, say, Epoch k and continue training the model from Epoch k to N. Suppose I’ve saved the following to the checkpoint file and reloaded them when resuming training: the epoch, the model’s state_dict(), and the optimizer. I’m not seeing similar training results between the two approaches:
- Train the model from Epoch 1 to N.
- Train the model from Epoch 1 to k, save the model, and resume training from Epoch k to N.
I checked that the learning rates are consistent between 1) and 2), using SGD with the same momentum and weight decay.
Any ideas where I should be looking?
Thanks!
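One way to narrow this down is to compare the two runs on a toy problem where everything is deterministic. The sketch below is only an assumption about what might be going wrong, not a diagnosis of the setup above: the run_epochs and make_optimizer helpers, the linear model, and the random data are all hypothetical. It uses full-batch training with no shuffling or dropout, so the only thing that can differ between the runs is what gets restored.

```python
import copy

import torch
import torch.nn as nn


def run_epochs(model, optimizer, x, y, start_epoch, end_epoch):
    """Full-batch training for epochs in [start_epoch, end_epoch)."""
    for _ in range(start_epoch, end_epoch):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()


def make_optimizer(model):
    return torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)


torch.manual_seed(0)
x, y = torch.randn(32, 4), torch.randn(32, 1)
init = nn.Linear(4, 1)
N, k = 10, 4

# 1) Train without interruption for N epochs.
full = copy.deepcopy(init)
run_epochs(full, make_optimizer(full), x, y, 0, N)

# 2) Train for k epochs, checkpoint, then resume up to N.
first = copy.deepcopy(init)
first_opt = make_optimizer(first)
run_epochs(first, first_opt, x, y, 0, k)
checkpoint = {
    "epoch": k,  # number of finished epochs, so resuming starts at index k
    "state_dict": copy.deepcopy(first.state_dict()),
    "optimizer": copy.deepcopy(first_opt.state_dict()),
    # Irrelevant for this deterministic loop, but it matters once
    # shuffling or dropout draw from the global RNG.
    "rng_state": torch.get_rng_state(),
}

resumed = nn.Linear(4, 1)
resumed_opt = make_optimizer(resumed)
resumed.load_state_dict(checkpoint["state_dict"])
resumed_opt.load_state_dict(checkpoint["optimizer"])
torch.set_rng_state(checkpoint["rng_state"])
run_epochs(resumed, resumed_opt, x, y, checkpoint["epoch"], N)

# Same resume, but with a fresh optimizer (momentum buffers are lost).
no_opt = nn.Linear(4, 1)
no_opt.load_state_dict(checkpoint["state_dict"])
run_epochs(no_opt, make_optimizer(no_opt), x, y, checkpoint["epoch"], N)

same_with_opt = torch.allclose(full.weight, resumed.weight)
same_without_opt = torch.allclose(full.weight, no_opt.weight)
print(same_with_opt, same_without_opt)
```

If the toy case matches but the real training does not, things I would check next are: whether optimizer.state_dict() (not just the hyperparameters) is actually saved and loaded, whether the stored epoch is off by one, whether a learning-rate schedule is restored, and whether data shuffling or dropout make the runs diverge because the random state differs after resuming.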