Ok I understand. Thanks for the help.
Cheers
It is true that model.eval()
takes care of this. However, the scaling happens during training: the outputs are scaled by a factor of 1/(1-p) at training time, so during evaluation the module simply computes an identity function.
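For what it's worth, this is easy to verify with a tiny sketch (using nn.Dropout with p=0.5 as an example):
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                 # training mode: surviving entries are scaled by 1/(1-p) = 2.0, the rest are zero
print(drop(x))

drop.eval()                  # eval mode: identity, all ones
print(drop(x))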
Newbie question…
Any conventions for filename extensions for saving model and model weights with the following commands?
torch.save(the_model, PATH)
torch.save(the_model.state_dict(), PATH)
We’ve been using .pth, but it’s pretty arbitrary.
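For example (torch.save doesn't care about the extension, .pt and .pth are just common choices; the filenames here are placeholders):
torch.save(the_model.state_dict(), 'the_model_weights.pth')   # weights only
torch.save(the_model, 'the_model.pt')                          # whole pickled module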
Hi, I’m trying to implement training with checkpoints using the above ideas, so that I could resume training from, say, Epoch k and re-train the model from Epoch k to N. Suppose I’ve saved the following into the model file and reloaded it for resume training: the epoch, the model’s state_dict(), and the optimizer. However, I’m not seeing similar training results between the two ways: 1) training straight through, and 2) resuming from the checkpoint.
I checked that the learning rates are consistent between 1) and 2), using SGD with the same momentum and weight decay.
Any ideas where I should be looking into?
Thanks!
@galactica147
can you reproduce this in a small dummy / toy script? I’m interested in looking into it if you can repro it.
Hi @smth, thanks for your reply! It will be hard to reproduce it fully, but I’ll try my best to provide as many details as possible.
In my task of object detection in 3D images, I have a U-shape resnet architecture with Faster R-CNN framework for object detection, i.e., finding objects (objectiveness score/probability) and their locations (coordinates of the bounding box).
So in training, I have something like this:
for i, (data, target) in enumerate(data_loader):
    data = Variable(data.cuda(async=True))
    target = Variable(target.cuda(async=True))
    out_dict = net(data)
    loss_output = loss(out_dict['predictions'], target, train=True)
    optimizer.zero_grad()
    loss_output.backward()
    optimizer.step()

if epoch % save_freq == 0:
    save_model(net, optimizer, os.path.join(save_dir, '%03d.ckpt' % epoch))
The following is saved in the model file:
def save_model(net, optim, ckpt_fname):
    state_dict = net.module.state_dict()
    for key in state_dict.keys():
        state_dict[key] = state_dict[key].cpu()
    torch.save({
        'epoch': epoch,
        'state_dict': state_dict,
        'optimizer': optim},
        ckpt_fname)
Then for resume training, I load the pre-trained model file and use it to resume from, say, Epoch k. Before training actually starts, I check whether this is regular or resume training:
if pretrained is not None:
    state_dict = torch.load(pretrained)
    new_state_dict = OrderedDict()  # from collections import OrderedDict
    for k, value in state_dict['state_dict'].iteritems():
        key = "module.{}".format(k)
        new_state_dict[key] = value
    net.load_state_dict(new_state_dict)
    epoch = state_dict['epoch']
    print "pre-trained epoch number: {}".format(epoch)
    optimizer = state_dict['optimizer']
else:
    optimizer = SGD(
        net.parameters(),
        learning_rate,
        momentum=momentum,
        weight_decay=weight_decay)
For the learning rate, I use stair-wise decay: to train for 100 epochs, I decay the learning rate to 1/10 at Epoch 50, and then by another 1/10 at Epoch 80. In this example, k=5, so the learning rate should be the same for the two comparison runs above (baseline 1 and resume training 2).
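Roughly, the schedule logic looks like this (simplified; the names here are just illustrative, not my exact code):
def get_lr(epoch):
    # stair-wise decay: base rate until Epoch 50, 1/10 until Epoch 80, then 1/100
    if epoch < 50:
        return learning_rate
    elif epoch < 80:
        return learning_rate * 0.1
    else:
        return learning_rate * 0.01

for param_group in optimizer.param_groups:
    param_group['lr'] = get_lr(epoch)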
The result: in baseline 1, I saw gradually decreasing false negative rates, while in the 2nd case I didn’t see such progress during resume training. Also, the loss values were significantly different. I’m aware of the non-deterministic nature of GPU training, but that should not be the entire reason for such a discrepancy.
I ran these two multiple times and got the same distinct results each time.
Did I miss anything important here?
Thank you!
I read through your explanation, but if you can reproduce such behaviour even by modifying our mnist example: https://github.com/pytorch/examples/blob/master/mnist/main.py
then it would be much more helpful.
If I get a script that I can run and look through, I can debug with it.
Did you ever solve this? I’m having a similar problem with an assignment due Saturday.
I trained for a week and my computer crashed. I tried restoring my model (I saved the entire model, not just the state dict) on another computer so I could train the last little bit. I’ve ensured that all my optimizer values (using vanilla SGD) and learning rates are exactly the same; however, I immediately get very pronounced gradient explosion that prevents me from training anything further. The only difference is that I’m not on the same GPU. I even tried moving my model to CPU first and then calling .cuda() to move it back to the GPU.
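For reference, the reload on the new machine looks roughly like this (the path is a placeholder); I used map_location because the checkpoint was written from a different GPU:
# load onto CPU first regardless of which GPU it was saved from, then move to the current device
model = torch.load('full_model.pth', map_location=lambda storage, loc: storage)
model = model.cuda()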
I have the same problem. Did you find any solution?
Instead of save_state_dict, shouldn’t it be torch.save(model.state_dict(), 'mytraining.pt')? (The former wouldn’t work for me; I got errors that such a method does not exist.)
Yes, you are right. Your approach works fine. The recommended way of saving and loading is described here.
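In short, something along these lines (TheModelClass and the path are placeholders):
# save only the learnable parameters
torch.save(model.state_dict(), 'mytraining.pt')

# load: re-create the architecture first, then load the weights into it
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load('mytraining.pt'))
model.eval()  # if loading for inference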
Why is a deep copy needed?
Huh, I’m confused: is torch.save(model, ...) actually wrong, and should we be using torch.save(model.state_dict(), ...) instead?
I’m trying to convert a PyTorch model (.pth) into Caffe or ONNX, but I’m confused about how to use the (.pth) file properly. I’m left with a text encoder and an image encoder in (.pth) files, trained to synthesize images from text. I tried to pass text_encoder.state_dict() directly, but I keep getting errors:
torch_out_image = torch.onnx._export(text_encoder.state_dict(), x, "super_resolution.onnx", verbose=True)
Does anyone know the proper way to feed a text encoder into the ONNX converter?
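(A minimal sketch of what the exporter expects, assuming text_encoder is the nn.Module with its weights already loaded and x is an example input of the right shape: the module itself is traced, not its state_dict.)
text_encoder.eval()
torch.onnx.export(text_encoder, x, "super_resolution.onnx", verbose=True)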
Do you mean that after reloading the model to continue training I also have to call .eval() once? Or did you mean this just for evaluation/inference? It’s not really clear to me.
This is how I do it:
torch.save(net.state_dict(), model_save_path + '_.pth')

save_checkpoint({
    'epoch': epoch + 1,
    # 'arch': args.arch,
    'state_dict': net.state_dict(),
    'optimizer': optimizer.state_dict(),
}, is_best, mPath, str(val_acc) + '_' + str(val_los) + "_" + str(epoch) + '_checkpoint.pth.tar')
Where:
def save_checkpoint(state, is_best, save_path, filename):
    filename = os.path.join(save_path, filename)
    torch.save(state, filename)
    if is_best:
        bestname = os.path.join(save_path, 'model_best.pth.tar')
        shutil.copyfile(filename, bestname)
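And to resume from one of these checkpoints, I reload it roughly like this (same variable names as above; the exact filename depends on which checkpoint is picked):
checkpoint = torch.load(os.path.join(mPath, 'model_best.pth.tar'))
net.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch']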