I am new to PyTorch and I am trying to create word embeddings to learn PyTorch. Unfortunately, I am running on an old laptop and only get limited training runs before having to shutdown. So I am trying to set the model up to resume training, only it does not appear to resume training. If interested, the full code is here. I come to this conclusion as when I train for multiple epochs, the loss consistently declines, in this case from about 11,000 down to about 5,000. From epoch n to n+1 the loss is relatively stable. When I try to resume from a saved checkpoint, the loss starts out at 11,000 again.
Maybe I am misinterpreting what is happening. Here are the relevant snippits of code.
Did you use some kind of lr_scheduler?
Your saveing code looks alright so far.
I’ve created a small gist a while ago, since another user reported a similar issue.
Could you compare your code to mine and check, if there are any logical differences?
Thanks for looking over my problem. I can’t see any logical differences between my save/resume code and yours. Not saying that there’s no difference, just that I cannot see any difference. Did you happen to see my full code? One difference I did notice between you gist and my code is that I don’t have a train function. Since I am new, I have been thinking this is okay because I have the code in the main section. But I am questioning everything at this point.
One other thing that I was looking at is the way I tried to implement mini-batch processing. TBH, I don’t see how it would be causing the resume to work incorrectly.
Later, in the training I use slices to pass in a MINI_BATCH of records at a time.
...
for epoch in EPOCH_RANGE:
current_start = 0 #This keeps track of the current starting row in the mini
total_loss = torch.Tensor([0])
keep_going = True
while keep_going:
if current_start + parms.MINI_BATCH < len(text_data.target_ids):
minibatchids = slice(current_start, current_start + parms.MINI_BATCH -1)
else:
minibatchids = slice(current_start, len(text_data.target_ids))
keep_going = False
model.zero_grad()
log_probs = model(torch.Tensor(text_data.context_ids[minibatchids]).long()) # This passes in a slice of instances from the data.
loss = loss_function(log_probs,
torch.autograd.Variable(
torch.squeeze(
torch.Tensor(text_data.target_ids[minibatchids]).long()
)
)
) * parms.MINI_BATCH
loss.backward()
optimizer.step()
...
This SEEMs to work and I get a ten times improvement in throughput, but maybe this is part of the problem? It shouldn’t be as its not the weights. Maybe there is a better way to implement this in a more PyTorch fashion.
I cloned your repo, since it was easier, but thanks for the numbers.
Using the default params, I get an error in calculating the loss.
While log_probs has a shape of [1, 15195], text_data.target_ids[minibatchids] has a shape of [1, 1].
Do you have the same issue?
Thanks! I pulled the current code and still there is an error regarding the loss function. len(text_data.target_ids) is just 1. Could it be the Corpus.pkl is wrong?
I really appreciate that you have been helping me with this. I have been trying to confirm as much as I can, although the holiday has been eating my time thankfully.
I used the following code to try and prove that the weights are being reloaded on a restart.
f = open('./data/weights.txt', 'w')
for x in model.parameters():
f.write(str(x.data))
f.close()
I run this code at the end of each epoch and after loading the model.
I have managed to confirm that the model is setting the weights on load. That said, the loss still goes back to basically the starting values. I am struggling to interpret this behavior. I am left with the following questions:
Is there a reasonable explanation for this behavior?
Is this a function of the loss measure?
Given that this is an unsupervised process, I cannot tell if the model is trained enough. I would assume that I would run it until the loss flatlines, but I cannot run this on my laptop that long, which is why I am focus on the restart function.
Could you train you model for a while, save the optimizer, reload it, and continue the training in the same session?
The issue seems strange and I think we have to debug it step by step.
I did save the optimizer and reloaded it in session and the loss continued to decline from the previous epochs. When I did this in a subsequent session, the loss would bounce back up to the approximate loss from the first epoch.
Is the loss function a static construct? Should I be saving the loss function and reloading it next session?
Your current code in the repo was updated, right?
At least that’s not the local copy I have.
Currently you are accumulating the gradients in the train function.
I’ll pull again and try to run your code. The last time I still had size mismatch errors.