Resume Training Does Not Work

I am new to PyTorch and am trying to create word embeddings as a learning exercise. Unfortunately, I am running on an old laptop and only get limited training runs before having to shut down, so I am trying to set the model up to resume training, only it does not appear to resume. If interested, the full code is here. I come to this conclusion because when I train for multiple epochs, the loss declines consistently, in this case from about 11,000 down to about 5,000, and from epoch n to n+1 the loss is relatively stable. When I try to resume from a saved checkpoint, however, the loss starts out at about 11,000 again.

Maybe I am misinterpreting what is happening. Here are the relevant snippets of code.

At the end of each epoch I save the model:

def save_checkpoint(state, filename):
    torch.save(state, filename)

save_checkpoint({
    'epoch': epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}, parms.save_dir)

Then on resume, I load the model with this code:

checkpoint = torch.load(parms.save_dir)
parms.start_epoch = checkpoint['epoch']
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
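The restored epoch is then used to pick up the loop again, roughly like this (parms.MAX_EPOCHS is just a placeholder here for whatever upper bound I have configured):

# hypothetical sketch: rebuild the epoch range from the restored counter
EPOCH_RANGE = range(parms.start_epoch, parms.MAX_EPOCHS)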

Training resumes at the next epoch, but the loss values make me think the weights are not saved or reapplied.

What am I missing?


Did you use some kind of lr_scheduler?
Your saving code looks alright so far.
I created a small gist a while ago, since another user reported a similar issue.
Could you compare your code to mine and check if there are any logical differences?
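If you do use a scheduler, its state should be saved and restored as well. A rough sketch of what I mean (StepLR is just an example, and scheduler.state_dict() assumes a reasonably recent PyTorch version):

# rough sketch: include the scheduler state in the checkpoint (StepLR as an example)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

save_checkpoint({
    'epoch': epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
}, parms.save_dir)

# and on resume:
checkpoint = torch.load(parms.save_dir)
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
scheduler.load_state_dict(checkpoint['scheduler'])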


Hi ptrblck,

Thanks for looking over my problem. I can't see any logical differences between my save/resume code and yours; I'm not saying there is no difference, just that I cannot see one. Did you happen to see my full code? One difference I did notice between your gist and my code is that I don't have a train function. Since I am new, I have been assuming this is okay because the code lives in the main section, but I am questioning everything at this point.

One other thing I was looking at is the way I tried to implement mini-batch processing. To be honest, I don't see how it would cause the resume to work incorrectly.

Here is my line for the DataLoader:

    dataloader = DataLoader(text_data,
                            batch_size=parms.MINI_BATCH,
                            shuffle=True,
                            num_workers=3)

Later, in the training loop, I use slices to pass in a MINI_BATCH of records at a time.

...
    for epoch in EPOCH_RANGE:

        current_start = 0  # This keeps track of the current starting row in the mini-batch
        total_loss = torch.Tensor([0])

        keep_going = True
        while keep_going:
            if current_start + parms.MINI_BATCH < len(text_data.target_ids):
                minibatchids = slice(current_start, current_start + parms.MINI_BATCH -1)
            else:
                minibatchids = slice(current_start, len(text_data.target_ids))
                keep_going = False

            model.zero_grad()
            log_probs = model(torch.Tensor(text_data.context_ids[minibatchids]).long()) # This passes in a slice of instances from the data.
            loss = loss_function(log_probs,
                                 torch.autograd.Variable(
                                     torch.squeeze(
                                        torch.Tensor(text_data.target_ids[minibatchids]).long()
                                        )
                                     )
                                 ) * parms.MINI_BATCH
            loss.backward()
            optimizer.step()
...

This SEEMS to work and I get a ten-times improvement in throughput, but maybe this is part of the problem? It shouldn't be, as it's not touching the weights. Maybe there is a better way to implement this in a more PyTorch fashion.
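For what it's worth, the more PyTorch-style alternative I had in mind would be to let the DataLoader do the batching and just iterate over it, something like this sketch (which assumes the dataset yields (context, target) pairs):

# sketch: let the DataLoader hand out mini-batches instead of slicing manually
for epoch in EPOCH_RANGE:
    total_loss = 0.0
    for context_batch, target_batch in dataloader:
        model.zero_grad()
        log_probs = model(context_batch.long())
        loss = loss_function(log_probs, target_batch.squeeze().long())
        loss.backward()
        optimizer.step()
        total_loss += loss.item()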

Since I didn’t document it earlier, I am running:

  • Ubuntu 16.04 LTS
  • Python 3.6
  • PyTorch 0.4.0 (CPU only, as there is no eligible GPU)

Any help would be greatly appreciated.

Could you help me out with some tensor shapes to get the code running without the dataset?

  • len(text_data.vocab)
  • shape of torch.Tensor(text_data.context_ids[minibatchids]).long()

len(text_data.vocab) = 15195

torch.Tensor(text_data.context_ids[minibatchids]).long() has shape torch.Size([999, 2])

Thanks

I cloned your repo, since it was easier, but thanks for the numbers. :wink:

Using the default params, I get an error in calculating the loss.
While log_probs has a shape of [1, 15195], text_data.target_ids[minibatchids] has a shape of [1, 1].
Do you have the same issue?

text_data.target_ids[minibatchids] should be [999, 1]

log_probs shape is [999, 15195]
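Those shapes should line up with what NLLLoss expects, i.e. log-probabilities of shape [N, C] and targets of shape [N] (after the squeeze). A standalone sanity check with random data:

# standalone check of the shapes NLLLoss expects: input [N, C], target [N]
import torch
import torch.nn as nn
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(999, 15195), dim=1)
targets = torch.randint(0, 15195, (999,)).long()
print(nn.NLLLoss()(log_probs, targets))  # prints a single scalar loss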

Let me check the default params on GitHub. I am new to GitHub, so maybe the defaults are not what I expected.

I did actually update a few of the default parameters.

Thanks! I pulled the current code and there is still an error regarding the loss function.
len(text_data.target_ids) is just 1. Could it be that the Corpus.pkl is wrong?

I assume the pickle file (clean_corpus.pkl) is OK. I have downloaded it from GitHub and it seems to work for me.

Hi ptrblck,

I really appreciate that you have been helping me with this. I have been trying to confirm as much as I can, although the holiday has been eating my time, thankfully. :slight_smile:

I used the following code to try and prove that the weights are being reloaded on a restart.

        with open('./data/weights.txt', 'w') as f:
            for x in model.parameters():
                f.write(str(x.data))

I run this code at the end of each epoch and after loading the model.
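Since the printed tensors get truncated for large layers, a stricter check would be to compare the checkpoint against the reloaded model directly, roughly like this sketch:

# sketch: compare every tensor in the checkpoint against the reloaded model
checkpoint = torch.load(parms.save_dir)
for name, saved in checkpoint['state_dict'].items():
    if not torch.equal(saved, model.state_dict()[name]):
        print('mismatch in', name)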

I have managed to confirm that the model is setting the weights on load. That said, the loss still goes back to basically the starting values. I am struggling to interpret this behavior. I am left with the following questions:

  1. Is there a reasonable explanation for this behavior?
  2. Is this a function of the loss measure?

Given that this is an unsupervised process, I cannot tell if the model is trained enough. I would assume that I should run it until the loss flatlines, but I cannot run it on my laptop for that long, which is why I am focused on the restart functionality.

Any advice would be appreciated.

Could you train your model for a while, save the optimizer, reload it, and continue the training in the same session?
The issue seems strange and I think we have to debug it step by step.
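Something like this, all inside one script (a rough sketch; train() just stands in for your epoch loop):

# rough sketch of the in-session test
train(num_epochs=2)   # loss should decrease

torch.save({'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()}, parms.save_dir)
checkpoint = torch.load(parms.save_dir)
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])

train(num_epochs=2)   # loss should continue roughly where it left off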

Hello again,

I did save the optimizer and reloaded it in session and the loss continued to decline from the previous epochs. When I did this in a subsequent session, the loss would bounce back up to the approximate loss from the first epoch.

Is the loss function a static construct? Should I be saving the loss function and reloading it in the next session?

Thanks ptrblck

What kind of loss function are you using?
Your code just specifies loss_function.
Is it stateful, i.e. does it contain any parameters?

loss_function = nn.NLLLoss(), so no parameters.

Should it be stateful? How do I make it stateful?

Sounds like it should be stateful.

No, it’s fine. You don’t have to save and load it, since it’s just a function without any stored parameters.
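You can check that directly:

import torch.nn as nn
print(list(nn.NLLLoss().parameters()))  # NLLLoss is an nn.Module but registers no parameters: prints []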

Okay, I am at a loss (pun intended).

What to do next?

Your current code in the repo was updated, right?
At least it doesn't match the local copy I have.
Currently you are accumulating the gradients in the train function.

I’ll pull again and try to run your code. The last time I still had size mismatch errors.

There are minor differences between the repo and the test I ran where I saved the optimizer and reloaded it.

Let me know if it does not work for you.