Resume Training Does Not Work

I am new to PyTorch and am trying to create word embeddings as a learning exercise. Unfortunately, I am running on an old laptop and only get limited training runs before having to shut down, so I am trying to set the model up to resume training, only it does not appear to resume. If interested, the full code is here. I come to this conclusion because when I train for multiple epochs, the loss declines consistently, in this case from about 11,000 down to about 5,000, and from epoch n to n+1 the loss is relatively stable. When I try to resume from a saved checkpoint, however, the loss starts out at about 11,000 again.

Maybe I am misinterpreting what is happening. Here are the relevant snippets of code.

At the end of each epoch I save the model:

def save_checkpoint(state, filename):
    torch.save(state, filename)

save_checkpoint({
    'epoch': epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}, parms.save_dir)

Then on resume, I load the model with this code:

checkpoint = torch.load(parms.save_dir)
parms.start_epoch = checkpoint['epoch']
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
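The restored epoch is then used to pick up the loop again, roughly like this (parms.MAX_EPOCHS is just a placeholder here for whatever upper bound I have configured):

# hypothetical sketch: rebuild the epoch range from the restored counter
EPOCH_RANGE = range(parms.start_epoch, parms.MAX_EPOCHS)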

Training resumes at the next epoch, but the loss values make me think the weights are not saved or reapplied.

What am I missing?


Did you use some kind of lr_scheduler?
Your saving code looks alright so far.
I created a small gist a while ago, since another user reported a similar issue.
Could you compare your code to mine and check if there are any logical differences?
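If you do use a scheduler, its state should be saved and restored as well. A rough sketch of what I mean (StepLR is just an example, and scheduler.state_dict() assumes a reasonably recent PyTorch version):

# rough sketch: include the scheduler state in the checkpoint (StepLR as an example)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

save_checkpoint({
    'epoch': epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
}, parms.save_dir)

# and on resume:
checkpoint = torch.load(parms.save_dir)
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
scheduler.load_state_dict(checkpoint['scheduler'])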


Hi ptrblck,

Thanks for looking over my problem. I can't see any logical differences between my save/resume code and yours; I'm not saying there is no difference, just that I cannot see one. Did you happen to see my full code? One difference I did notice between your gist and my code is that I don't have a train function. Since I am new, I have been assuming this is okay because the code lives in the main section, but I am questioning everything at this point.

One other thing I was looking at is the way I tried to implement mini-batch processing. To be honest, I don't see how it would cause the resume to work incorrectly.

Here is my line for the DataLoader:

    dataloader = DataLoader(text_data,
                            batch_size=parms.MINI_BATCH,
                            shuffle=True,
                            num_workers=3)

Later, in the training loop, I use slices to pass in a MINI_BATCH of records at a time.

...
    for epoch in EPOCH_RANGE:

        current_start = 0  # This keeps track of the current starting row in the mini-batch
        total_loss = torch.Tensor([0])

        keep_going = True
        while keep_going:
            if current_start + parms.MINI_BATCH < len(text_data.target_ids):
                minibatchids = slice(current_start, current_start + parms.MINI_BATCH -1)
            else:
                minibatchids = slice(current_start, len(text_data.target_ids))
                keep_going = False

            model.zero_grad()
            log_probs = model(torch.Tensor(text_data.context_ids[minibatchids]).long()) # This passes in a slice of instances from the data.
            loss = loss_function(log_probs,
                                 torch.autograd.Variable(
                                     torch.squeeze(
                                        torch.Tensor(text_data.target_ids[minibatchids]).long()
                                        )
                                     )
                                 ) * parms.MINI_BATCH
            loss.backward()
            optimizer.step()
...

This SEEMS to work and I get a ten-times improvement in throughput, but maybe this is part of the problem? It shouldn't be, as it's not touching the weights. Maybe there is a better way to implement this in a more PyTorch fashion.
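For what it's worth, the more PyTorch-style alternative I had in mind would be to let the DataLoader do the batching and just iterate over it, something like this sketch (which assumes the dataset yields (context, target) pairs):

# sketch: let the DataLoader hand out mini-batches instead of slicing manually
for epoch in EPOCH_RANGE:
    total_loss = 0.0
    for context_batch, target_batch in dataloader:
        model.zero_grad()
        log_probs = model(context_batch.long())
        loss = loss_function(log_probs, target_batch.squeeze().long())
        loss.backward()
        optimizer.step()
        total_loss += loss.item()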

Since I didn’t document it earlier, I am running:

  • Ubuntu 16.04 LTS
  • Python 3.6
  • PyTorch 0.4.0 (CPU only, as there is no eligible GPU)

Any help would be greatly appreciated.

Could you help me out with some tensor shapes to get the code running without the dataset?

  • len(text_data.vocab)
  • shape of torch.Tensor(text_data.context_ids[minibatchids]).long()

len(text_data.vocab) = 15195

torch.Tensor(text_data.context_ids[minibatchids]).long() has shape torch.Size([999, 2])

Thanks

I cloned your repo, since it was easier, but thanks for the numbers. :wink:

Using the default params, I get an error in calculating the loss.
While log_probs has a shape of [1, 15195], text_data.target_ids[minibatchids] has a shape of [1, 1].
Do you have the same issue?

text_data.target_ids[minibatchids] should be [999, 1]

log_probs shape is [999, 15195]
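Those shapes should line up with what NLLLoss expects, i.e. log-probabilities of shape [N, C] and targets of shape [N] (after the squeeze). A standalone sanity check with random data:

# standalone check of the shapes NLLLoss expects: input [N, C], target [N]
import torch
import torch.nn as nn
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(999, 15195), dim=1)
targets = torch.randint(0, 15195, (999,)).long()
print(nn.NLLLoss()(log_probs, targets))  # prints a single scalar loss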

Let me check the default params on GitHub. I am new to GitHub, so maybe the defaults are not what I expected.

I did actually update a few of the default parameters.

Thanks! I pulled the current code and there is still an error regarding the loss function.
len(text_data.target_ids) is just 1. Could it be that the Corpus.pkl is wrong?

I assume the pickle file (clean_corpus.pkl) is OK. I have downloaded it from GitHub and it seems to work for me.

Hi ptrblck,

I really appreciate that you have been helping me with this. I have been trying to confirm as much as I can, although the holiday has been eating my time, thankfully. :slight_smile:

I used the following code to try and prove that the weights are being reloaded on a restart.

        with open('./data/weights.txt', 'w') as f:
            for x in model.parameters():
                f.write(str(x.data))

I run this code at the end of each epoch and after loading the model.
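Since the printed tensors get truncated for large layers, a stricter check would be to compare the checkpoint against the reloaded model directly, roughly like this sketch:

# sketch: compare every tensor in the checkpoint against the reloaded model
checkpoint = torch.load(parms.save_dir)
for name, saved in checkpoint['state_dict'].items():
    if not torch.equal(saved, model.state_dict()[name]):
        print('mismatch in', name)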

I have managed to confirm that the model is setting the weights on load. That said, the loss still goes back to basically the starting values. I am struggling to interpret this behavior. I am left with the following questions:

  1. Is there a reasonable explanation for this behavior?
  2. Is this a function of the loss measure?

Given that this is an unsupervised process, I cannot tell if the model is trained enough. I would assume that I should run it until the loss flatlines, but I cannot run it on my laptop for that long, which is why I am focused on the restart functionality.

Any advice would be appreciated.

Could you train your model for a while, save the optimizer, reload it, and continue the training in the same session?
The issue seems strange and I think we have to debug it step by step.
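Something like this, all inside one script (a rough sketch; train() just stands in for your epoch loop):

# rough sketch of the in-session test
train(num_epochs=2)   # loss should decrease

torch.save({'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()}, parms.save_dir)
checkpoint = torch.load(parms.save_dir)
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])

train(num_epochs=2)   # loss should continue roughly where it left off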

Hello again,

I did save the optimizer and reloaded it in session and the loss continued to decline from the previous epochs. When I did this in a subsequent session, the loss would bounce back up to the approximate loss from the first epoch.

Is the loss function a static construct? Should I be saving the loss function and reloading it in the next session?

Thanks ptrblck

What kind of loss function are you using?
Your code just specifies loss_function.
Is it stateful, i.e. does it contain any parameters?

loss_function = nn.NLLLoss(), so no parameters.

Should it be stateful? How do I make it stateful?

Sounds like it should be stateful.

No, it’s fine. You don’t have to save and load it, since it’s just a function without any stored parameters.
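You can check that directly:

import torch.nn as nn
print(list(nn.NLLLoss().parameters()))  # NLLLoss is an nn.Module but registers no parameters: prints []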

Okay, I am at a loss (pun intended).

What to do next?

Your current code in the repo was updated, right?
At least it doesn't match the local copy I have.
Currently you are accumulating the gradients in the train function.

I’ll pull again and try to run your code. The last time I still had size mismatch errors.

There are minor differences between the repo and the test I ran where I saved the optimizer and reloaded it.

Let me know if it does not work for you.