Resuming training from checkpoints produces different losses

Hello,

As the title states, I am seeing spikes in the loss when I resume training, even though I save everything in the checkpoint (model state and optimizer state) and set a manual seed, as shown below.

Dataloaders:
A function that returns the dataloaders, called at the start of my training program.

    import torch
    from torch.utils.data import DataLoader, SubsetRandomSampler

    # Fixed seed so the train/val/test split is identical on every run
    torch.manual_seed(1)
    indices = torch.randperm(len(train_dataset)).tolist()
    train_idx = indices[val_size + test_size:]
    valid_idx = indices[:val_size]
    test_idx = indices[val_size:test_size + val_size]

    train_sampler = SubsetRandomSampler(train_idx)
    val_sampler = SubsetRandomSampler(valid_idx)
    test_sampler = SubsetRandomSampler(test_idx)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler,
                              pin_memory=torch.cuda.is_available(), num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=val_sampler,
                            pin_memory=torch.cuda.is_available(), num_workers=4)
    test_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=test_sampler,
                             pin_memory=torch.cuda.is_available(), num_workers=4)

Saving:
I save the checkpoint right after the training pass and before running validation or test, which means the model is still in model.train() mode.

        # Inside the epoch loop; e is the index of the epoch that just finished
        torch.save({
            'epoch': e,
            'model_state_dict': mymodel.state_dict(),
            'best_loss': best_loss,
            'optimizer_state_dict': optim.state_dict(),
        }, os.path.join("Checkpoints", path, 'training_state.pt'))
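
For completeness, the checkpoint could also carry the RNG states so that shuffling and any dropout pick up exactly where they left off. This is only a sketch; the *_rng_state keys are names I made up and are not in the checkpoint shown above:

        # Sketch only: same checkpoint as above plus RNG states
        # (the *_rng_state keys are hypothetical additions).
        import random
        import numpy as np

        torch.save({
            'epoch': e,
            'model_state_dict': mymodel.state_dict(),
            'best_loss': best_loss,
            'optimizer_state_dict': optim.state_dict(),
            'torch_rng_state': torch.get_rng_state(),
            'cuda_rng_state': torch.cuda.get_rng_state_all(),
            'numpy_rng_state': np.random.get_state(),
            'python_rng_state': random.getstate(),
        }, os.path.join("Checkpoints", path, 'training_state.pt'))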

Loading:
I check whether a checkpoint exists in a given directory and, if so, load it at the start of my training program.

    # Defaults used when no checkpoint is found
    starting_epoch = 0
    best_loss = 100000
    if os.path.exists(os.path.join("Checkpoints", path, 'training_state.pt')):
        checkpoint = torch.load(os.path.join("Checkpoints", path, 'training_state.pt'), map_location=device)
        mymodel.load_state_dict(checkpoint['model_state_dict'])
        optim.load_state_dict(checkpoint['optimizer_state_dict'])
        starting_epoch = checkpoint['epoch']
        best_loss = checkpoint['best_loss']
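
And the matching restore after loading, again only a sketch that uses the hypothetical *_rng_state keys from the saving sketch above:

    # Sketch only: restore the RNG states saved by the sketch above.
    import random
    import numpy as np

    torch.set_rng_state(checkpoint['torch_rng_state'])
    if torch.cuda.is_available():
        torch.cuda.set_rng_state_all(checkpoint['cuda_rng_state'])
    np.random.set_state(checkpoint['numpy_rng_state'])
    random.setstate(checkpoint['python_rng_state'])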

Do you see anything that might cause these spikes in the loss after resuming training? For example, it jumps from 0.54 to 0.7 when resumed.

The posted code looks fine.
Could you post some information about your model and training routine?
If possible, could you post your model definition?

Did you ever solve this issue?


Hi,

I am facing the exact same problem. Is there a solution that lets training continue from the same loss?

Thanks

Does the peak in loss occur when using adaptive optimizers or even with something like SGD with no momentum?
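
For context on why I ask, here is a tiny self-contained sketch (made-up parameter, illustration only) of what each optimizer actually keeps in its state dict:

    import torch

    # One dummy parameter with a gradient, just to populate optimizer state
    param = torch.nn.Parameter(torch.randn(2, 2))
    (param ** 2).sum().backward()

    sgd = torch.optim.SGD([param], lr=0.001)    # no momentum -> (almost) no state
    adam = torch.optim.Adam([param], lr=0.001)  # running moments per parameter
    sgd.step()
    adam.step()

    print(sgd.state_dict()['state'])   # typically empty for momentum-free SGD
    print(adam.state_dict()['state'])  # {0: {'step': ..., 'exp_avg': ..., 'exp_avg_sq': ...}}

If the jump shows up even with momentum-free SGD, a badly restored optimizer state is unlikely to be the explanation.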

I am running some experiments to figure out the root cause. It is happening even with a plain SGD optimizer with no momentum.

    optimizer = optim.SGD(net.parameters(), lr=0.001)

In the following image, the slight jump in the loss after 2000 iterations is where training was resumed from the checkpoint.
[image: training loss curve]

I found the root cause of my problem. It had nothing to do with model saving or loading, just a silly mistake.
