Resuming training from checkpoints produces different losses

Hello,

As the title states, I am seeing spikes in the loss when I resume training, even though I save everything in the checkpoint (model state and optimizer state) and set a manual seed, as shown below.

Dataloaders:
A function that returns the dataloaders, called at the start of my training program.

    import torch
    from torch.utils.data import DataLoader, SubsetRandomSampler

    # Fixed seed so the train/val/test split is identical on every run
    torch.manual_seed(1)
    indices = torch.randperm(len(train_dataset)).tolist()
    train_idx = indices[val_size + test_size:]
    valid_idx = indices[:val_size]
    test_idx = indices[val_size:test_size + val_size]

    train_sampler = SubsetRandomSampler(train_idx)
    val_sampler = SubsetRandomSampler(valid_idx)
    test_sampler = SubsetRandomSampler(test_idx)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler,
                              pin_memory=torch.cuda.is_available(), num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=val_sampler,
                            pin_memory=torch.cuda.is_available(), num_workers=4)
    test_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=test_sampler,
                             pin_memory=torch.cuda.is_available(), num_workers=4)

Saving:
I save the checkpoint right after the training pass and before running validation or test, which means the model is still in model.train() mode.

        # Inside the epoch loop; e is the index of the epoch that just finished
        torch.save({
            'epoch': e,
            'model_state_dict': mymodel.state_dict(),
            'best_loss': best_loss,
            'optimizer_state_dict': optim.state_dict(),
        }, os.path.join("Checkpoints", path, 'training_state.pt'))
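
For completeness, the checkpoint could also carry the RNG states so that shuffling and any dropout pick up exactly where they left off. This is only a sketch; the *_rng_state keys are names I made up and are not in the checkpoint shown above:

        # Sketch only: same checkpoint as above plus RNG states
        # (the *_rng_state keys are hypothetical additions).
        import random
        import numpy as np

        torch.save({
            'epoch': e,
            'model_state_dict': mymodel.state_dict(),
            'best_loss': best_loss,
            'optimizer_state_dict': optim.state_dict(),
            'torch_rng_state': torch.get_rng_state(),
            'cuda_rng_state': torch.cuda.get_rng_state_all(),
            'numpy_rng_state': np.random.get_state(),
            'python_rng_state': random.getstate(),
        }, os.path.join("Checkpoints", path, 'training_state.pt'))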

Loading:
I check whether a checkpoint exists in a given directory and, if so, load it at the start of my training program.

    # Defaults used when no checkpoint is found
    starting_epoch = 0
    best_loss = 100000
    if os.path.exists(os.path.join("Checkpoints", path, 'training_state.pt')):
        checkpoint = torch.load(os.path.join("Checkpoints", path, 'training_state.pt'), map_location=device)
        mymodel.load_state_dict(checkpoint['model_state_dict'])
        optim.load_state_dict(checkpoint['optimizer_state_dict'])
        starting_epoch = checkpoint['epoch']
        best_loss = checkpoint['best_loss']
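
And the matching restore after loading, again only a sketch that uses the hypothetical *_rng_state keys from the saving sketch above:

    # Sketch only: restore the RNG states saved by the sketch above.
    import random
    import numpy as np

    torch.set_rng_state(checkpoint['torch_rng_state'])
    if torch.cuda.is_available():
        torch.cuda.set_rng_state_all(checkpoint['cuda_rng_state'])
    np.random.set_state(checkpoint['numpy_rng_state'])
    random.setstate(checkpoint['python_rng_state'])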

Do you see anything that might cause these spikes in the loss after resuming training? For example, it jumps from 0.54 to 0.7 when resumed.

The posted code looks fine.
Could you post some information about your model and training routine?
If possible, could you post your model definition?

Did you ever solve this issue?


Hi,

I am facing the exact same problem. Is there a solution that lets training continue from the same loss?

Thanks

Does the peak in loss occur when using adaptive optimizers or even with something like SGD with no momentum?
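
For context on why I ask, here is a tiny self-contained sketch (made-up parameter, illustration only) of what each optimizer actually keeps in its state dict:

    import torch

    # One dummy parameter with a gradient, just to populate optimizer state
    param = torch.nn.Parameter(torch.randn(2, 2))
    (param ** 2).sum().backward()

    sgd = torch.optim.SGD([param], lr=0.001)    # no momentum -> (almost) no state
    adam = torch.optim.Adam([param], lr=0.001)  # running moments per parameter
    sgd.step()
    adam.step()

    print(sgd.state_dict()['state'])   # typically empty for momentum-free SGD
    print(adam.state_dict()['state'])  # {0: {'step': ..., 'exp_avg': ..., 'exp_avg_sq': ...}}

If the jump shows up even with momentum-free SGD, a badly restored optimizer state is unlikely to be the explanation.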

I am running some experiments to figure out the root cause. It is happening even with a plain SGD optimizer with no momentum.

    optimizer = optim.SGD(net.parameters(), lr=0.001)

In the following image, the slight jump in the loss after 2000 iterations is where training was resumed from the checkpoint.
[image: training loss curve]

I found the root cause of my problem. It had nothing to do with model saving or loading, just a silly mistake.
