Training loss increase after loading model, optimizer and scheduler

I’m working on a research computer vision project and I have a problem that prevents me from resuming training properly after a crash or interrupt: the training loss increases after the restart. This is my code to load the checkpoint:

from torch.optim import lr_scheduler

N_EPOCHS = 120

if load_weights:
    # Resuming: rebuild the optimizer and scheduler, then restore their saved states
    optimizer = torch.optim.Adam(model.parameters(), lr=checkpoint['last_lr'], weight_decay=weight_decay)
    optimizer.load_state_dict(checkpoint['optimizer'])
    scheduler = lr_scheduler.StepLR(optimizer, step_size=step_of_scheduler, gamma=0.9, last_epoch=loaded_epochs)
    scheduler.load_state_dict(checkpoint['scheduler'])
else:
    # Fresh run: start from the initial learning rate
    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr, weight_decay=weight_decay)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=step_of_scheduler, gamma=0.9)

I even tried adding optimizer.state_dict()['param_groups'][0]['params'] = checkpoint['optimizer']['param_groups'][0]['params'], but the result is even worse. This is the piece of code that saves the checkpoint after validation:

# Checkpoint
checkpoint = {'model': model.state_dict(),
              'epoch': loaded_epochs + epoch + 1,
              'last_validation_acc': val_acc_db_avg[-1],
              'hyperparameters': hyperparameters,
              'last_lr': optimizer.param_groups[0]['lr'],
              'best_general_val': max(val_acc_db_avg),
              'last_train_loss': train_loss_db[-1],
              'optimizer': optimizer.state_dict(),
              'scheduler': scheduler.state_dict(),
              'img_resize': 512 if lr_scaled else 256,
              'best_train_acc': max(train_acc_db),
              'lr_scaled': lr_scaled
              }

# Backup to drive
torch.save(checkpoint, filename_of_checkpoint)

I did make sure to call model.train() before training starts, and I also call scheduler.step(). That said, the training loss jumps from 0.81 to 0.89 (or to 0.90 if I manually assign the params). I don’t know what might be going wrong here, since I also load the model state dict properly.
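For context, this is roughly how my epoch loop is ordered (a simplified sketch; train_loader, criterion and DEVICE stand in for my actual data pipeline, loss and device):

for epoch in range(N_EPOCHS):
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(DEVICE), targets.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
    # StepLR is stepped once per epoch, after the optimizer updates
    scheduler.step()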

This is how I load my model state dict:

if load_weights:
    model.load_state_dict(checkpoint_load['model'])
    
# Transferring the model to the GPU if available
model = model.to(DEVICE)

The code looks generally alright; at least I cannot find any obvious errors.
Could you post your setup (hyperparameters of the optimizer, learning rate scheduler, etc.) so that we could try to reproduce this issue with a dummy model?
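Something along these lines would be enough as a starting point (just a sketch with a toy model, random data and made-up hyperparameters, not your actual setup): train for a few epochs, save the checkpoint, rebuild the model/optimizer/scheduler, load the states, and compare the loss on the same batch before and after the reload.

import torch
from torch import nn
from torch.optim import lr_scheduler

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

# Train the toy model for a few epochs
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

# Loss of the trained model on the fixed batch, then save a checkpoint
with torch.no_grad():
    print('loss before reload:', criterion(model(x), y).item())
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'scheduler': scheduler.state_dict()}, 'tmp_checkpoint.pth')

# Rebuild everything from scratch and restore the saved states
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)
checkpoint = torch.load('tmp_checkpoint.pth')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
scheduler.load_state_dict(checkpoint['scheduler'])

# Should print the same value as before the reload
with torch.no_grad():
    print('loss after reload:', criterion(model(x), y).item())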


We have exactly the same issue, except that we do not use a scheduler, since we are using Adam. We also use L2 regularization via the weight_decay parameter. Did you find a solution @marcelodiaz @ptrblck?

Hi, I didn’t find an obvious solution for this problem, since I had limited time at the moment and decided to acquire the infrastructure to train in a single pass. However, post the code showing how you’re handling the checkpoint so we can check whether anything is wrong. Also try using the same random seed for the whole training process: if you’re using any special sampler or operation that depends on randomness, a change of seed could affect the performance.

This is the code I use for seeding everything:

import os
import random

import numpy as np
import torch


def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # Force cuDNN to pick deterministic algorithms
    torch.backends.cudnn.deterministic = True


seed_everything(123)
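If the shuffling after the restart turns out to matter, another thing you could try (just a sketch, not something I have in my own pipeline) is to store the RNG states inside the checkpoint and restore them when resuming, so the random stream continues where it left off:

# When building the checkpoint dict, also store the RNG states
checkpoint['rng_states'] = {
    'python': random.getstate(),
    'numpy': np.random.get_state(),
    'torch': torch.get_rng_state(),
    'cuda': torch.cuda.get_rng_state_all(),
}

# When resuming, restore them right after loading the checkpoint
rng = checkpoint['rng_states']
random.setstate(rng['python'])
np.random.set_state(rng['numpy'])
torch.set_rng_state(rng['torch'])
torch.cuda.set_rng_state_all(rng['cuda'])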

Thanks for your feedback. I will try to use the same seed and will keep you posted.