I’m working on a research computer vision project and I have a problem that prevents me from resuming training properly after a crash or interrupt: my training loss increases on resume. This is my code to load the checkpoint:
from torch.optim import lr_scheduler
N_EPOCHS = 120
if load_weights:
    optimizer = torch.optim.Adam(model.parameters(), lr=checkpoint['last_lr'], weight_decay=weight_decay)
    optimizer.load_state_dict(checkpoint['optimizer'])
    scheduler = lr_scheduler.StepLR(optimizer, step_size=step_of_scheduler, gamma=0.9, last_epoch=loaded_epochs)
    scheduler.load_state_dict(checkpoint['scheduler'])
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr, weight_decay=weight_decay)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=step_of_scheduler, gamma=0.9)
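In the load_weights branch, a quick check along these lines can compare the restored learning rate against the saved one (checkpoint['last_lr'] is the key from my save code below):

# Sanity check: the LR restored into the optimizer/scheduler should match the saved value
print('optimizer lr :', optimizer.param_groups[0]['lr'])
print('scheduler lr :', scheduler.get_last_lr())  # list, one entry per param group
print('checkpoint lr:', checkpoint['last_lr'])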
I even tried adding

optimizer.state_dict()['param_groups'][0]['params'] = checkpoint['optimizer']['param_groups'][0]['params']

but the result is even worse. This is my piece of code to save the checkpoint after validation:
# Checkpoint
checkpoint = {'model': model.state_dict(),
              'epoch': loaded_epochs + epoch + 1,
              'last_validation_acc': val_acc_db_avg[-1],
              'hyperparameters': hyperparameters,
              'last_lr': optimizer.param_groups[0]['lr'],
              'best_general_val': max(val_acc_db_avg),
              'last_train_loss': train_loss_db[-1],
              'optimizer': optimizer.state_dict(),
              'scheduler': scheduler.state_dict(),
              'img_resize': 512 if lr_scaled else 256,
              'best_train_acc': max(train_acc_db),
              'lr_scaled': lr_scaled
              }
# Backup in drive
torch.save(checkpoint, filename_of_checkpoint)
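One thing I have not ruled out is nondeterminism on resume (dataloader shuffling order, dropout). A sketch of the extra bookkeeping I could add to the checkpoint; the 'rng_state' key and its contents are an addition I haven't tried yet, not something my checkpoint has today:

import random
import numpy as np

# Hypothetical extra key: capture all RNG states at save time
checkpoint['rng_state'] = {
    'torch': torch.get_rng_state(),
    'cuda': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    'numpy': np.random.get_state(),
    'python': random.getstate(),
}

# ...and restore them right after loading the checkpoint on resume
rng = checkpoint.get('rng_state')
if rng is not None:
    torch.set_rng_state(rng['torch'])
    if rng['cuda'] is not None:
        torch.cuda.set_rng_state_all(rng['cuda'])
    np.random.set_state(rng['numpy'])
    random.setstate(rng['python'])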
I did make sure to call model.train() before training starts, and I also call scheduler.step(). That said, the training loss jumps from 0.81 to 0.89 on resume, or to 0.90 if I manually assign the params. I don't know what might be going wrong here, since I also load the model state dict properly.
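For reference, the per-epoch order in my training loop is roughly this condensed sketch (train_one_epoch and validate are placeholders for my actual functions, and loaded_epochs is 0 on a fresh run):

for epoch in range(N_EPOCHS - loaded_epochs):
    model.train()                                  # train mode before each epoch
    train_loss = train_one_epoch(model, optimizer)
    model.eval()
    val_acc = validate(model)
    scheduler.step()                               # step the LR schedule once per epoch
    # ...checkpoint saved here, after validation (see above)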
This is how I load my model state dict:
if load_weights:
    model.load_state_dict(checkpoint_load['model'])
# Transferring model to GPU if available
model = model.to(DEVICE)
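Putting it together, the overall order I follow on resume is roughly this condensed sketch (same names as above, with checkpoint for the loaded dict; map_location keeps the loaded tensors on the right device):

checkpoint = torch.load(filename_of_checkpoint, map_location=DEVICE)
loaded_epochs = checkpoint['epoch']

model.load_state_dict(checkpoint['model'])
model = model.to(DEVICE)  # on the device before the optimizer is built

optimizer = torch.optim.Adam(model.parameters(), lr=checkpoint['last_lr'],
                             weight_decay=weight_decay)
optimizer.load_state_dict(checkpoint['optimizer'])  # restores lr and Adam moments

scheduler = lr_scheduler.StepLR(optimizer, step_size=step_of_scheduler, gamma=0.9,
                                last_epoch=loaded_epochs)
scheduler.load_state_dict(checkpoint['scheduler'])  # restores last_epoch internally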