Hi all, I am training a model, and every n iterations I save a checkpoint to disk with the following code:
import os
import torch

# epoch, i, model, optimizer, and model_dir come from the training loop
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict()
}, os.path.join(model_dir, f'model_{epoch}_{i}.pth'))
Everything works fine and the model trains. But when I try to load the model from disk, the Jupyter notebook kernel crashes with the error "Disposing session as kernel process died ExitCode: 3221225477, Reason: 0.00s - Debugger warning: It seems that frozen modules are being used, which may…". From what I have observed, this error tends to occur when memory runs out. How can I optimize model saving/loading?
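For context, this is roughly how I load (a sketch; the filename is just an example, and map_location='cpu' keeps tensors on the CPU during deserialization, which should reduce the peak GPU memory during torch.load):

import torch

# Load the checkpoint onto the CPU first to avoid a GPU memory spike,
# then move the model to the target device afterwards.
checkpoint = torch.load('model_5_6000.pth', map_location='cpu')  # example filename
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
model.to(device)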
Edit1: It looks like it is specifically the most recent checkpoint that fails to load. Earlier I saved every 6000 batches and had only one saved model file, and the error occurred on it. Now I tried saving every 80 batches (for the sake of experiment): the most recent file did not load, but the one before it loaded fine. So apparently it's not an out-of-memory error but a serialization error?
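If the latest file is corrupt because the process died mid-write, then writing to a temporary file and renaming it into place should guarantee that the final checkpoint path is either complete or absent. A sketch (the save_checkpoint_atomically helper is my own, not part of torch):

import os
import torch

def save_checkpoint_atomically(state, path):
    # Write to a temp file first, then atomically swap it into place.
    # If the process dies mid-write, only the .tmp file is left
    # half-written; the real checkpoint path is never corrupt.
    tmp_path = path + '.tmp'
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

save_checkpoint_atomically({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict()
}, os.path.join(model_dir, f'model_{epoch}_{i}.pth'))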