Hi all, I am training a model, and every n iterations I save a checkpoint to disk with the following code:
import os
import torch

# epoch, i, model, optimizer, and model_dir come from the training loop
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict()
}, os.path.join(model_dir, f'model_{epoch}_{i}.pth'))
Everything works fine and the model trains. But when I try to load the model from disk, the Jupyter notebook kernel crashes with the error "Disposing session as kernel process died ExitCode: 3221225477, Reason: 0.00s - Debugger warning: It seems that frozen modules are being used, which may…". From what I have observed, this error tends to occur when memory runs out. How can I optimize model saving/loading?
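For context, this is roughly how I load (a sketch; the filename is just an example, and map_location='cpu' keeps tensors on the CPU during deserialization, which should reduce the peak GPU memory during torch.load):

import torch

# Load the checkpoint onto the CPU first to avoid a GPU memory spike,
# then move the model to the target device afterwards.
checkpoint = torch.load('model_5_6000.pth', map_location='cpu')  # example filename
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
model.to(device)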
Edit1: It looks like it is specifically the most recent checkpoint that fails to load. Earlier I saved every 6000 batches and had only one saved model file, and the error occurred on it. Now I tried saving every 80 batches (for the sake of experiment): the most recent file did not load, but the one before it loaded fine. So apparently it's not an out-of-memory error but a serialization error?
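If the latest file is corrupt because the process died mid-write, then writing to a temporary file and renaming it into place should guarantee that the final checkpoint path is either complete or absent. A sketch (the save_checkpoint_atomically helper is my own, not part of torch):

import os
import torch

def save_checkpoint_atomically(state, path):
    # Write to a temp file first, then atomically swap it into place.
    # If the process dies mid-write, only the .tmp file is left
    # half-written; the real checkpoint path is never corrupt.
    tmp_path = path + '.tmp'
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

save_checkpoint_atomically({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict()
}, os.path.join(model_dir, f'model_{epoch}_{i}.pth'))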