I had been successfully staggering model training: after every epoch I saved state using the standard PyTorch method (collecting the state_dict, optimizer state, and so on into a dictionary and serialising it) for separate Encoder and Decoder objects.
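For reference, this is roughly the checkpointing pattern I mean. It's a minimal sketch, assuming stand-in Encoder/Decoder modules (the real ones are more involved) and a hypothetical `checkpoint.pt` filename:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for my Encoder/Decoder modules.
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

encoder, decoder = Encoder(), Decoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

# Save after each epoch, as in the standard PyTorch recipe.
torch.save({
    "encoder": encoder.state_dict(),
    "decoder": decoder.state_dict(),
    "optimizer": opt.state_dict(),
}, "checkpoint.pt")
```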
However, after incorporating these separate objects as layers in an aggregate ‘Model’ object and trying to save and load its state after a stretch of training, the loss in the next epoch goes back to where it would be if I had started training from scratch.
I am doing everything as before except that I have nested PyTorch modules, and train the model which initialises and incorporates the encoder and decoder.
Am I incorrect in thinking I should be able to nest modules in this way and save/load the state of the module which contains the others to resume training?
What’s especially odd is that training is working with this setup, but just won’t resume.
Think I may have solved it. Custom weights in the Decoder’s internal GRU and attention model (both self-written) were not created with nn.Parameter(), but simply as plain tensor attributes. I wrapped one weight from the attention model in nn.Parameter() and it now appears in the set of named parameters:
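A minimal sketch of what I mean, using a hypothetical `Attn` module: a raw tensor attribute is invisible to `named_parameters()` (and hence to `state_dict()`), while the same tensor wrapped in `nn.Parameter()` is registered automatically by `nn.Module`:

```python
import torch
import torch.nn as nn

# Hypothetical attention module illustrating the fix.
class Attn(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.v_raw = torch.rand(hidden_size)            # NOT registered
        self.v = nn.Parameter(torch.rand(hidden_size))  # registered

attn = Attn()
names = [n for n, _ in attn.named_parameters()]
print(names)  # only 'v' shows up; 'v_raw' is missing
```

So any weight left as a plain tensor is neither optimised via `model.parameters()` nor saved in the checkpoint.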
I’ll keep you updated. If wrapping all the custom weights this way works, there is still the question of whether state_dict() will recover these parameters, but I suspect it will, in the same fashion as named_parameters().
Yes. I think you spotted the issue. The custom parameters have to be wrapped with nn.Parameter(). Saving and loading state_dict should work with this as well.
Tried training for a couple of epochs, stopping, saving state, and resuming, and indeed this was the issue. Loss now continues to descend as expected when training is staggered.
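For anyone landing here later, this is a sketch of the resume pattern that now works, assuming a hypothetical aggregate Model with a custom weight wrapped in nn.Parameter():

```python
import torch
import torch.nn as nn

# Hypothetical aggregate model wrapping encoder/decoder submodules.
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)
        # Custom weight: wrapped so it is saved/loaded with state_dict().
        self.w = nn.Parameter(torch.rand(8))

model = Model()
opt = torch.optim.Adam(model.parameters())
torch.save({"model": model.state_dict(), "optimizer": opt.state_dict()},
           "resume_checkpoint.pt")

# Later: rebuild the objects and restore state before resuming training.
model2 = Model()
opt2 = torch.optim.Adam(model2.parameters())
ckpt = torch.load("resume_checkpoint.pt")
model2.load_state_dict(ckpt["model"])
opt2.load_state_dict(ckpt["optimizer"])
```

Because `w` is an `nn.Parameter`, it travels through `state_dict()`/`load_state_dict()` along with the nested submodules' weights, so the restored model picks up exactly where the saved one left off.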