I had successfully staggered model training, saving state after every epoch using the standard PyTorch method (saving the state_dict, optimizer state, and so on to a dictionary and serializing it), for separate Encoder and Decoder objects.
However, after incorporating these separate objects as layers in an aggregate ‘Model’ object and trying to save and load its state after a stretch of training, the loss in the next epoch resets as though training had started from scratch.
I am doing everything as before except that I have nested PyTorch modules, and train the model which initialises and incorporates the encoder and decoder.
Am I incorrect in thinking I should be able to nest modules in this way and save/load the state of the module which contains the others to resume training?
What’s especially odd is that training is working with this setup, but just won’t resume.
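For reference, the save/resume pattern I mean is roughly the following minimal sketch (the `save_checkpoint`/`load_checkpoint` helper names and the checkpoint path are just placeholders for my setup):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    # Standard PyTorch pattern: serialize the state_dicts, not the objects themselves
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore parameters and optimizer state in place, return the epoch to resume from
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```

This works as expected for the standalone Encoder and Decoder; the problem only appears with the nested Model.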
Can you give a code snippet of your main model so we can see how you are nesting the models? As instance variables, using nn.ModuleList, etc.?
Hi Arul, sure can:
class Model(nn.Module):
    def __init__(self, word_dim, annotation_dim, attention_dim, hidden_dim, context_dim, vocab_size=111):
        super().__init__()
        self.word_dim = word_dim
        self.annotation_dim = annotation_dim
        self.attention_dim = attention_dim
        self.hidden_dim = hidden_dim
        self.context_dim = context_dim
        self.vocab_size = vocab_size
        self.encoder = Encoder().to(device)  # FCN encoder
        self.decoder = Decoder(word_dim=self.word_dim, annotation_dim=self.annotation_dim,
                               context_dim=self.context_dim,
                               vocab_size=self.vocab_size).to(device)  # Decoder with custom Attention model and GRU

    def forward(self, image_batch, label):
        # fcn_output is raw output for attention visualization
        # alphas are attention weights for attention visualization
        fcn_output, annotations = self.encoder(image_batch.unsqueeze(0).float().to(device))
        # predictions for this sequence (set of probabilities)
        alphas, predictions = self.decoder(annotations, self.decoder.initHidden(), label)
        return fcn_output, alphas, predictions
I’ve also noticed that looking at named_parameters, the parameters of the Decoder’s nested Attention Model and GRU are missing.
Think I may have solved it. Custom weights defined in the Decoder’s internal GRU and attention model (both self-written) were not defined using nn.Parameter(), but simply as plain tensors. I just wrapped one weight from the attention model in nn.Parameter() and it now appears in the set of named parameters:
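To illustrate the difference with a minimal sketch (the `PlainWeights`/`RegisteredWeights` module names and the `W` weight are hypothetical stand-ins for my attention model's weights):

```python
import torch
import torch.nn as nn

class PlainWeights(nn.Module):
    def __init__(self):
        super().__init__()
        # Bare tensor: it can still be trained if passed to the optimizer by hand,
        # but the module does not register it, so named_parameters() and
        # state_dict() never see it
        self.W = torch.randn(4, 4, requires_grad=True)

class RegisteredWeights(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter registers the tensor with the module on assignment
        self.W = nn.Parameter(torch.randn(4, 4))

print([name for name, _ in PlainWeights().named_parameters()])       # []
print([name for name, _ in RegisteredWeights().named_parameters()])  # ['W']
```

This would explain why training worked (I was updating the bare tensors directly) while saving and resuming did not: the unregistered weights were simply absent from the checkpoint.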
I’ll keep you updated. If updating all parameters as such works, there is still the issue of whether state_dict() will recover these parameters, but I suspect it will in the same fashion as named_parameters().
Yes. I think you spotted the issue. The custom parameters have to be wrapped with
nn.Parameter(). Saving and loading state_dict should work with this as well.
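A quick sketch of why this fixes save/resume (the `AttentionLike` module and its `W_a` weight are hypothetical): once a weight is an nn.Parameter, it is included in state_dict(), so load_state_dict() restores it into a fresh instance.

```python
import torch
import torch.nn as nn

class AttentionLike(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered weight, so it round-trips through state_dict()
        self.W_a = nn.Parameter(torch.randn(8, 8))

src = AttentionLike()
dst = AttentionLike()          # freshly initialized, different random weights
dst.load_state_dict(src.state_dict())  # copies W_a because it is registered
assert torch.equal(src.W_a, dst.W_a)
```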
Awesome, thanks for taking a look.
I did take a look before and the code looked OK to me. I wouldn’t have been able to spot the bug in code that wasn’t posted anyway.
Tried training for a couple of epochs, stopping, saving state, and resuming, and this was indeed the issue. The loss now continues to descend as expected with training staggered across runs.