Loading model state stops working after nesting modules

I had successfully staggered model training, saving state after every epoch using the standard PyTorch method (writing the state_dict, optimizer state, and so on to a dictionary and serializing it), for separate Encoder and Decoder objects.
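For reference, the per-epoch checkpointing described above looks roughly like this (a minimal sketch; the file name, key names, and model are illustrative, not taken from the actual code):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the Encoder or Decoder
optimizer = torch.optim.Adam(model.parameters())

# Save: collect everything needed to resume into one dictionary
checkpoint = {
    "epoch": 5,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Resume: restore model and optimizer state, continue from the next epoch
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"] + 1
```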

However, after incorporating these separate objects as layers in an aggregate Model object and trying to save and load its state after a stretch of training, the loss in the next epoch resets as though I had started training from scratch.

I am doing everything as before, except that the PyTorch modules are now nested and I train the outer model, which initialises and contains the encoder and decoder.

Am I incorrect in thinking I should be able to nest modules in this way and save/load the state of the module which contains the others to resume training?

What’s especially odd is that training is working with this setup, but just won’t resume.

Can you give a code snippet of your main model so we can see how you are nesting the models? As instance variables, with nn.ModuleList, etc.?

Hi Arul, sure can:

class Model(nn.Module):
  def __init__(self, word_dim, annotation_dim, attention_dim, hidden_dim, context_dim, vocab_size=111):
    super(Model, self).__init__()
    self.word_dim = word_dim
    self.annotation_dim = annotation_dim
    self.attention_dim = attention_dim
    self.hidden_dim = hidden_dim
    self.context_dim = context_dim
    self.vocab_size = vocab_size
    self.encoder = Encoder().to(device) #FCN encoder
    self.decoder = Decoder(word_dim=self.word_dim, annotation_dim=self.annotation_dim, 
                           attention_dim=self.attention_dim, hidden_dim=self.hidden_dim, 
                           context_dim=self.context_dim, vocab_size=self.vocab_size).to(device) #Decoder with custom Attention model and GRU
  def forward(self, image_batch, label):
    #fcn_output is raw output for attention visualization
    #alphas are attention weights for attention visualization
    fcn_output, annotations = self.encoder(image_batch.unsqueeze(0).float().to(device))
    alphas, predictions = self.decoder(annotations, self.decoder.initHidden(), label) #predictions for this sequence (set of probabilities)
    return fcn_output, alphas, predictions

I’ve also noticed, looking at named_parameters(), that the parameters of the Decoder’s nested Attention model and GRU are missing.

Think I may have solved it. The custom weights defined in the Decoder’s internal GRU and attention model (both self-written) were not created with nn.Parameter(), but simply as plain tensor variables. I wrapped one weight from the attention model in nn.Parameter() and it now appears in the set of named parameters.


I’ll keep you updated. If wrapping all of the parameters this way works, there is still the question of whether state_dict() will recover them, but I suspect it will, in the same fashion as named_parameters().
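The difference between a plain tensor attribute and one wrapped in nn.Parameter() can be demonstrated with a tiny sketch (the class and attribute names here are hypothetical stand-ins, not the actual attention model):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Hypothetical stand-in for the self-written attention model."""
    def __init__(self, dim):
        super().__init__()
        # A plain tensor attribute: PyTorch does NOT register it,
        # so it is invisible to named_parameters() and state_dict()
        self.w_plain = torch.randn(dim, dim)
        # Wrapped in nn.Parameter: registered automatically
        self.w_wrapped = nn.Parameter(torch.randn(dim, dim))

attn = Attention(3)
param_names = [name for name, _ in attn.named_parameters()]
state_keys = list(attn.state_dict().keys())
# both lists contain "w_wrapped" but not "w_plain"
```

This also means the optimizer never sees the plain tensors (model.parameters() skips them), so they would neither be trained by the optimizer nor saved.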

Yes, I think you spotted the issue. Custom parameters have to be wrapped in nn.Parameter(); saving and loading the state_dict should then work as well.
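To see that a wrapped parameter survives a state_dict round trip, here is a minimal sketch (the CustomGRU class is a hypothetical stand-in for the self-written GRU):

```python
import torch
import torch.nn as nn

class CustomGRU(nn.Module):
    """Hypothetical stand-in for the self-written GRU."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(2, 2))

trained, fresh = CustomGRU(), CustomGRU()
with torch.no_grad():
    trained.w.fill_(1.0)  # pretend training changed the weight

# Because w is an nn.Parameter, it is carried by state_dict()
fresh.load_state_dict(trained.state_dict())
# fresh.w now matches trained.w
```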

Good luck!


Awesome, thanks for taking a look. :slight_smile:

I did take a look before and the code looked ok to me. I wouldn’t have spotted the bug in invisible code anyway :slight_smile:

Tried training for a couple of epochs, stopping, saving state, and resuming, and indeed this was the issue. Loss now continues to descend as expected with training staggered. :tada: