I had successfully staggered model training, saving state after every epoch using the standard PyTorch method (saving the state_dict, optimizer state, and so on to a dictionary and serializing it), for separate Encoder and Decoder objects.
However, after incorporating these separate objects as layers in an aggregate ‘Model’ object and trying to save and load its state after a stretch of training, the loss in the next epoch resets as though training had started from scratch.
I am doing everything as before except that I have nested PyTorch modules, and train the model which initialises and incorporates the encoder and decoder.
Am I incorrect in thinking I should be able to nest modules in this way and save/load the state of the module which contains the others to resume training?
What’s especially odd is that training is working with this setup, but just won’t resume.
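For reference, the save/resume pattern I mean is roughly the following minimal sketch (the `save_checkpoint`/`load_checkpoint` helper names and the checkpoint path are just placeholders for my setup):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    # Standard PyTorch pattern: serialize the state_dicts, not the objects themselves
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore parameters and optimizer state in place, return the epoch to resume from
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```

This works as expected for the standalone Encoder and Decoder; the problem only appears with the nested Model.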
Can you give a code snippet of your main model so we can see how you are nesting the models? As instance variables, using nn.ModuleList, etc.?
Hi Arul, sure can:
class Model(nn.Module):
    def __init__(self, word_dim, annotation_dim, attention_dim, hidden_dim, context_dim, vocab_size=111):
        super().__init__()
        self.word_dim = word_dim
        self.annotation_dim = annotation_dim
        self.attention_dim = attention_dim
        self.hidden_dim = hidden_dim
        self.context_dim = context_dim
        self.vocab_size = vocab_size
        self.encoder = Encoder().to(device)  # FCN encoder
        self.decoder = Decoder(word_dim=self.word_dim, annotation_dim=self.annotation_dim,
                               context_dim=self.context_dim,
                               vocab_size=self.vocab_size).to(device)  # Decoder with custom Attention model and GRU

    def forward(self, image_batch, label):
        # fcn_output is raw output for attention visualization
        # alphas are attention weights for attention visualization
        fcn_output, annotations = self.encoder(image_batch.unsqueeze(0).float().to(device))
        # predictions for this sequence (set of probabilities)
        alphas, predictions = self.decoder(annotations, self.decoder.initHidden(), label)
        return fcn_output, alphas, predictions
I’ve also noticed that looking at named_parameters, the parameters of the Decoder’s nested Attention Model and GRU are missing.
Think I may have solved it. Custom weights defined in the Decoder’s internal GRU and attention model (both self-written) were not defined using nn.Parameter(), but simply as plain tensors. I just wrapped one weight from the attention model in nn.Parameter() and it now appears in the set of named parameters:
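To illustrate the difference with a minimal sketch (the `PlainWeights`/`RegisteredWeights` module names and the `W` weight are hypothetical stand-ins for my attention model's weights):

```python
import torch
import torch.nn as nn

class PlainWeights(nn.Module):
    def __init__(self):
        super().__init__()
        # Bare tensor: it can still be trained if passed to the optimizer by hand,
        # but the module does not register it, so named_parameters() and
        # state_dict() never see it
        self.W = torch.randn(4, 4, requires_grad=True)

class RegisteredWeights(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter registers the tensor with the module on assignment
        self.W = nn.Parameter(torch.randn(4, 4))

print([name for name, _ in PlainWeights().named_parameters()])       # []
print([name for name, _ in RegisteredWeights().named_parameters()])  # ['W']
```

This would explain why training worked (I was updating the bare tensors directly) while saving and resuming did not: the unregistered weights were simply absent from the checkpoint.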
I’ll keep you updated. If updating all parameters as such works, there is still the issue of whether state_dict() will recover these parameters, but I suspect it will in the same fashion as named_parameters().
Yes. I think you spotted the issue. The custom parameters have to be wrapped with
nn.Parameter(). Saving and loading state_dict should work with this as well.
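A quick sketch of why this fixes save/resume (the `AttentionLike` module and its `W_a` weight are hypothetical): once a weight is an nn.Parameter, it is included in state_dict(), so load_state_dict() restores it into a fresh instance.

```python
import torch
import torch.nn as nn

class AttentionLike(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered weight, so it round-trips through state_dict()
        self.W_a = nn.Parameter(torch.randn(8, 8))

src = AttentionLike()
dst = AttentionLike()          # freshly initialized, different random weights
dst.load_state_dict(src.state_dict())  # copies W_a because it is registered
assert torch.equal(src.W_a, dst.W_a)
```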
Awesome, thanks for taking a look.
I did take a look before and the code looked OK to me. I wouldn’t have been able to spot the bug in code that wasn’t posted anyway.
Tried training for a couple of epochs, stopping, saving state, and resuming, and this was indeed the issue. The loss now continues to descend as expected with training staggered across runs.