Setting requires_grad to False but training is still slow

class LCNPModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, inputs):
        super(LCNPModel, self).__init__()

        self.encoder_nt = nn.Embedding(self.nnt, self.dnt)
        self.word2vec_plus = nn.Embedding(self.nt, self.dt)
        self.word2vec = nn.Embedding(self.nt, self.dt)

        self.LSTM = nn.LSTM(self.dt, self.dhid, self.nlayers, batch_first=True, bias=True)
        # the initial states h0 and c0 of the LSTM
        self.h0 = (Variable(torch.zeros(self.nlayers, self.bsz, self.dhid)),
                   Variable(torch.zeros(self.nlayers, self.bsz, self.dhid)))

        # keep only the parameters that require gradients
        self.l2 = itertools.ifilter(lambda p: p.requires_grad, self.parameters())

    def init_weights(self, initrange=1.0):
        self.word2vec.weight.requires_grad = False
        self.encoder_nt.weight.requires_grad = False

Hi, above is part of my code. I have two embeddings, word2vec and word2vec_plus. I would like an embedding that is initialized from pretrained word2vec vectors but then trained further. In the loss, I take the L2 norm of the difference between the current embedding and the original word2vec embedding as a penalty, which makes sense since I don't want the newly trained embedding to drift too far from the pretrained one.

My problem is: when I set word2vec.weight.requires_grad to False and optimize only the parameters that require gradients, everything works, but training becomes very slow after the first epoch. However, if I comment out everything involving word2vec and only use word2vec_plus, everything is fast. Since word2vec is effectively just a constant in my code, it should not slow down training.

So my question is: is there any way to speed this up, or is there something I am doing wrong?

Thanks a lot!
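One way to make the frozen copy a true constant (a sketch with hypothetical names, not the original model) is to store the pretrained snapshot with `register_buffer` rather than as an `nn.Embedding` parameter: a buffer is saved with the model but is never returned by `parameters()` and never tracked by autograd, so the penalty only differentiates through the live embedding.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Minimal sketch: one trainable embedding plus a constant snapshot."""

    def __init__(self, vocab_size=5, dim=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # keeps training
        # a buffer is a constant: saved in state_dict, but not a Parameter
        # and not part of the autograd graph
        self.register_buffer("pretrained", self.embed.weight.data.clone())

    def l2_penalty(self):
        # distance of the current weights from the frozen snapshot
        return (self.embed.weight - self.pretrained).pow(2).sum()

model = TinyModel()
# hand the optimizer only the tensors that actually require gradients
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
```

Right after construction the penalty is zero, and it grows as `embed.weight` moves away from the snapshot; the buffer itself never receives gradients.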


It’s probably the L2 norm distance between the two embedding matrices that’s taking forever to calculate, and there isn’t really a way around that for now (you may update only parts of the word2vec_plus embedding matrix at each iteration but you have to recompute the L2 norm over the whole matrix).


The problem is that the word2vec embedding is a constant, since I don't require a gradient for it, and I optimize over everything else. So I would expect that commenting it out makes no difference (the optimizer has nothing to do with it), yet commenting it out still gives a clear speedup. So it seems that even when requires_grad is False, the optimizer, for some reason, still looks at it?
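As a quick sanity check that `requires_grad=False` really does exclude a tensor from backpropagation, here is a minimal standalone example: the frozen tensor accumulates no gradient at all, while the trainable one does.

```python
import torch

w_frozen = torch.zeros(3, requires_grad=False)  # plays the role of word2vec
w_train = torch.ones(3, requires_grad=True)     # plays the role of word2vec_plus

# L2 distance between the two, as in the penalty above
loss = ((w_train - w_frozen) ** 2).sum()
loss.backward()

print(w_frozen.grad)  # None: no gradient is ever computed for the frozen tensor
print(w_train.grad)   # 2 * (w_train - w_frozen)
```

So the frozen tensor costs nothing in the backward pass itself; if filtered out of the optimizer's parameter list, the optimizer never iterates over it either, which suggests the slowdown comes from the forward-pass cost of the penalty rather than from the optimizer.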