RNN gradients for variable length input sequences are very small

I’m aiming to do multiclass classification on sentences. The input to my RNN (LSTM or GRU) is a batch of variable-length sequences (indexed using GloVe embeddings), right-padded with zeros. The redefined forward for my GRU RNN is:

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0), unpacked.size(2)).unsqueeze(1)
        return unpacked.gather(1, idx).squeeze()

    def forward(self, x, lengths, **kwargs):
        """Forward propagation of activations"""

        if self.gpu:
            x = Variable(x).cuda()
            lengths = Variable(lengths).cuda()
        else:
            x = Variable(x)
            lengths = Variable(lengths)

        # batch_size = int(x.size()[0])
        # h_0 = Variable(torch.zeros(self.total_layers, batch_size, self.hidden_size)).cuda()

        # Embed and pack the padded sequence
        embs = self.embeddings(x)
        packed = pack_padded_sequence(embs, list(lengths.data), batch_first=True)
        out_packed, _ = self.gru(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        out_last = self.last_timestep(out_unpacked, lengths)
        output = self.lin(out_last)
        return output
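
For reference, here is a minimal self-contained sketch of the same pack → GRU → unpack → gather-last-timestep pipeline on dummy data (all sizes and tensors below are made up purely for illustration, and it is written against a recent PyTorch version without Variable):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Placeholder sizes, for illustration only
vocab_size, emb_dim, hidden_size, num_classes = 100, 8, 16, 3

embeddings = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
gru = nn.GRU(emb_dim, hidden_size, batch_first=True)
lin = nn.Linear(hidden_size, num_classes)

# Two right-padded sequences of lengths 4 and 2 (index 0 is the pad token)
x = torch.tensor([[5, 7, 2, 9], [3, 4, 0, 0]])
lengths = torch.tensor([4, 2])  # sorted descending, as pack_padded_sequence expects by default

embs = embeddings(x)                                            # (batch, max_len, emb_dim)
packed = pack_padded_sequence(embs, lengths, batch_first=True)
out_packed, _ = gru(packed)
out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)

# Pick the output at the last real timestep of each sequence
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out_unpacked.size(2))
out_last = out_unpacked.gather(1, idx).squeeze(1)               # (batch, hidden_size)
logits = lin(out_last)
print(logits.shape)                                             # torch.Size([2, 3])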

As for training, I’m using CrossEntropyLoss. However, when I test prediction, the model always predicts the same class irrespective of the input sentence. Moreover, the final output of the RNN (the output variable above) is almost identical across inputs! On closer inspection, I’ve discovered that the problem is in backpropagation: the gradients are very small (on the order of 10^-3, and some are much lower) for many of the parameters. I’m also not sure whether the packing and padding are helping at all. I’ve tried running the code without any packing (just calling forward on the padded input) and I get the same output, which leads me to believe that I’m doing something wrong with packing and unpacking. I’d really appreciate any help. Thank you!
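
Roughly how I’m checking the gradients (just a sketch; model here is a placeholder for whatever nn.Module is being trained):

import torch

def report_grad_norms(model: torch.nn.Module) -> None:
    """Print the gradient norm of every parameter after a backward pass."""
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: no gradient")
        else:
            print(f"{name}: grad norm = {param.grad.norm().item():.3e}")

# Usage, inside the training loop:
#   loss.backward()
#   report_grad_norms(model)
#   optimizer.step()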

I have the same problem here. When I used pack_padded_sequence, the sum of some gradients was zero. So I decided to use padding without packing, but then the gradients of the layers are around 10^8, and the sum over the first weights (rnn.weight_ih_l0) == 0. Did you already find a solution?

Optimizer = Adam
Loss = CrossEntropy
Learning_Rates = 0.1 - 0.001
Batch_size = 10 - 40
Total Samples to train = 1100
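
For completeness, the loop itself is just the standard pattern below (model and data here are dummies, and the clip_grad_norm_ call is only added as a debugging guard for the ~10^8 gradients; it is not part of the original setup):

import torch
import torch.nn as nn

# Dummy model and data; only the optimizer/loss choices match the settings above
model = nn.Sequential(nn.Embedding(100, 8), nn.Flatten(), nn.Linear(8 * 4, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # tried 0.1 down to 0.001
criterion = nn.CrossEntropyLoss()

x = torch.randint(0, 100, (10, 4))   # batch of 10 dummy sequences of length 4
y = torch.randint(0, 3, (10,))       # dummy class labels

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # clip_grad_norm_ returns the total gradient norm, handy for spotting 1e8-sized gradients
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}, grad norm={float(total_norm):.3e}")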

Wish I had something new to add, but no. I stopped working on that a while back, and just came back to it. Maybe I’ll figure it out now!

I do not know if this is correct, but I think the squeeze() here should be done over the time dimension, which is 1 when batch_first=True. Another thing I noticed is that you do not pass lengths directly to pack_padded_sequence (you pass list(lengths.data)), which does not make much sense to me.
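
To illustrate the squeeze() point: after the gather the output has shape (batch, 1, hidden), and a bare squeeze() also drops the batch dimension whenever the batch happens to contain a single sequence, whereas squeeze(1) only removes the time dimension (toy example):

import torch

gathered = torch.zeros(1, 1, 16)    # (batch=1, time=1, hidden) as produced by gather()
print(gathered.squeeze(1).shape)    # torch.Size([1, 16]) -- batch dimension kept
print(gathered.squeeze().shape)     # torch.Size([16])    -- batch dimension gone as well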

I have the same issue, actually. Don’t know how to solve it yet.

I think I was passing list(lengths.data) to pack_padded_sequence because it was an old version of PyTorch, but right now I’m passing lengths and it works. squeeze() isn’t an issue for me, since I just want to get rid of the singleton dimension wherever it may be, but yeah, it should be at position 1.
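
For anyone landing here on a newer PyTorch: lengths can be passed directly as a list or a CPU tensor, and if the batch is not sorted by length you can set enforce_sorted=False (small sketch with dummy data):

import torch
from torch.nn.utils.rnn import pack_padded_sequence

embs = torch.randn(3, 5, 8)        # dummy (batch, max_len, emb_dim) embeddings
lengths = torch.tensor([2, 5, 3])  # unsorted lengths are fine with enforce_sorted=False
packed = pack_padded_sequence(embs, lengths, batch_first=True, enforce_sorted=False)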

I am facing the same problem with an LSTM. My gradients are on the order of 10^-9, so my model isn’t learning anything and predicts the same class in every iteration. Can someone please suggest a solution?