I’m trying to do multiclass classification on sentences. The input to my RNN (LSTM or GRU) is a batch of variable-length sequences of token indices into GloVe embeddings, right-padded with zeros. The redefined forward for my GRU RNN is:

```
def last_timestep(self, unpacked, lengths):
    # Index of the last output for each sequence
    idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0), unpacked.size(2)).unsqueeze(1)
    return unpacked.gather(1, idx).squeeze()

def forward(self, x, lengths, **kwargs):
    """Forward propagation of activations"""
    if self.gpu:
        x = Variable(x).cuda()
        lengths = Variable(lengths).cuda()
    else:
        x = Variable(x)
        lengths = Variable(lengths)
    # batch_size = int(x.size()[0])
    # h_0 = Variable(torch.zeros(self.total_layers, batch_size, self.hidden_size)).cuda()
    # Embed and pack the padded sequence
    embs = self.embeddings(x)
    packed = pack_padded_sequence(embs, list(lengths.data), batch_first=True)
    out_packed, _ = self.gru(packed)
    out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
    out_last = self.last_timestep(out_unpacked, lengths)
    output = self.lin(out_last)
    return output
```
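
For context, here is a minimal sketch of how a right-padded batch and its `lengths` tensor, in the shape `forward` expects, might be built. The token indices are made up for illustration, and using `pad_sequence` is an assumption; the actual data pipeline is not shown in this post.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-index sequences (index 0 is reserved for padding).
sentences = [[3, 8, 6], [4, 7, 2, 9, 5], [11, 2]]

# Sort by decreasing length, which pack_padded_sequence expects by default.
sentences = sorted(sentences, key=len, reverse=True)
lengths = torch.tensor([len(s) for s in sentences])

# Right-pad with zeros up to the longest sequence: shape (batch, max_len).
x = pad_sequence([torch.tensor(s) for s in sentences],
                 batch_first=True, padding_value=0)

print(x)
print(lengths)
```

The `x` and `lengths` tensors produced this way line up row by row, so `lengths[i]` is the number of real (non-pad) tokens in `x[i]`.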

For training, I’m using CrossEntropyLoss. However, when I test prediction, the model always predicts the same class regardless of the input sentence. Moreover, the final output of the network (the variable `output`) is almost the same across inputs. On closer inspection, the problem seems to be in backpropagation: the gradients are very small (on the order of 10^-3, and some are much lower) for many of the parameters. I’m also not sure whether the packing and padding are helping at all. I’ve tried running the code without any packing (just running forward on the padded input) and I get the same output, which leads me to believe that I’m doing something wrong with packing and unpacking. I’d really appreciate any help. Thank you!
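
A minimal sketch of the packed-versus-padded comparison and the gradient check described above, assuming a toy single-layer unidirectional GRU with made-up sizes and labels (not the actual model). In a unidirectional RNN the output at step t depends only on steps up to t, so gathering at `lengths - 1` should give the same vectors with or without packing; only reading the final column of the padded output would differ for the shorter sequences.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy sizes, chosen only for illustration (not taken from the real model).
vocab_size, emb_dim, hidden_size, num_classes = 20, 8, 16, 3
embeddings = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
gru = nn.GRU(emb_dim, hidden_size, batch_first=True)
lin = nn.Linear(hidden_size, num_classes)

# Right-padded batch sorted by decreasing length; index 0 is the pad token.
x = torch.tensor([[4, 7, 2, 9, 5],
                  [3, 8, 6, 0, 0],
                  [11, 2, 0, 0, 0]])
lengths = torch.tensor([5, 3, 2])
embs = embeddings(x)

# Packed run: the GRU stops at each sequence's true length.
packed = pack_padded_sequence(embs, lengths, batch_first=True)
out_packed, h_n = gru(packed)
out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)

# Gather the output at position length - 1 for every sequence.
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, hidden_size)
last_packed = out_unpacked.gather(1, idx).squeeze(1)

# Padded run: the GRU also steps through the pad positions.
out_padded, _ = gru(embs)
last_gathered = out_padded.gather(1, idx).squeeze(1)  # same positions as above
last_naive = out_padded[:, -1, :]                     # final column, pads included

print("packed == gathered padded:", torch.allclose(last_packed, last_gathered, atol=1e-5))
print("packed == final column:   ", torch.allclose(last_packed, last_naive, atol=1e-5))

# One backward pass with CrossEntropyLoss to inspect gradient magnitudes.
targets = torch.tensor([0, 2, 1])  # made-up labels
loss = nn.CrossEntropyLoss()(lin(last_packed), targets)
loss.backward()
grad_norms = {name: p.grad.norm().item() for name, p in gru.named_parameters()}
for name, g in grad_norms.items():
    print(f"{name}: {g:.2e}")
```

If the first comparison holds on the real model as well, identical outputs with and without packing would be expected behavior rather than a sign of a packing bug, and the per-parameter gradient norms are then the more informative thing to look at.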