Multi-layered bidirectional LSTM doesn't learn very well

Hello,
The problem I am working on goes like this. I have variable-length sequences, and each timestep of a sequence has a label. There are 3 classes in total, so this is a multi-class classification problem at every timestep. This means my inputs are of size [Batch_size, Max_seq_len, 20] and my labels are of size [Batch_size, Max_seq_len, 1] for each batch, and Max_seq_len changes with each batch.

I’ve used pack_padded_sequence with batch_first=True to get a packed sequence that can be fed to the LSTM. The packed input to the LSTM therefore has data of shape [Sum_batch_seq_lens, 20] and the packed output has data of shape [Sum_batch_seq_lens, 2*lstm_dims] = [Sum_batch_seq_lens, 1024]. This is then passed through 3 dense layers to reduce the last dimension to 3, so the output of the BRNN class is [Sum_batch_seq_lens, 3]. A quick shape check is sketched right after this paragraph; the model code follows below.
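A minimal shape check with made-up lengths (toy tensors, not my real data), just to show what I expect from packing:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

padded = torch.zeros(4, 10, 20)            # [Batch_size, Max_seq_len, 20]
lengths = [10, 8, 5, 3]                    # sorted in decreasing order
packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data.shape)                   # torch.Size([26, 20]) = [Sum_batch_seq_lens, 20]
print(packed.batch_sizes)                  # tensor([4, 4, 4, 3, 3, 2, 2, 2, 1, 1])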

import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence

class BRNN(torch.nn.Module):

    def __init__(self, input_dims=20, num_lstms=2, lstm_dims=512, out_dims=3):
        super(BRNN, self).__init__()
        # Bidirectional LSTM, so each timestep's output has 2*lstm_dims features
        self.brnn = torch.nn.LSTM(input_size=input_dims, hidden_size=lstm_dims, num_layers=num_lstms, bias=True, batch_first=True, bidirectional=True)
        self.fc1 = torch.nn.Linear(in_features=2*lstm_dims, out_features=512)
        self.fc2 = torch.nn.Linear(in_features=512, out_features=256)
        self.fc3 = torch.nn.Linear(in_features=256, out_features=out_dims)

    def forward(self, padded_input, input_lengths):
        # Pack the padded batch so the LSTM ignores the padded timesteps
        output = pack_padded_sequence(padded_input, input_lengths, batch_first=True)
        output, _ = self.brnn(output)
        batch_sizes = output.batch_sizes
        # Dense layers operate directly on the packed data:
        # [Sum_batch_seq_lens, 2*lstm_dims] -> [Sum_batch_seq_lens, 3]
        output = F.relu(self.fc1(output.data))
        output = F.relu(self.fc2(output))
        output = F.softmax(self.fc3(output), dim=1)   # softmax over the 3 classes (see questions 4 and 5 below)
        return output, batch_sizes

To train the model, I use

import torch.optim as optim

net = BRNN()
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

for i, inputs in enumerate(X_train):
    # Pack the labels the same way as the inputs so they line up with the packed outputs
    labels = pack_padded_sequence(Y_train[i], seq_lens[i], batch_first=True)
    optimizer.zero_grad()
    outputs, batch_sizes = net(inputs, seq_lens[i])
    # labels.data is [Sum_batch_seq_lens, 1]; take the class-index column as the target
    loss = criterion(outputs, labels.data[:, 0])
    loss.backward()
    optimizer.step()
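
For reference, this is roughly how I check what the trained model predicts for a batch (simplified, reusing the variable names from the loop above):

with torch.no_grad():
    outputs, _ = net(inputs, seq_lens[i])
    preds = outputs.argmax(dim=1)      # [Sum_batch_seq_lens], predicted class per timestep
    print(torch.bincount(preds))       # per-class prediction counts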

After training the model, it predicts the same class, 2, for all inputs. I have tried changing the dropout of the LSTM layer, the learning rate of the optimizer, and batch_first, but every variant predicts a single class, either 2 or 0. The distribution of the classes throughout the dataset is {2: 745015, 0: 720913, 1: 439274}, i.e. class 2 occurs 745015 times, etc.
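
Given that distribution, one thing I have been wondering about (but have not tried, and I am not sure it is the right fix) is weighting the loss by inverse class frequency, something like:

counts = torch.tensor([720913., 439274., 745015.])     # occurrences of classes 0, 1, 2
weights = counts.sum() / (len(counts) * counts)         # rarer classes get larger weights
criterion = torch.nn.CrossEntropyLoss(weight=weights)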

I’m having trouble with the following

  1. After packing the padded inputs, should the torch.nn.LSTM layer have batch_first=True? (Because pack_padded_sequence gives the same result for both [B, T, *] and [T, B, *])
  2. Is the output of torch.nn.LSTM layer being fed correctly to the torch.nn.Linear layer? (As in, I believe that unpacking of the PackedSequence is not required here, but I may be wrong)
  3. What is the role of torch.nn.functional (F) vs. using a torch.nn layer? (Is it that Autograd will not consider the functional layer for auto differentiation?)
  4. Is the softmax functional required at the end of the forward function? (I did not see it being used in a couple of examples; see the quick check sketched after this list)
  5. Is CrossEntropyLoss being used correctly here?
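
For questions 4 and 5, here is the quick check I mentioned, with random toy tensors standing in for my real outputs and targets:

import torch
import torch.nn.functional as F

logits = torch.randn(6, 3)               # like the output of fc3: [Sum_batch_seq_lens, 3]
targets = torch.randint(0, 3, (6,))      # one class index per timestep

print(F.cross_entropy(logits, targets).item())                    # loss on the raw scores
print(F.cross_entropy(F.softmax(logits, dim=1), targets).item())  # loss after softmax; the two values differ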

Thank you in advance and sorry for the trouble

Apart from all your questions, I don’t see that you initialize the hidden state of your LSTM layer in each iteration; see this post. There should be something like:

for i, inputs in enumerate(X_train):
    # Re-initialize the hidden state for this batch
    model.hidden = model.init_hidden(batch_size)
    ...

and your model class should have a method init_hidden like:

def init_hidden(self, batch_size):
    # Shape (num_layers * num_directions, batch_size, hidden_dim) for both h_0 and c_0
    return (torch.zeros(self.num_layers * self.directions_count, batch_size, self.rnn_hidden_dim).to(self.device),
            torch.zeros(self.num_layers * self.directions_count, batch_size, self.rnn_hidden_dim).to(self.device))

I just copied this from my code, so you would need to adapt it to your requirements.
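
For your BRNN above (num_layers=2, bidirectional, hidden_size=512) an adaptation might look roughly like this (untested sketch; device handling left out):

def init_hidden(self, batch_size):
    # (num_layers * num_directions, batch_size, hidden_size) for both h_0 and c_0
    num_layers, num_directions, lstm_dims = 2, 2, 512
    h0 = torch.zeros(num_layers * num_directions, batch_size, lstm_dims)
    c0 = torch.zeros(num_layers * num_directions, batch_size, lstm_dims)
    return (h0, c0)

and then you would pass it into the LSTM call in forward, e.g. output, _ = self.brnn(output, self.hidden).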