Hello,

The problem I am working on goes like this. I have variable-length sequences in which every timestep has a label. There are 3 classes in total and each timestep belongs to exactly one of them, so this is a multi-class classification problem. My inputs are of size [Batch_size, Max_seq_len, 20] and my labels are of size [Batch_size, Max_seq_len, 1] for each batch, and Max_seq_len changes from batch to batch.

I’ve used pack_padded_sequence with batch_first=True to get a packed sequence that can be sent as input to the LSTM. The packed data fed to the LSTM is therefore [Sum_batch_seq_lens, 20] and the packed output is [Sum_batch_seq_lens, 2*lstm_dims] = [Sum_batch_seq_lens, 1024]. This is then passed through 3 dense layers that reduce the last dimension to 3, so the BRNN class returns a tensor of size [Sum_batch_seq_lens, 3]. Please refer to the code below.

```
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence


class BRNN(torch.nn.Module):
    def __init__(self, input_dims=20, num_lstms=2, lstm_dims=512, out_dims=3):
        super().__init__()
        # Bidirectional LSTM: each timestep yields 2*lstm_dims features
        self.brnn = torch.nn.LSTM(input_size=input_dims, hidden_size=lstm_dims,
                                  num_layers=num_lstms, bias=True,
                                  batch_first=True, bidirectional=True)
        self.fc1 = torch.nn.Linear(in_features=2*lstm_dims, out_features=512)
        self.fc2 = torch.nn.Linear(in_features=512, out_features=256)
        self.fc3 = torch.nn.Linear(in_features=256, out_features=out_dims)

    def forward(self, padded_input, input_lengths):
        # Pack the padded batch so the LSTM skips the padding timesteps
        output = pack_padded_sequence(padded_input, input_lengths,
                                      batch_first=True)
        output, _ = self.brnn(output)
        batch_sizes = output.batch_sizes
        # output.data is [Sum_batch_seq_lens, 2*lstm_dims]; no unpacking is done
        output = F.relu(self.fc1(output.data))
        output = F.relu(self.fc2(output))
        # Class probabilities over the out_dims classes
        output = F.softmax(self.fc3(output), dim=1)
        return output, batch_sizes
```
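The shapes described above can be checked with a small sketch. The batch of 3 sequences, the lengths [5, 3, 2], and the random data are made up for illustration; the lengths are sorted in descending order because pack_padded_sequence expects that unless enforce_sorted=False is passed.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Hypothetical batch: 3 sequences, Max_seq_len = 5, 20 features per timestep
lengths = torch.tensor([5, 3, 2])      # sorted descending, sum = 10
padded = torch.randn(3, 5, 20)         # [Batch_size, Max_seq_len, 20]

packed = pack_padded_sequence(padded, lengths, batch_first=True)

# Same LSTM configuration as in the BRNN class above
lstm = torch.nn.LSTM(input_size=20, hidden_size=512, num_layers=2,
                     batch_first=True, bidirectional=True)
out, _ = lstm(packed)

packed_shape = tuple(packed.data.shape)   # [Sum_batch_seq_lens, 20]
out_shape = tuple(out.data.shape)         # [Sum_batch_seq_lens, 2*lstm_dims]
batch_sizes = out.batch_sizes.tolist()    # active sequences at each timestep

print(packed_shape, out_shape, batch_sizes)
```

Here Sum_batch_seq_lens is 5 + 3 + 2 = 10, and batch_sizes records how many sequences are still running at each timestep.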

To train the model, I use

```
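import torch.optim as optim
from torch.nn.utils.rnn import pack_padded_sequence

net = BRNN()
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
# X_train[i], Y_train[i], seq_lens[i]: padded inputs, padded labels and
# sequence lengths of batch i
for i, inputs in enumerate(X_train):
    # Pack the labels with the same lengths as the inputs
    labels = pack_padded_sequence(Y_train[i], seq_lens[i], batch_first=True)
    optimizer.zero_grad()
    outputs, batch_sizes = net(inputs, seq_lens[i])
    # labels.data is [Sum_batch_seq_lens, 1]; drop the last dimension
    loss = criterion(outputs, labels.data[:, 0])
    loss.backward()
    optimizer.step()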
```
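Since the loss compares rows of the packed outputs with rows of the packed labels, it may be worth confirming that both end up in the same order. The sketch below is only an illustration: the tiny tensors are invented, and each value encodes 10*batch_index + timestep so the ordering is visible (-1 and 0. mark padding).

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical batch of 2 sequences with lengths 3 and 2
lengths = torch.tensor([3, 2])
x = torch.tensor([[[0.], [1.], [2.]],
                  [[10.], [11.], [0.]]])      # [2, 3, 1], last step of row 1 is padding
y = torch.tensor([[[0], [1], [2]],
                  [[10], [11], [-1]]])        # [2, 3, 1], -1 marks a padded label

packed_x = pack_padded_sequence(x, lengths, batch_first=True)
packed_y = pack_padded_sequence(y, lengths, batch_first=True)

# Packed data is ordered timestep by timestep, across the sequences still active
label_order = packed_y.data[:, 0].tolist()
input_order = packed_x.data[:, 0].tolist()
print(label_order)   # padded entries are dropped
print(input_order)
```

If the two orders match, as they should when the same lengths are used for both calls, then row k of the model output and row k of labels.data refer to the same timestep of the same sequence.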

After training, the model predicts the same class, 2, for all inputs. I have tried changing the dropout of the LSTM layer, the learning rate of the optimizer, and batch_first, but every variant still predicts a single class, either 2 or 0. The distribution of the classes throughout the dataset is {2: 745015, 0: 720913, 1: 439274}, i.e. class 2 occurs 745015 times, etc.
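For reference, the class counts above give the accuracy a constant predictor would reach, which is a useful baseline when judging whether the model has learned anything. The inverse-frequency weights at the end are an assumption on my part, one common heuristic for imbalanced classes, not something the training code above uses.

```python
# Class counts as reported above
counts = {2: 745015, 0: 720913, 1: 439274}

total = sum(counts.values())
fractions = {c: n / total for c, n in counts.items()}

# Accuracy obtained by always predicting the most frequent class
majority_class = max(counts, key=counts.get)
majority_accuracy = fractions[majority_class]

# Inverse-frequency class weights, scaled so a balanced dataset gives 1.0 each
# (a heuristic, assumed here purely for illustration)
weights = {c: total / (len(counts) * n) for c, n in counts.items()}

print(majority_class, round(majority_accuracy, 3))
print({c: round(w, 3) for c, w in sorted(weights.items())})
```

Weights like these could be passed, ordered by class index, as the weight argument of torch.nn.CrossEntropyLoss, although the classes here are not severely imbalanced.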

I’m unsure about the following points:

- After packing the padded inputs, should the torch.nn.LSTM layer have batch_first=True? (Because pack_padded_sequence gives the same result for both [B, T, *] and [T, B, *])
- Is the output of torch.nn.LSTM layer being fed correctly to the torch.nn.Linear layer? (As in, I believe that unpacking of the PackedSequence is not required here, but I may be wrong)
- What is the difference between using torch.nn.functional (F) and using a torch.nn layer? (Is it that Autograd will not consider the functional call for automatic differentiation?)
- Is the softmax call required at the end of the forward function? (I did not see it being used in a couple of examples)
- Is CrossEntropyLoss being used correctly here?
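Some of these questions can be probed numerically. The sketch below is a small experiment rather than a definitive answer; the logits and labels are made-up values. It relies on the documented behaviour that torch.nn.CrossEntropyLoss applies log-softmax internally and therefore expects raw scores, and it compares F.relu with the torch.nn.ReLU module.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw scores and labels: 6 timesteps, 3 classes
logits = torch.randn(6, 3, requires_grad=True)            # [Sum_batch_seq_lens, 3]
labels = torch.tensor([[2], [0], [1], [2], [0], [2]])     # [N, 1], like labels.data
targets = labels[:, 0]                                    # [N], int64 class indices

criterion = torch.nn.CrossEntropyLoss()

# CrossEntropyLoss on raw scores equals NLL on log-softmax of those scores
loss_logits = criterion(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Passing probabilities instead means softmax is effectively applied twice.
# With 3 classes each per-sample loss is then confined to
# [log(e + 2) - 1, log(e + 2)], roughly [0.55, 1.55], so it can never approach 0.
loss_double = criterion(F.softmax(logits, dim=1), targets)
lower, upper = math.log(math.e + 2) - 1, math.log(math.e + 2)

# The functional form and the module form compute the same thing,
# and autograd tracks both
relu_equal = torch.equal(F.relu(logits), torch.nn.ReLU()(logits))
F.relu(logits).sum().backward()
has_grad = logits.grad is not None

print(loss_logits.item(), loss_manual.item(), loss_double.item())
print(relu_equal, has_grad)
```

If this matches what you see, it suggests returning the output of fc3 directly during training and applying softmax only when probabilities are needed for inspection or prediction.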

Thank you in advance, and sorry for the trouble.