Output of LSTM for linear classification

Hi everyone! I have a biLSTM model which I’m using to classify posts. It is a binary classification task. I am using batch_first, so the input to the LSTM has shape [8x50x768]. I then take the ‘output’ of the LSTM layer, which has shape [8x50x40], and pass it through a linear layer and then a sigmoid function to map it to a value between 0 and 1. However, the result is a 3D tensor of shape [8x50x1], and I’m unsure how to get a single value for each item in the batch to compare with the labels, which are a list with one value per batch item.

I read online that you can use max pooling, but I’m not entirely sure that is correct. I tried that with torch.max(output, 1), which leaves me with a tensor of shape [8x40] (the values it returns). After I pass that through the other layers I end up with [8x1], which I then squeeze so I can compare it with the labels. I’m not sure if any of this is “correct” though, so I would appreciate it if someone could explain that to me.

A side question I had was whether you should use the output part of the LSTM’s return value for binary classification, or the h_n or c_n parts.

When you say output of the LSTM layer, I assume you have something like

output, (hidden, cell) = self.lstm(...)

in your code. output contains the hidden states for all time steps (of the last layer, in case you have num_layers > 1). In principle, you could average the hidden states over all time steps.
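To make the pooling idea concrete, here is a minimal sketch of averaging (or max pooling) over the time dimension. It assumes a single-layer bidirectional nn.LSTM with hidden_size=20, which is one way to get the [8x50x40] output from the question; the variable names and the random input are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shapes from the question; hidden_size=20 is an assumption (2 directions * 20 = 40)
batch_size, seq_len, input_size, hidden_size = 8, 50, 768, 20

lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden_size, 1)

x = torch.randn(batch_size, seq_len, input_size)   # dummy batch of posts
output, (hidden, cell) = lstm(x)                   # output: [8, 50, 40]

# Pool over the time dimension (dim=1) BEFORE the linear layer
mean_pooled = output.mean(dim=1)                   # [8, 40]
max_pooled, _ = output.max(dim=1)                  # max along a dim returns (values, indices)

# One probability per item in the batch
probs = torch.sigmoid(classifier(mean_pooled)).squeeze(1)   # [8]
```

Either pooled tensor can be fed to the linear layer. If the batch contains padded sequences, the padded positions would also enter the mean or max, so this sketch is only directly applicable to unpadded input.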

However, the default approach is usually to consider only the last hidden state with respect to the forward direction and the last hidden state with respect to the backward direction. You can then either sum/average these two hidden states or concatenate them. Depending on your choice, the input size for the linear layer is either hidden_size (sum/average) or 2*hidden_size (concatenation).
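As a rough sketch of that approach, the last forward and backward hidden states can be pulled out of the hidden tensor (h_n) like this. The sizes (hidden_size=20, num_layers=1) and variable names are assumptions chosen to match the shapes in the question, not taken from the linked code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes: 2 directions * hidden_size 20 gives the 40 from the question
batch_size, seq_len, input_size, hidden_size, num_layers = 8, 50, 768, 20, 1

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(batch_size, seq_len, input_size)   # dummy batch
output, (hidden, cell) = lstm(x)

# hidden is [num_layers * 2, batch, hidden_size], even with batch_first=True.
# Separate the layer and direction axes, then keep only the last layer.
h = hidden.reshape(num_layers, 2, batch_size, hidden_size)
last_fwd = h[-1, 0]                                 # [8, 20], forward pass after the last step
last_bwd = h[-1, 1]                                 # [8, 20], backward pass after the first step

concat = torch.cat([last_fwd, last_bwd], dim=1)     # [8, 40] -> Linear(2 * hidden_size, 1)
summed = last_fwd + last_bwd                        # [8, 20] -> Linear(hidden_size, 1)

classifier = nn.Linear(2 * hidden_size, 1)
probs = torch.sigmoid(classifier(concat)).squeeze(1)   # [8], one value per batch item
```

Note that output[:, -1, :] is not the same thing: its backward half has only seen the final time step. The backward direction's final state sits at output[:, 0, hidden_size:], which is why using hidden is less error-prone.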

For an example, you can check out this code for the GRU/LSTM classifier. The relevant snippets are below the comments # Extract last hidden state and # Handle directions