LSTM Layer producing same outputs for different sequences

Hey there,
I am still fairly inexperienced with PyTorch, and this is the first time I am using a sequence model, i.e. an LSTM.

I am currently trying to train a model on a multi-label language task with an imbalanced class distribution. I have the following model, from which I removed some of the feed-forward layers to reduce the number of factors in the chain of gradients.

Since the outputs are extremely weird at inference time (i.e. every prediction is class 1 of 32 and no others), I started to check the layers, especially the LSTM layer, for inconsistencies.

First let me share my model-architecture with you.

class Bi_RNN(nn.Module):
    """
    Embedding Dim 300
    """
    def __init__(self, hidden_dim_lstm, in_2_dim, in_3_dim, in_4_dim, input_dim=300, output_dim=32, num_layers=1, batch_size=1):
        super(Bi_RNN, self).__init__()
        
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim_lstm*2*num_layers
        self.hidden_dim_lstm = hidden_dim_lstm
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.in_2_dim = in_2_dim
        self.in_3_dim = in_3_dim
        self.in_4_dim = in_4_dim
        self.act = nn.PReLU()

        # Define the LSTM layer
        self.lstm = nn.LSTM(self.input_dim, self.hidden_dim_lstm, self.num_layers, batch_first=True, bidirectional=True)

        # Define the FFN
        self.linear_layer_1 = nn.Linear(self.hidden_dim, self.in_4_dim)
        self.linear_layer_last = nn.Linear(self.in_4_dim, output_dim)  
        
    def init_hidden(self):
        # This is what we'll initialise our hidden state as
        device = next(self.parameters()).device.type
        return (torch.zeros(self.num_layers*2, self.batch_size, self.hidden_dim//2).to(device),
                torch.zeros(self.num_layers*2, self.batch_size, self.hidden_dim//2).to(device))
    
    def forward(self, input):
        lstm_out, self.hidden = self.lstm(input, self.init_hidden())
        h_n, c_n = self.hidden
        c_n_merged = c_n.reshape(self.batch_size, -1)
  
        layer_1_out = self.act(self.linear_layer_1(c_n_merged))
        out = self.linear_layer_last(layer_1_out)
        out = torch.sigmoid(out)

        return out

This is the model state after training. Consider the following inputs x each with shape torch.Size([7484, 300]) (it’s actually a batch with torch.Size([64, 7484, 300])).

x_1 looks like this

tensor([[-0.1113,  0.1436,  0.1895,  ...,  0.0342,  0.1602, -0.2500],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [-0.2910,  0.1787,  0.0500,  ..., -0.0228,  0.1177,  0.3535],
         ...

x_2 looks like this

tensor([[ 0.1250,  0.0266, -0.0272,  ..., -0.0864, -0.1621, -0.0337],
         [ 0.0070, -0.0732,  0.1719,  ...,  0.0112,  0.1641,  0.1069],
         [ 0.0762,  0.0820, -0.1118,  ..., -0.0942, -0.0684,  0.2266],
         ...

So when I take the LSTM out of the model and pass these vectors (as a batch) into it, the final states c_n and h_n, each with shape torch.Size([800]) per sample, are identical (this holds for most samples in the batch).

The c_n look like this

tensor([[-0.1549,  0.0412, -0.0041,  ..., -0.1105, -0.0761,  0.0696],
        [-0.1549,  0.0412, -0.0041,  ..., -0.1105, -0.0761,  0.0696]],

and the h_n look like this

tensor([[-0.0746,  0.0206, -0.0020,  ..., -0.0547, -0.0372,  0.0344],
        [-0.0746,  0.0206, -0.0020,  ..., -0.0547, -0.0372,  0.0344]],
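
The comparison described above can be scripted. The following is a minimal sketch, not the original experiment: it uses a small, randomly initialised LSTM (all sizes are chosen here for illustration) and a hypothetical helper, identical_rows, that reports per direction which batch samples end in the same final state as sample 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes, far smaller than the 7484 x 300 inputs in the post.
batch_size, seq_len, input_dim, hidden = 3, 5, 6, 4
lstm = nn.LSTM(input_dim, hidden, num_layers=1, batch_first=True, bidirectional=True)

def identical_rows(state, atol=1e-6):
    """Hypothetical helper: for each direction, flag the samples whose final
    state matches that of sample 0. state: (num_directions, batch, hidden)."""
    ref = state[:, :1, :]                              # sample 0, kept as a batch of 1
    return ((state - ref).abs() <= atol).all(dim=-1)   # (num_directions, batch)

x = torch.randn(batch_size, seq_len, input_dim)
x[2] = x[0]                                            # sample 2 is a copy of sample 0

with torch.no_grad():
    _, (h_n, c_n) = lstm(x)

same_h = identical_rows(h_n)
print(same_h)
```

On a healthy LSTM only genuinely identical inputs (here samples 0 and 2) should be flagged; if different inputs are flagged as well, the direction in which that happens is the one to investigate.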

I don’t understand how this is happening and I would be very grateful if somebody can point out what my misconception is.

I am sorry in advance if there are any rather stupid mistakes.

Thanks

If you look at the docs, the shape of h_n is (num_directions*num_layers, batch_size, hidden_dim). This means that

c_n_merged = c_n.reshape(self.batch_size, -1)

will mess up your data (see also here). Also, you want to use h_n not c_n for further processing.

# Separate num_layers and num_directions
h_n = h_n.view(num_layers, num_directions, batch_size, hidden_dim)

# Get last hidden state w.r.t. number of layers
h_last = h_n[-1]

# Handle both directions by concatenating the 2 respective last hidden states
h_last = torch.cat((h_last[0], h_last[1]), 1)

Now h_last should have the shape (batch_size, 2*hidden_dim), where hidden_dim is the hidden size of the LSTM (hidden_dim_lstm in your model). The first linear layer therefore has to accept 2*hidden_dim_lstm input features:

self.linear_layer_1 = nn.Linear(2*self.hidden_dim_lstm, self.in_4_dim)
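
One way to convince yourself that this recipe picks the right states is to compare it against lstm_out. The sketch below is only an illustration with made-up sizes (hidden_dim here means the hidden size of the LSTM itself): the forward direction finishes at the last time step of lstm_out, the backward direction at the first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the values from the question).
num_layers, num_directions = 2, 2
batch_size, seq_len, input_dim, hidden_dim = 4, 7, 5, 3

lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, input_dim)

with torch.no_grad():
    lstm_out, (h_n, c_n) = lstm(x)

# h_n arrives as (num_layers * num_directions, batch_size, hidden_dim).
h_n = h_n.view(num_layers, num_directions, batch_size, hidden_dim)
h_last = h_n[-1]                                   # last layer only
h_last = torch.cat((h_last[0], h_last[1]), dim=1)  # (batch_size, 2 * hidden_dim)

# Cross-check: forward half of the last step, backward half of the first step.
expected = torch.cat((lstm_out[:, -1, :hidden_dim], lstm_out[:, 0, hidden_dim:]), dim=1)
print(torch.allclose(h_last, expected))

first_linear = nn.Linear(2 * hidden_dim, 8)        # 8 is an arbitrary layer width
features = first_linear(h_last)
```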

I’m actually surprised that your code doesn’t throw an error, but I might very well have missed something.
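
On the missing error: reshape only requires that the total number of elements stays the same, so c_n.reshape(batch_size, -1) runs without complaint while combining values from different samples in one row. A minimal sketch with a hand-built tensor (the sizes and the tagging scheme are made up for illustration) makes this visible:

```python
import torch

num_directions, batch_size, hidden_dim = 2, 4, 3

# Fake final state of shape (num_directions, batch_size, hidden_dim) in which
# every value equals the index of the sample it belongs to.
sample_ids = torch.arange(batch_size, dtype=torch.float32)
c_n = sample_ids.view(1, batch_size, 1).expand(num_directions, batch_size, hidden_dim).contiguous()

# The reshape from the question: no error, but each row mixes two samples.
wrong = c_n.reshape(batch_size, -1)

# Moving the batch dimension to the front first keeps each sample together.
right = c_n.permute(1, 0, 2).reshape(batch_size, -1)

print(wrong)
print(right)
```

In the first printout each row contains values tagged with two different sample indices; in the second, row i contains only values from sample i.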

Hey Chris,

thank you so much for pointing out my misconceptions and mistakes. You are absolutely right that the reshaping should cause errors. I happily implemented your suggestions.

The dimensions should have been correct from the get go, but surely my code was not very comprehensible.

In the meantime I made some changes, such as adding layer normalization after the bidirectional LSTM. For some reason, one of the directions of the LSTM has now started to react differently to different inputs. The other direction, however, still gives the same cell states for any input, or at least extremely similar ones (occasionally a sample deviates).

I assume this could have something to do with the input to the network, but I am not sure. I use BCELoss, the Adam optimizer with a learning rate of 0.0001, and 300-dimensional word2vec embeddings. I also now use random initial hidden and cell states (h_0 and c_0) and register them as learnable parameters.

The updated code looks as follows:

class Bi_RNN(nn.Module):
   """
   Embedding Dim 300
   """
   def __init__(self, hidden_dim_lstm, in_2_dim, in_3_dim, in_4_dim, input_dim=300, output_dim=32, num_layers=1, batch_size=1):
       super(Bi_RNN, self).__init__()
       
       self.input_dim = input_dim
       self.hidden_dim_lstm_conc = hidden_dim_lstm*2*num_layers
       self.hidden_dim_lstm = hidden_dim_lstm
       self.batch_size = batch_size
       self.num_layers = num_layers
       self.in_2_dim = in_2_dim
       self.in_3_dim = in_3_dim
       self.act = nn.PReLU()
       self.num_directions = 2

       # Define Layer Norm
       self.layer_norm = nn.LayerNorm(hidden_dim_lstm)

       # Define the LSTM layer
       self.lstm = nn.LSTM(self.input_dim, self.hidden_dim_lstm, self.num_layers, batch_first=True, bidirectional=True)
       self.c_0 = nn.Parameter(torch.randn(self.num_directions, batch_size, hidden_dim_lstm), requires_grad=True)
       self.h_0 = nn.Parameter(torch.randn(self.num_directions, batch_size, hidden_dim_lstm), requires_grad=True)

       # Define the FFN
       self.linear_layer_1 = nn.Linear(self.hidden_dim_lstm_conc, self.in_2_dim)
       self.linear_layer_2 = nn.Linear(self.in_2_dim, self.in_3_dim)
       self.linear_layer_last = nn.Linear(self.in_3_dim, output_dim)  
       
   def init_hidden(self):
       # This is what we'll initialise our hidden state as
       #device = next(self.parameters()).device.type
       return (self.h_0, self.c_0)
   
   def forward(self, input):
       input = input.reshape(self.batch_size, *input.shape[1:])

       # Forward pass through LSTM layer
       # shape of lstm_out: [batch_size, seq_len, 2*hidden_dim_lstm]
       # shape of self.hidden: (h_n, c_n), where h_n and c_n both
       # have shape (num_layers*num_directions, batch_size, hidden_dim_lstm).
       lstm_out, self.hidden = self.lstm(input, self.init_hidden())
       h_n, c_n = self.hidden
       h_n = self.layer_norm(h_n)
       h_n_conc = torch.cat((h_n[0], h_n[1]), 1)
   
       layer_1_out = self.act(self.linear_layer_1(h_n_conc))
       layer_2_out = self.act(self.linear_layer_2(layer_1_out))
       out = self.linear_layer_last(layer_2_out)
       out = torch.sigmoid(out)

       return out

If you have any ideas why the one direction of the lstm still produces the same outputs I’d be happy to hear them.

Thanks so much again,
Thomas
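
One possible cause worth ruling out, stated here purely as an assumption because the thread does not say how the sequences are padded: if shorter sequences are zero-padded at the end up to the common length, the forward direction keeps processing padding after the last real token, and over a long run of zeros its final state may drift towards a value that no longer depends on the input. The backward direction reads the padding first and the real tokens last, so it would not be affected in the same way. The sketch below uses small made-up sizes and torch.nn.utils.rnn.pack_padded_sequence to show how packing makes each direction stop at the true end of every sequence.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

input_dim, hidden = 6, 4
lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)

# Two sequences with true lengths 5 and 3, zero-padded to length 8.
lengths = torch.tensor([5, 3])
x = torch.zeros(2, 8, input_dim)
x[0, :5] = torch.randn(5, input_dim)
x[1, :3] = torch.randn(3, input_dim)

with torch.no_grad():
    # Unpacked: the forward direction keeps running over the padding.
    out_padded, (h_padded, _) = lstm(x)

    # Packed: each direction stops at the true end of each sequence.
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)

# With packing, the forward state of sample 0 is the one after its 5th token.
forward_at_true_end = out_padded[0, 4, :hidden]
print(torch.allclose(h_packed[0, 0], forward_at_true_end, atol=1e-5))
```

Whether this applies depends on how the 7484-step inputs were built, so it is a check to run, not a diagnosis.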

What is the initial shape of input in the forward() method? Without knowing that, it is not clear to me what

input = input.reshape(self.batch_size, *input.shape[1:])

is doing, or why it is needed in the first place.
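
For reference, a small sketch of what that reshape does in isolation (the shapes are illustrative assumptions): when the first dimension already equals batch_size, the data comes back unchanged; when the batch has a different size, for example a smaller final batch, the call raises a RuntimeError because the element count no longer fits.

```python
import torch

batch_size = 4
full = torch.randn(batch_size, 5, 3)      # (batch, seq_len, embedding_dim)
partial = torch.randn(3, 5, 3)            # e.g. a smaller last batch

# Same shape in, same shape out: the call changes nothing here.
unchanged = full.reshape(batch_size, *full.shape[1:])

# 3*5*3 = 45 elements cannot be arranged as (4, 5, 3), which needs 60.
try:
    partial.reshape(batch_size, *partial.shape[1:])
    raised = False
except RuntimeError:
    raised = True

print(torch.equal(unchanged, full), raised)
```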