Proper way to combine linear layer after LSTM


I have implemented a simple word-generating network using an LSTMCell coupled with a Linear layer, and it works perfectly. I now want to use the LSTM class so that I can process the data in batches and train faster.

The same architecture with an LSTM instance plus a Linear output layer produces utter nonsense. I figured this might be because LSTM expects its input in the order (seq_length, batch_dim, input_dim), whereas the linear layers want (batch_dim, seq_length, input_length), so I set batch_first = True on the LSTM in my code, but I still get the same results.
Could it be that the LSTM still outputs in the format (seq_len, batch_dim, input_size) even if we pass batch_first = True as an argument?

I’ve searched the forum a bit, and the answers always suggest reshaping the output of the LSTM before passing it to the linear layer, which I find cumbersome, but maybe there is no way around it in PyTorch.

According to the documentation, if you’ve specified batch_first = True, the output will have the shape (batch, seq_length, num_directions * hidden_size).
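To make the shapes concrete, here is a minimal sketch (not code from the thread; the sizes are arbitrary assumptions) that prints what nn.LSTM returns with batch_first=True and shows that nn.Linear can be applied to that output directly, since it acts on the last dimension only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, seq_len, input_size, hidden_size = 4, 7, 10, 16

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
linear = nn.Linear(hidden_size, input_size)

x = torch.randn(batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)

# batch_first changes the layout of the input and output tensors only;
# h_n and c_n keep the layout (num_layers * num_directions, batch, hidden_size).
print(out.shape)   # torch.Size([4, 7, 16])
print(h_n.shape)   # torch.Size([1, 4, 16])

# nn.Linear is applied along the last dimension, so no reshape is needed.
logits = linear(out)
print(logits.shape)  # torch.Size([4, 7, 10])

# Flattening first and restoring the shape afterwards gives the same values.
logits_flat = linear(out.reshape(-1, hidden_size)).reshape(batch, seq_len, -1)
```

Under these assumptions, the explicit reshape is optional for the Linear layer; it only matters for functions that expect a 2-D input.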


Yes, that is what I thought… I also checked by looking carefully at the dimensions at each forward step. So I don’t know what is wrong. Do you see any flaw in my design? Here is the model:

import torch
import torch.nn as nn

class pytorchLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()  # required before submodules can be registered
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, input_size)
        self.tanh = nn.Tanh()
        # log-probabilities over the last (symbol) dimension of (batch, seq, symbols)
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, input, hidden=None):
        if hidden is None:
            # zero initial (h_0, c_0), each shaped (num_layers, batch, hidden_size)
            batch = input.size(0)
            hidden = (torch.zeros(1, batch, self.hidden_size),
                      torch.zeros(1, batch, self.hidden_size))
        # run the LSTM whether or not a hidden state was passed in
        out, hidden = self.lstm(input, hidden)
        out = self.tanh(out)
        out = self.output_layer(out)
        out = self.softmax(out)
        return out, hidden

The inputs are (1 x seq_length x input_length) tensors corresponding to the one-hot-encoded letters of a word. The same holds for the target. There is of course a start and an end token.
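For readers who want to reproduce the setup, the encoding described above might look roughly like the sketch below. It is not the poster's code: the alphabet, the end-token index, and the function names `input_tensor_sketch` and `target_tensor_sketch` are all assumptions made for illustration, the start token is left out for brevity, and the target is stored as class indices because that is what NLLLoss expects.

```python
import string
import torch

# Hypothetical alphabet and end-of-word marker; the thread does not show the real ones.
letters = string.ascii_lowercase
EOS = len(letters)            # index reserved for the end token
n_symbols = len(letters) + 1

def input_tensor_sketch(word):
    """One-hot encode the letters of `word` as a (1, seq_len, n_symbols) tensor."""
    idx = torch.tensor([letters.index(c) for c in word])
    return torch.nn.functional.one_hot(idx, n_symbols).float().unsqueeze(0)

def target_tensor_sketch(word):
    """Index of the next symbol at each step; the last step predicts the end token."""
    return torch.tensor([letters.index(c) for c in word[1:]] + [EOS])

x = input_tensor_sketch("anna")
y = target_tensor_sketch("anna")
print(x.shape)     # torch.Size([1, 4, 27])
print(y.tolist())  # [13, 13, 0, 26]
```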
Here is the training loop :

def train_rnn(model):
    criterion = nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters())
    n_iters = 10000

    for iter in range(1, n_iters + 1):
        # choose a word randomly from the data
        word = randomChoice(words)
        # transform the word into a (1 x seq_length x input_length) tensor of one-hot encoded vectors
        input_tensor = inputTensor(word)
        # the target is the same word as the input, shifted one step ahead
        target_tensor = targetTensor(word).unsqueeze(-1)
        loss = 0
        output, hidden = model(input_tensor)
        # accumulate the per-step NLL loss over the sequence
        for i in range(input_tensor.size(1)):
            l = criterion(output[0][i].unsqueeze(0), target_tensor[i])
            loss += l
        # Backward pass and parameter update. The posted snippet stops at the
        # loss accumulation, so these three lines are an assumed completion.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

This looks correct to me. What errors are you getting?

Edit: Also, you don’t need to define hidden when you have no previous hidden state and cell state; when they aren’t provided, they default to zeros.
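A quick way to confirm the default is to compare the two calls directly. The sketch below uses arbitrary sizes that are assumptions for illustration, not values from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)
x = torch.randn(2, 3, 5)  # (batch, seq_len, input_size)

# No initial state passed: PyTorch starts from zeros.
out_default, _ = lstm(x)

# Explicit zero state, shaped (num_layers * num_directions, batch, hidden_size).
zeros = torch.zeros(1, x.size(0), 8)
out_explicit, _ = lstm(x, (zeros, zeros))

same = torch.allclose(out_default, out_explicit)
print(same)  # True
```

If an explicit state is passed, its batch dimension has to match the batch size of the input, even when batch_first=True.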

I am not getting an error, but I am getting nonsensical results. Here is an example of the words that I get with this network:


This is in stark contrast with this model (which has exactly the same architecture) based on LSTMCell:

class lstmCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()  # required before submodules can be registered
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # log-probabilities over the symbol dimension of a (batch, symbols) tensor
        self.softmax = nn.LogSoftmax(dim=1)
        self.tanh = nn.Tanh()
        self.output_layer = nn.Linear(hidden_size, input_size)

    def forward(self, input, hidden):
        # one time step: `hidden` is the (h, c) tuple from the previous step
        hidden, context = self.cell(input, hidden)
        out = self.tanh(hidden)
        out = self.softmax(self.output_layer(out))
        return out, (hidden, context)

where I get output like this (in this case it was trained with names):


In the loss calculation you’re comparing output to target_line_tensor instead of target_tensor. Is that intentional? What is target_line_tensor?

Whoops, I made a small mistake when transferring the code here. It is actually target_tensor.

I would recommend the following:

  1. Print out the values of out and the target tensor to make sure that you are comparing the right values.
  2. Try to overfit to one (or a few) training examples.

If you’re not able to overfit to a few examples, there’s probably something wrong with the code.
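The two checks above can be sketched as follows. This is a hedged illustration rather than the poster's setup: the class `TinyLSTM`, the six-symbol vocabulary, the single fixed sequence, and the learning rate are all assumptions. The idea is that a correctly wired model should drive the loss on one example close to zero and reproduce its targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
vocab_size, hidden_size = 6, 32

class TinyLSTM(nn.Module):
    """Stand-in model: LSTM followed by a Linear layer and log-softmax."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(x)
        return torch.log_softmax(self.out(h), dim=2)

# One fixed training example: input symbols and the symbol that follows each one.
inp_idx = torch.tensor([0, 1, 2, 3, 4])
tgt_idx = torch.tensor([1, 2, 3, 4, 5])
x = nn.functional.one_hot(inp_idx, vocab_size).float().unsqueeze(0)  # (1, seq_len, vocab)

model = TinyLSTM(vocab_size, hidden_size)
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for step in range(300):
    optimizer.zero_grad()
    log_probs = model(x)                     # (1, seq_len, vocab)
    loss = criterion(log_probs[0], tgt_idx)  # (seq_len, vocab) vs (seq_len,)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Check 1: put predictions and targets side by side.
predicted = model(x)[0].argmax(dim=1)
print("predicted:", predicted.tolist())
print("target:   ", tgt_idx.tolist())

# Check 2: the loss on the single example should have dropped sharply.
print("first loss:", losses[0], "last loss:", losses[-1])
```

If the loss plateaus or the predictions never line up with the targets on such a tiny problem, the mismatch is more likely in the data alignment or the loss indexing than in the model capacity.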

Okay, I will try that tomorrow. Thanks for the tips.