I have implemented a simple word-generating network using an LSTMCell coupled with a Linear layer, and it works perfectly. I now want to use the LSTM class so that I can process the data in batches and train faster.

The same architecture with an LSTM instance + Linear output layer produces utter nonsense. I figured this might be because LSTM expects its input in the order (seq_length, batch_dim, input_dim), whereas I assumed the linear layer wants (batch_dim, seq_length, input_dim), so I changed the LSTM in my code to batch_first = True, but I still get the same results.
Could it be that the LSTM still outputs in the format (seq_length, batch_dim, hidden_size) even if we pass batch_first = True as an argument?
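
For reference, this is the kind of shape check that should settle the question. It is a minimal sketch with made-up sizes (batch, seq_len, input_size and hidden_size are illustrative, not taken from my model). As far as I understand, batch_first = True changes the layout of the input and of the output sequence, but not of the returned hidden and cell states.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
batch, seq_len, input_size, hidden_size = 4, 7, 10, 16

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)  # batch-first input
out, (h_n, c_n) = lstm(x)

# The output sequence follows the batch-first layout of the input.
out_shape = tuple(out.shape)   # (batch, seq_len, hidden_size)
# The final hidden state keeps (num_layers, batch, hidden_size).
h_shape = tuple(h_n.shape)
print(out_shape, h_shape)
```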

I’ve searched the forum a bit, and the answers always suggest reshaping the output of the LSTM before passing it to the linear layer. I find that cumbersome, but maybe there is no way around it in PyTorch.
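
To check whether the reshape is actually needed, here is a small sketch comparing both routes. It assumes that nn.Linear acts on the last dimension only and leaves the leading dimensions alone; the sizes and the lstm_out tensor are stand-ins, not my real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes, chosen only for illustration.
batch, seq_len, hidden_size, n_classes = 4, 7, 16, 10

linear = nn.Linear(hidden_size, n_classes)
lstm_out = torch.randn(batch, seq_len, hidden_size)  # stand-in for the LSTM output

# Route 1: apply the linear layer directly to the 3-D tensor.
direct = linear(lstm_out)

# Route 2: flatten to 2-D, apply the layer, then restore the shape.
flat = linear(lstm_out.reshape(-1, hidden_size))
reshaped = flat.reshape(batch, seq_len, n_classes)

same = torch.allclose(direct, reshaped, atol=1e-5)
print(direct.shape, same)
```

If both routes agree, the explicit reshape is a matter of taste rather than a requirement.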

Yes, that is what I thought … I also checked by looking carefully at the dimensions at each step of the forward pass. So I don’t know what is wrong. Do you see any flaw in my design? Here is the model:

import torch
import torch.nn as nn

class pytorchLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # batch_first=True: input and output are (batch, seq_len, features)
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, input_size)
        self.tanh = nn.Tanh()
        # log-probabilities over the last (class) dimension
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, input, hidden=None):
        # nn.LSTM initializes (h_0, c_0) to zeros when hidden is None,
        # so the two identical if/else branches collapse into one path
        # that also works for batch sizes other than 1.
        out, hidden = self.lstm(input, hidden)
        out = self.tanh(out)
        out = self.output_layer(out)
        out = self.softmax(out)
        return out, hidden

The inputs are (1 x seq_length x input_length) tensors corresponding to the one-hot-encoded letters of a word. Same for the target. There are of course start and end tokens.
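
To make the encoding concrete, here is a rough sketch of how such tensors could be built. The vocabulary, the token indices, and the helper names one_hot_input and target_indices are hypothetical and are not my actual inputTensor / targetTensor. The sketch also assumes the target is stored as class indices, which is what NLLLoss expects.

```python
import string
import torch
import torch.nn.functional as F

# Hypothetical vocabulary: lowercase letters plus start and end tokens.
letters = string.ascii_lowercase
START, END = len(letters), len(letters) + 1
n_symbols = len(letters) + 2

def one_hot_input(word):
    # Start token followed by the letters, as a (1, seq_len, n_symbols) tensor.
    idx = torch.tensor([START] + [letters.index(c) for c in word])
    return F.one_hot(idx, num_classes=n_symbols).float().unsqueeze(0)

def target_indices(word):
    # The same sequence shifted by one step: the letters followed by the end token.
    return torch.tensor([letters.index(c) for c in word] + [END])

x = one_hot_input("cat")
y = target_indices("cat")
print(x.shape, y.tolist())
```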
Here is the training loop:

def train_rnn(model):
    criterion = nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters())
    n_iters = 10000
    for iter in range(1, n_iters + 1):
        # choose a word at random from the data
        word = randomChoice(words)
        # transform the word into a (1 x seq_length x input_length) tensor of one-hot encoded vectors
        input_tensor = inputTensor(word)
        # the target is the same word as the input, shifted one step ahead
        target_tensor = targetTensor(word).unsqueeze(-1)
        optimizer.zero_grad()
        loss = 0
        output, hidden = model(input_tensor)
        # accumulate the loss over every time step of the sequence
        for i in range(input_tensor.size(1)):
            l = criterion(output[0][i].unsqueeze(0), target_tensor[i])
            loss += l
        loss.backward()
        optimizer.step()
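
As a side note, the per-step loop over the sequence could probably be replaced by a single call. This is only a sketch with random stand-in tensors (output and target_tensor below are fabricated with the same shapes as in the loop above), and it assumes NLLLoss with reduction="sum" matches the sum of the per-step losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes; the tensors below only mimic the shapes in the training loop.
seq_len, n_classes = 6, 10
output = torch.log_softmax(torch.randn(1, seq_len, n_classes), dim=2)
target_tensor = torch.randint(0, n_classes, (seq_len, 1))

# Per-step accumulation, as in the training loop.
criterion = nn.NLLLoss()
loop_loss = sum(
    criterion(output[0][i].unsqueeze(0), target_tensor[i])
    for i in range(seq_len)
)

# One call over the whole sequence: (seq_len, n_classes) against (seq_len,).
summed = nn.NLLLoss(reduction="sum")
vector_loss = summed(output[0], target_tensor.squeeze(-1))

print(loop_loss.item(), vector_loss.item())
```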