Confusion regarding PyTorch LSTMs compared to Keras stateful LSTM

Hi all,

I’m trying to train a network with LSTMs to make predictions on time series data with long sequences. The sequences are too long to be fed into the network at once, so instead of feeding an entire sequence I want to split it into subsequences and propagate the hidden state between them to capture long-term dependencies. I’ve done this successfully before in Keras by passing the ‘stateful=True’ flag to the LSTM layers, but I’m confused about how to accomplish the same in PyTorch.

In particular, I’m not sure how to keep and propagate the hidden states when feeding subsequences of a longer sequence as batches. What I’ve tried so far is something like:

    def forward(self, batch_data):
        # Keep the state values but detach them from the previous batch's graph
        self.hidden = tuple(h.detach() for h in self.hidden)
        lstm_out, self.hidden = self.lstm1(batch_data, self.hidden)
        y_pred = self.sigmoid(self.fc1(lstm_out[:, -1]))
        return y_pred

to maintain the hidden state values between batches, and then reset them to zero when starting on a new sequence.
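One way to sketch this in current PyTorch is to keep the state on the module, detach it before each forward pass, and clear it when a new sequence begins. The class name `StatefulLSTM`, the `reset_state` helper, and the layer sizes are assumptions for illustration, not the original model:

```python
import torch
from torch import nn


class StatefulLSTM(nn.Module):
    """Hypothetical module that carries (h, c) from one subsequence to the next."""

    def __init__(self, input_size=1, hidden_size=8):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 1)
        self.hidden = None  # None makes nn.LSTM start from zero states

    def reset_state(self):
        # Call this before the first subsequence of every new long sequence.
        self.hidden = None

    def forward(self, batch_data):
        if self.hidden is not None:
            # Keep the values but cut the graph of the previous subsequence.
            self.hidden = tuple(h.detach() for h in self.hidden)
        lstm_out, self.hidden = self.lstm1(batch_data, self.hidden)
        return torch.sigmoid(self.fc1(lstm_out[:, -1]))


torch.manual_seed(0)
model = StatefulLSTM()
long_sequence = torch.randn(4, 30, 1)  # (batch, time, features)

model.reset_state()
for chunk in long_sequence.split(10, dim=1):  # three subsequences of 10 steps
    y_pred = model(chunk)  # state flows from one chunk to the next
```

This only makes sense if row i of every minibatch continues row i of the previous one, so the data must not be shuffled between subsequences of the same sequence.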

I’ve written up a notebook to illustrate what I’m trying to achieve:

The model is based off of

I have the same question. Making an LSTM stateful in Keras only requires setting the stateful=True parameter when creating it. In PyTorch, however, the hidden states have to be managed manually, and I’m unclear on what to do because I run into errors when doing so.

My procedure is roughly:

  1. Reset hidden state at the beginning of each epoch, and run steps 2-4 for each minibatch:
  2. Run the LSTM forward pass for a minibatch, with lstm_out, self.hidden = self.lstm(input, self.hidden)
  3. Call backward(retain_graph=True)
  4. Call optimizer.step() and zero out gradients

However, calling optimizer.step() throws an in-place operation error. I’m inclined to believe the optimizer is causing it, since running steps 1-3 without the optimizer works without any problem.
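A likely cause of this error is that the carried hidden state still points into the previous minibatch's graph: with `retain_graph=True`, the next `backward()` walks back through that old graph, whose weights `optimizer.step()` has already modified in place. A minimal sketch of the usual workaround, truncated backpropagation through time, detaches the state after every minibatch so that `retain_graph` is not needed. The toy data, sizes, and variable names below are assumptions:

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)
head = nn.Linear(4, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
criterion = nn.MSELoss()

# Toy data: one long sequence per row, cut into minibatches of 5 time steps.
inputs = torch.randn(2, 20, 1)
targets = torch.randn(2, 20, 1)

losses = []
hidden = None  # step 1: reset the state at the start of the epoch
for x, y in zip(inputs.split(5, dim=1), targets.split(5, dim=1)):
    optimizer.zero_grad()
    out, hidden = lstm(x, hidden)   # step 2: forward pass with carried state
    loss = criterion(head(out), y)
    loss.backward()                 # step 3: no retain_graph needed
    optimizer.step()                # step 4: no old graph is reused afterwards
    # Keep the state values but drop their history before the next minibatch.
    hidden = tuple(h.detach() for h in hidden)
    losses.append(loss.item())
```

The trade-off is that gradients no longer flow across minibatch boundaries; only the state values do.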

My questions, thus, are:

  • Is there something that I’m missing in making a stateful LSTM here?
  • Which of the following is equivalent to calling the LSTM with the entire sequence (since it’s too long to fit in memory):
    • Running the forward pass (step 2) over the entire sequence but calling loss.backward() for only the last loss
    • Calling loss.backward() after the forward pass (i.e. calling steps 2 and 3 for each batch) but calling optimizer.step() (step 4) at the end of each epoch?

I think the second option is equivalent to running it over the entire sequence, but I’m not sure. Can anyone help clarify my thoughts on the matter?
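Whether either option matches full-sequence training depends on whether gradients are allowed to flow through the carried state. The sketch below, with assumed toy sizes, compares the two: chunked forward passes with a carried state against a single full-sequence pass, and the gradient obtained when the state is detached between chunks against the full backpropagation-through-time gradient (loss taken on the final output only).

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=3, batch_first=True)
x = torch.randn(1, 12, 1)  # one sequence of 12 steps

# Reference: the whole sequence in one call, loss on the last output only.
full_out, _ = lstm(x)
full_out[:, -1].sum().backward()
full_grad = lstm.weight_ih_l0.grad.clone()
lstm.zero_grad()

# Chunked: three calls of 4 steps, state carried but detached in between.
hidden, pieces = None, []
for chunk in x.split(4, dim=1):
    out, hidden = lstm(chunk, hidden)
    pieces.append(out)
    last_out = out
    hidden = tuple(h.detach() for h in hidden)
chunked_out = torch.cat(pieces, dim=1)
last_out[:, -1].sum().backward()
truncated_grad = lstm.weight_ih_l0.grad.clone()

# Forward outputs should agree; the gradients generally should not.
same_forward = torch.allclose(full_out, chunked_out, atol=1e-5)
grad_gap = (full_grad - truncated_grad).abs().max().item()
```

Here `same_forward` checks that carrying the state reproduces the full-sequence outputs, while `grad_gap` measures how far the truncated gradient is from the full one. With detaching, neither option reproduces the full-sequence gradient exactly; accumulating gradients and stepping once per epoch only changes when the weights are updated.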

I was running into a similar issue with PyTorch vs. Keras. But then I realized that in Keras, when you set stateful=True, you are essentially making a longer sequence of your data with a batch size of 1.

For example, say X is of shape (B, L, H), where B is the batch size, L is the sequence length, and H is the feature (input) dimension. In a Keras LSTM with stateful=True, this is the same as having a batch size of 1 and concatenating the B sequences one after another into a single sequence of length B*L, i.e. the input X now has shape (1, B*L, H).

So by reshaping your input data, you get the same behavior, and this is easy to do in PyTorch.

In theory there should be no difference in space and time complexity between the two approaches: once you set stateful=True in Keras, it has to process the batches sequentially, from the first to the last (it can no longer process them in parallel), because the final hidden state of batch b0 is needed as the initial hidden state of batch b1, and so on for subsequent batches.
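The reshaping described here can be checked directly in PyTorch. A minimal sketch with assumed sizes, where the last dimension of `X` is the per-step input size: feeding the rows of a batch one after another while carrying the state gives the same outputs as one call on the flattened `(1, B*L, H)` tensor.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, L, H = 2, 3, 1  # batch size, sequence length, features per step
lstm = nn.LSTM(input_size=H, hidden_size=4, batch_first=True)
X = torch.randn(B, L, H)

# Option 1: one long sequence of length B*L with batch size 1.
X_long = X.reshape(1, B * L, H)
out_long, _ = lstm(X_long)

# Option 2: feed the rows in order, passing the final state of one to the next.
hidden, outs = None, []
for row in X:  # row has shape (L, H)
    out, hidden = lstm(row.unsqueeze(0), hidden)
    outs.append(out)
out_chained = torch.cat(outs, dim=1)

match = torch.allclose(out_long, out_chained, atol=1e-5)
```

This only verifies the PyTorch side of the argument (flattening equals chaining the rows); it makes no claim about how Keras is implemented internally.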

Hope this helps any future reader running into this.


@amitp-ai Thank you for the information. I made a basic network that tries to do what you propose for a stateful LSTM in PyTorch. Could you tell me if this is what Keras does?

import torch

# Custom Dataset
class TensorDataset(torch.utils.data.Dataset):
    def __init__(self, TensorX,TensorY):
        self.TensorX = TensorX
        self.TensorY = TensorY
    def __len__(self):
        return self.TensorX.shape[0]
    def __getitem__(self,idx):
        return (self.TensorX[idx],self.TensorY[idx])

# Model = Stateful LSTM+linear
class LSTM(torch.nn.Module):
    def __init__(self, input_size,hidden_size,output_size):
        super(LSTM, self).__init__()
        self.lstm = torch.nn.LSTM(batch_first=True,input_size=input_size,hidden_size=hidden_size)
        self.linear = torch.nn.Linear(in_features=hidden_size, out_features=output_size)
    def forward(self, x, hn, cn):
        # Stateful
        x_longer = x.view(1,x.shape[0]*x.shape[1],x.shape[2])
        out_longer, (hn, cn) = self.lstm(x_longer, (hn.detach(), cn.detach()))
        out = out_longer.view(x.shape[0],x.shape[1],out_longer.shape[2])
        out = self.linear(out[:,-1,:])
        return out.unsqueeze(-1), (hn, cn)

N_epochs = 10000
hidden_size = 2
features = 1
learning_rate = 0.001
output_size = 1
model = LSTM(input_size=features,hidden_size=hidden_size,output_size=output_size)#Create model
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)#optimizer
criterion = torch.nn.MSELoss() # loss
# Create dataset: Imagine original_batch_size=2
x = torch.tensor([[1.0, 2.0, 3.0],[4.0, 5.0, 6.0],[7.0, 8.0, 9.0],[10.0, 11.0, 12.0]]).unsqueeze(-1)
y = torch.tensor([[4.],[7.],[10.],[13.]]).unsqueeze(-1)
dataset = TensorDataset(x,y)
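batch_size = 2 # the original batch size mentioned above
dataloader = torch.utils.data.DataLoader(dataset,batch_size=batch_size) # no shuffling, so the state is carried over in order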
# Training
for epoch in range(0,N_epochs):
    # Create first hidden and cell state with batch=1 
    hn = torch.zeros(1, 1, hidden_size)#[num_layers*num_directions,batch,hidden_size]
    cn = torch.zeros(1, 1, hidden_size)#[num_layers*num_directions,batch,hidden_size]
    for x,y in dataloader:
        out, (hn,cn) = model(x,hn,cn)
        loss = criterion(out,y)
        optimizer.zero_grad()# Clear gradients left over from the previous minibatch
        loss.backward()# Backward
        optimizer.step()# gradient descent on adam step

I also stepped through the first epoch in the Spyder debugger, just to see the sizes of the tensors. I attach an image in case it is useful to someone else (the variable “out” is shown before applying out = self.linear(out[:,-1,:])).

@deividbotina I am not sure of Keras’ internal implementation, but in terms of behavior your code looks reasonable to me and similar to what Keras does.


Thank you for your explanation, which is very helpful for me to understand the principle :grin:

In the cell and hidden state initialization, should the batch dimension be 2 instead of 1, in order to match the dataloader batch size?
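A short sketch of the shape rule behind this question, with assumed sizes: `torch.nn.LSTM` expects the initial states to have shape `(num_layers * num_directions, batch, hidden_size)`, where `batch` is the batch size of the tensor actually passed to the LSTM. Because the `forward` method above reshapes each minibatch to batch size 1 before the LSTM call, the states keep a batch dimension of 1; a batch dimension of 2 would only be needed if the minibatch were passed without reshaping.

```python
import torch
from torch import nn

hidden_size = 2
lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
x = torch.randn(2, 3, 1)  # dataloader minibatch: (batch=2, time=3, features=1)

# After flattening to one long sequence, the LSTM sees batch size 1.
h1 = torch.zeros(1, 1, hidden_size)
c1 = torch.zeros(1, 1, hidden_size)
out_flat, _ = lstm(x.reshape(1, -1, 1), (h1, c1))

# Without flattening, the LSTM sees batch size 2 and needs matching states.
h2 = torch.zeros(1, 2, hidden_size)
c2 = torch.zeros(1, 2, hidden_size)
out_batched, _ = lstm(x, (h2, c2))

# Mixing the two (batch 2 input, batch 1 states) is rejected by PyTorch.
try:
    lstm(x, (h1, c1))
    mismatch_rejected = False
except RuntimeError:
    mismatch_rejected = True
```

With a batch dimension of 2 and no reshaping, each row of the minibatch would carry its own independent state instead of being chained into one long sequence.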