Beginner question: migrating Keras to PyTorch

Hi, I have a very simple LSTM example written in Keras that I am trying to port to PyTorch, but the PyTorch version does not seem to learn at all. I am an absolute beginner, so any advice is appreciated.


X_train_lmse has shape (1691, 1, 1); I am essentially predicting X(t) with X(t-1) as the single feature.

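from keras.models import Sequential
from keras.layers import LSTM
from keras.callbacks import EarlyStopping

lstm_model = Sequential()
lstm_model.add(LSTM(7, input_shape=(1, X_train_lmse.shape[1]), activation='relu', kernel_initializer='lecun_uniform', return_sequences=False))
lstm_model.compile(loss='mean_squared_error', optimizer='adam')
early_stop = EarlyStopping(monitor='loss', patience=2, verbose=1)
# batch_size=1 means one weight update per sample, i.e. 1691 updates per epoch
history_lstm_model = lstm_model.fit(X_train_lmse, y_train, epochs=100, batch_size=1, verbose=1, shuffle=False, callbacks=[early_stop])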

Epoch 1/100
1691/1691 [==============================] - 10s 6ms/step - loss: 0.0236
Epoch 2/100
1691/1691 [==============================] - 9s 5ms/step - loss: 0.0076
Epoch 3/100


X_train_tensor has the same shape as in Keras, (1691, 1, 1). I am setting batch_first=True below, so I think the layout should be OK.

import torch.nn as nn
import torch.optim as optim

class LSTM_model(nn.Module):

    def __init__(self):
        super(LSTM_model, self).__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=7, num_layers=1, batch_first=True)
        self.dense = nn.Linear(7, 1)

    def forward(self, x):
        out, states = self.lstm(x)
        out = self.dense(out)
        return out

lstm_model = LSTM_model()
loss_function = nn.MSELoss()
optimizer = optim.Adam(lstm_model.parameters())

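for t in range(100):
    optimizer.zero_grad()
    # Full-batch forward pass: all 1691 samples go through the model at once
    y_pred = lstm_model(X_train_tensor)
    loss = loss_function(y_pred, Y_train_tensor)
    loss.backward()
    optimizer.step()  # a single weight update per epoch
    print('Train Epoch ', t, ' Loss = ', loss)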

Train Epoch 0 Loss = tensor(0.2834, grad_fn=)
Train Epoch 1 Loss = tensor(0.2812, grad_fn=)
Train Epoch 2 Loss = tensor(0.2790, grad_fn=)
Train Epoch 3 Loss = tensor(0.2768, grad_fn=)
Train Epoch 4 Loss = tensor(0.2746, grad_fn=)
Train Epoch 5 Loss = tensor(0.2725, grad_fn=)
Train Epoch 6 Loss = tensor(0.2704, grad_fn=)
Train Epoch 7 Loss = tensor(0.2683, grad_fn=)

As you can see, the error barely moves in PyTorch. Also, each epoch runs much faster than in Keras.

I must be doing something stupid. I checked the input data and they look identical in both implementations. Thanks!

The problem lies in the number of iterations per epoch: in your Keras code, you train for up to 100 epochs with a batch size of 1 on a dataset of 1691 samples, which gives up to 1691*100 weight updates (fewer if early stopping triggers).
In PyTorch you also run 100 epochs, but with only 1 update each, since you forward the whole dataset at once and calculate the gradient on all of it. That is also why each epoch runs so much faster. To reproduce the Keras behavior, you’d have to iterate over the first dimension of your tensor too, doing one update per sample.
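
Iterating over the first dimension, as suggested above, could look roughly like the sketch below. It is only an illustration under some assumptions: the data is a synthetic sine series standing in for the real one (which is not shown in the thread), the targets are assumed to have the same (samples, 1, 1) shape as the inputs, and the epoch count is cut to 3 to keep it quick. The model, loss and optimizer are the ones from the question.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in data (an assumption; the real series is not in the thread).
# Shapes follow the question: (samples, seq_len=1, features=1).
series = torch.sin(torch.linspace(0, 20, 201))
X_train_tensor = series[:-1].reshape(-1, 1, 1)   # X(t-1)
Y_train_tensor = series[1:].reshape(-1, 1, 1)    # X(t), assumed same layout


class LSTM_model(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=7, num_layers=1, batch_first=True)
        self.dense = nn.Linear(7, 1)

    def forward(self, x):
        out, _ = self.lstm(x)      # out: (batch, seq_len, 7)
        return self.dense(out)     # (batch, seq_len, 1)


lstm_model = LSTM_model()
loss_function = nn.MSELoss()
optimizer = optim.Adam(lstm_model.parameters())

n_epochs = 3      # kept small for the sketch; the question uses 100
n_updates = 0
epoch_losses = []

for epoch in range(n_epochs):
    running_loss = 0.0
    # One sample per step, in order (mirrors batch_size=1, shuffle=False in Keras).
    for i in range(X_train_tensor.shape[0]):
        x = X_train_tensor[i:i + 1]   # slicing keeps the batch dimension: (1, 1, 1)
        y = Y_train_tensor[i:i + 1]

        optimizer.zero_grad()
        loss = loss_function(lstm_model(x), y)
        loss.backward()
        optimizer.step()              # one weight update per sample

        running_loss += loss.item()
        n_updates += 1

    # Mean per-sample loss over the epoch, comparable to the number Keras prints
    epoch_losses.append(running_loss / X_train_tensor.shape[0])
    print('Train Epoch', epoch, 'Loss =', epoch_losses[-1])

print('Total updates:', n_updates)
```

With this loop the number of updates per epoch equals the number of samples, instead of 1. Wrapping the tensors in a `TensorDataset` and iterating a `DataLoader` with `batch_size=1, shuffle=False` should give the same per-sample updates with less manual indexing.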

Yes, that is exactly what the problem was. Thanks a lot for the suggestion!