Hi, I have a very simple LSTM example written in Keras that I am trying to port to PyTorch, but it does not seem to learn at all. I am an absolute beginner, so any advice is appreciated.
X_train_lmse has shape (1691, 1, 1); I am essentially predicting X(t) from X(t-1) as a single feature.
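In case it helps, the lag-1 dataset is built roughly like this (a sketch from memory; raw_values stands in for my actual series, which I haven't included):

    import numpy as np

    # raw_values: a 1-D float series (placeholder name, not my real variable)
    series = np.asarray(raw_values, dtype=np.float32)
    X_train_lmse = series[:-1].reshape(-1, 1, 1)  # X(t-1): (samples, timesteps, features)
    y_train = series[1:].reshape(-1, 1)           # X(t):   (samples, 1)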
    from keras.models import Sequential
    from keras.layers import LSTM, Dense
    from keras.callbacks import EarlyStopping

    lstm_model = Sequential()
    # input_shape is (timesteps, features) = (1, 1); the sample axis is implicit
    lstm_model.add(LSTM(7, input_shape=(X_train_lmse.shape[1], X_train_lmse.shape[2]),
                        activation='relu', kernel_initializer='lecun_uniform',
                        return_sequences=False))
    lstm_model.add(Dense(1))
    lstm_model.compile(loss='mean_squared_error', optimizer='adam')

    early_stop = EarlyStopping(monitor='loss', patience=2, verbose=1)
    history_lstm_model = lstm_model.fit(X_train_lmse, y_train, epochs=100, batch_size=1,
                                        verbose=1, shuffle=False, callbacks=[early_stop])
1691/1691 [==============================] - 10s 6ms/step - loss: 0.0236
1691/1691 [==============================] - 9s 5ms/step - loss: 0.0076
X_train_tensor has the same shape as in Keras (1691, 1, 1). I am specifying batch_first=True below, so I think the layout should be fine.
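Here is the sanity check I used to convince myself of the input layout (my own aside, not part of the training script):

    import torch
    import torch.nn as nn

    # With batch_first=True, nn.LSTM expects (batch, seq, feature)
    lstm = nn.LSTM(input_size=1, hidden_size=7, batch_first=True)
    out, (h, c) = lstm(torch.zeros(1691, 1, 1))
    print(out.shape)  # torch.Size([1691, 1, 7])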
    import torch
    import torch.nn as nn
    import torch.optim as optim

    class LSTM_model(nn.Module):
        def __init__(self):
            super(LSTM_model, self).__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=7, num_layers=1, batch_first=True)
            self.dense = nn.Linear(7, 1)

        def forward(self, x):
            # out: (batch, seq, hidden); the Linear is applied per timestep
            out, states = self.lstm(x)
            out = self.dense(out)
            return out

    lstm_model = LSTM_model()
    loss_function = nn.MSELoss()
    optimizer = optim.Adam(lstm_model.parameters())

    for t in range(100):
        # One forward/backward pass over the whole training set per epoch
        y_pred = lstm_model(X_train_tensor)
        loss = loss_function(y_pred, Y_train_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print('Train Epoch ', t, ' Loss = ', loss)
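As a debugging aid (my addition, not in the run above) I also compared the prediction and target shapes, since MSELoss will silently broadcast mismatched shapes:

    # y_pred is (batch, seq, 1) = (1691, 1, 1) because the Linear runs per timestep.
    # If Y_train_tensor is e.g. (1691, 1), MSELoss broadcasts and the loss is off.
    print(y_pred.shape, Y_train_tensor.shape)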
Train Epoch 0 Loss = tensor(0.2834, grad_fn=<MseLossBackward>)
Train Epoch 1 Loss = tensor(0.2812, grad_fn=<MseLossBackward>)
Train Epoch 2 Loss = tensor(0.2790, grad_fn=<MseLossBackward>)
Train Epoch 3 Loss = tensor(0.2768, grad_fn=<MseLossBackward>)
Train Epoch 4 Loss = tensor(0.2746, grad_fn=<MseLossBackward>)
Train Epoch 5 Loss = tensor(0.2725, grad_fn=<MseLossBackward>)
Train Epoch 6 Loss = tensor(0.2704, grad_fn=<MseLossBackward>)
Train Epoch 7 Loss = tensor(0.2683, grad_fn=<MseLossBackward>)
As you can see, the loss barely moves in PyTorch, and each epoch runs much faster than in Keras.
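One difference I do notice: Keras with batch_size=1 makes 1691 weight updates per epoch, while my PyTorch loop makes a single full-batch update per epoch, which might explain both the speed and the slow loss decrease. Something like this would mirror the Keras setup (my untested guess):

    from torch.utils.data import DataLoader, TensorDataset

    # Mini-batches of 1 to match Keras batch_size=1, no shuffling
    loader = DataLoader(TensorDataset(X_train_tensor, Y_train_tensor),
                        batch_size=1, shuffle=False)

    for t in range(100):
        for x_batch, y_batch in loader:
            y_pred = lstm_model(x_batch)
            loss = loss_function(y_pred, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print('Train Epoch ', t, ' Loss = ', loss.item())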
I must be doing something stupid. I checked the input data and it looks identical in both implementations. Thanks!