Multi-step time series forecasting

I’m developing a multi-step time series forecasting model using a GRU (or a bidirectional GRU). The goal is to predict the temperature for the next two months given the previous three (I have daily temperatures from 1995 to 2020 → dataset).
However, after the first epoch the training loss gets stuck: it neither decreases nor increases for any of the remaining epochs.

Train Loss : 0.548
Train Loss : 0.548
Train Loss : 0.548
Train Loss : 0.548
Train Loss : 0.548
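
For context, the "previous three months in, next two months out" setup amounts to slicing the daily series into overlapping input/target windows. A minimal sketch of that slicing in plain Python follows; the helper name make_windows and the toy series are made up for illustration and are not the actual preprocessing used here. With daily data, n_steps_in would be roughly 90 and n_steps_out roughly 60, but those exact values are an assumption.

```python
def make_windows(series, n_steps_in, n_steps_out):
    """Split a 1-D sequence into overlapping (input, target) window pairs.

    Hypothetical helper: each input holds n_steps_in consecutive values and
    its target holds the n_steps_out values that immediately follow.
    """
    inputs, targets = [], []
    last_start = len(series) - n_steps_in - n_steps_out
    for start in range(last_start + 1):
        split = start + n_steps_in
        inputs.append(series[start:split])
        targets.append(series[split:split + n_steps_out])
    return inputs, targets


# Toy stand-in for the daily temperatures (not the real dataset).
toy_series = list(range(10))
X, y = make_windows(toy_series, n_steps_in=3, n_steps_out=2)
```

Each row of X would then be reshaped to (n_steps_in, 1) before being fed to the GRU, and each row of y is the vector the final linear layer is trained to reproduce.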

Since I could not find any error, I built the same application in Keras, using the same hyperparameters, the same input preprocessing (including normalization), the same loss (L1Loss, i.e. MAE), and the same number of layers. The only difference is that in PyTorch I used a GRU, while in Keras I used an LSTM with a ‘ReLU’ activation function in each LSTM cell. In Keras, however, everything works and the loss decreases normally.

Epoch 1/50
284/284 [==============================] - 35s 114ms/step - loss: 0.1528
Epoch 2/50
284/284 [==============================] - 31s 110ms/step - loss: 0.0922
Epoch 3/50
284/284 [==============================] - 33s 115ms/step - loss: 0.0845
Epoch 4/50
284/284 [==============================] - 30s 107ms/step - loss: 0.0740
Epoch 5/50
284/284 [==============================] - 32s 114ms/step - loss: 0.0709

Below is the GRU in PyTorch, followed by the model in Keras.

import torch
import torch.nn as nn

class GRU_net(nn.Module):
    def __init__(self, hidden_dim, num_layers, output_size, drop_prob=0.0):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.output_size = output_size
        # Unidirectional GRU over a single input feature (the temperature)
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_dim, num_layers=num_layers,
                          bidirectional=False, batch_first=True, dropout=drop_prob)
        # Fully connected layer mapping the last GRU output to all forecast steps
        self.fc = nn.Linear(hidden_dim, output_size)

    def init_hidden(self, batch_size, device):
        # Zero initial hidden state of shape (num_layers, batch, hidden_dim),
        # created on the same device as the input instead of a global `device`
        return torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device)

    def forward(self, x):
        # x: (batch, n_steps_in, 1)
        hidden = self.init_hidden(x.size(0), x.device)
        gru_out, h = self.gru(x, hidden)
        gru_out = gru_out[:, -1, :]  # keep only the output at the last time step
        out = self.fc(gru_out)
        return out

hidden_dim = 64
num_layers = 2
output_size = n_steps_out

gru_net = GRU_net(hidden_dim, num_layers, output_size).to(device)


model = Sequential()
model.add(LSTM(64, activation='relu', return_sequences=True, input_shape=(n_steps_in,n_features)))
model.add(LSTM(64, activation='relu')) 

Furthermore, I checked the output shape of each layer, and they match between the PyTorch and Keras implementations.

Thank you in advance for the help.

I’m unsure how to interpret the Keras model but aren’t you using two LSTM layers while the PyTorch model uses a single GRU? How did you compare all intermediate activations?

I am now using a GRU in the Keras implementation as well, and it works just as well as the LSTM does for this task; at the time I copied and pasted the code, I was still using the LSTM. Sorry about that.

However, in Keras the two layers are stacked one after the other; isn’t that the same as what PyTorch does when num_layers=2 is set in the GRU?
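
On the stacking question, one way to check it is to copy the weights of a two-layer nn.GRU into two single-layer nn.GRU modules chained by hand and compare the outputs. The sketch below does this on random data. The names stacked, layer0, and layer1 and the batch and sequence sizes are made up for illustration, and it assumes dropout is 0 (as in the model above), since nn.GRU applies dropout between layers otherwise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 64

# One module with two stacked layers, as in the model above.
stacked = nn.GRU(input_size=1, hidden_size=hidden_dim, num_layers=2, batch_first=True)

# Two separate single-layer GRUs; the second consumes the first one's output sequence.
layer0 = nn.GRU(input_size=1, hidden_size=hidden_dim, batch_first=True)
layer1 = nn.GRU(input_size=hidden_dim, hidden_size=hidden_dim, batch_first=True)

# Copy each stacked layer's parameters into the matching single-layer module.
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(layer0, f"{name}_l0").copy_(getattr(stacked, f"{name}_l0"))
        getattr(layer1, f"{name}_l0").copy_(getattr(stacked, f"{name}_l1"))

x = torch.randn(4, 90, 1)  # (batch, time steps, features), arbitrary sizes
with torch.no_grad():
    out_stacked, _ = stacked(x)
    seq0, _ = layer0(x)            # full output sequence of the first layer
    out_chained, _ = layer1(seq0)  # fed as input to the second layer

max_diff = (out_stacked - out_chained).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

If the difference is at floating-point noise level, the layer stacking itself is not what separates the two implementations, which leaves the activation and the initialization as candidates.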
Could the problem be caused by the fact that the LSTM/GRU in Keras is set to use the “ReLU” activation, while PyTorch uses “tanh”?
As far as I know, the only way to change the activation function of an LSTM/GRU in PyTorch is to write a custom LSTM/GRU cell, right?
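
For reference, nn.GRU and nn.LSTM do not take an activation argument (only nn.RNN has a nonlinearity option), so a hand-written cell is one way to try this. Below is a minimal sketch, assuming the gate equations from the PyTorch GRU documentation with the candidate-state tanh swapped for a configurable activation. CustomGRUCell and its attribute names are hypothetical, and this Python-loop version will be slower than the fused nn.GRU.

```python
import torch
import torch.nn as nn


class CustomGRUCell(nn.Module):
    """GRU cell whose candidate-state activation is configurable (hypothetical sketch)."""

    def __init__(self, input_size, hidden_size, activation=torch.relu):
        super().__init__()
        self.activation = activation
        # Each linear layer produces the reset, update, and candidate parts at once.
        self.x2h = nn.Linear(input_size, 3 * hidden_size)
        self.h2h = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x, h):
        x_r, x_z, x_n = self.x2h(x).chunk(3, dim=-1)
        h_r, h_z, h_n = self.h2h(h).chunk(3, dim=-1)
        r = torch.sigmoid(x_r + h_r)        # reset gate
        z = torch.sigmoid(x_z + h_z)        # update gate
        n = self.activation(x_n + r * h_n)  # candidate state (tanh in nn.GRU)
        return (1 - z) * n + z * h


# Unroll the cell over a sequence; sizes are arbitrary illustration values.
torch.manual_seed(0)
cell = CustomGRUCell(input_size=1, hidden_size=64, activation=torch.relu)
x = torch.randn(4, 90, 1)  # (batch, time steps, features)
h = torch.zeros(4, 64)
for t in range(x.size(1)):
    h = cell(x[:, t, :], h)
```

The final h plays the role of gru_out[:, -1, :] in a single-layer version of the model above. Note that ReLU is unbounded, so the hidden state can grow over long sequences; gradient clipping or a smaller learning rate may be needed.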