Same LSTM (GRU) implementation, different results (PyTorch & Keras)

Hi, I am working with time series data and I have a problem that confuses me. I tuned a neural network with the same implementation in both Keras and PyTorch, but I get different results.

This is not the only problem. The Keras model always gives the same results (every time I train it). The PyTorch model, however, only gives results consistent with the Keras model in about 10% of the runs; most of the time it produces the very poor results shown below, nothing like the Keras results.
Please guide me. Thanks!

Keras model:

from tensorflow import keras
from tensorflow.keras import layers

# adam_optim is assumed to be a keras.optimizers.Adam instance defined elsewhere in the original code
model_input = keras.Input(shape=(x_train_T.shape[1], 8))
x_1 = layers.GRU(75, return_sequences=True)(model_input)
x_1 = layers.GRU(90)(x_1)
x_1 = layers.Dense(95)(x_1)
x_1 = layers.Dense(15)(x_1)
model = keras.models.Model(model_input, x_1)
model.compile(optimizer=adam_optim, loss="mse", metrics=["accuracy"])
model.fit(x_train_T, y_train, batch_size=1, epochs=100)

PyTorch model:

import numpy as np
import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, input_size, hidden_size_1, hidden_size_2, hidden_size_3, output_size, num_layers, device):
        super(GRU, self).__init__()
        self.input_size = input_size
        self.hidden_size_1 = hidden_size_1
        self.hidden_size_2 = hidden_size_2
        self.hidden_size_3 = hidden_size_3
        self.num_layers = num_layers
        self.device = device

        # Two stacked GRUs followed by two linear layers, mirroring the Keras model
        self.gru_1 = nn.GRU(input_size, hidden_size_1, num_layers, batch_first=True)
        self.gru_2 = nn.GRU(hidden_size_1, hidden_size_2, num_layers, batch_first=True)
        self.fc_1 = nn.Linear(hidden_size_2, hidden_size_3)
        self.fc_out = nn.Linear(hidden_size_3, output_size)  # was the global output_dim; use the constructor argument

    def forward(self, x):
        # Zero initial hidden states (this is also the default when no hidden state is passed)
        h_1 = torch.zeros(self.num_layers, x.size(0), self.hidden_size_1, device=self.device)
        h_2 = torch.zeros(self.num_layers, x.size(0), self.hidden_size_2, device=self.device)

        out_gru_1, h_1 = self.gru_1(x, h_1)
        out_gru_2, h_2 = self.gru_2(out_gru_1, h_2)
        out_dense_1 = self.fc_1(out_gru_2[:, -1, :])  # last time step only, like return_sequences=False in Keras
        out_dense_out = self.fc_out(out_dense_1)

        return out_dense_out
##############################
input_dim = 8
hidden_dim_1 = 75
hidden_dim_2 = 90
hidden_dim_3 = 95
num_layers = 1
output_dim = 15
num_epochs = 100

model = GRU(input_size=input_dim, hidden_size_1=hidden_dim_1, hidden_size_2=hidden_dim_2, hidden_size_3=hidden_dim_3, output_size=output_dim, num_layers=num_layers, device=device)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

import time

model = model.to(device)  # make sure the parameters live on the same device as the inputs
for t in range(num_epochs):
    start_time = time.time()
    loss_p = []
    for i in range(x_train_T.size(0)):
        # torch.as_tensor handles both tensors and numpy arrays without the copy warning of torch.tensor
        inputs = torch.as_tensor(x_train_T[i:i+1], dtype=torch.float32).to(device)
        target = torch.as_tensor(y_train[i:i+1], dtype=torch.float32).to(device)
        y_train_pred = model(inputs)

        loss_ = criterion(y_train_pred, target)

        optimizer.zero_grad()
        loss_.backward()
        optimizer.step()

        loss_p.append(loss_.item())  # store a plain float, not the graph-attached tensor
    loss_P = float(np.mean(loss_p))
    end_time = time.time()
    print("Epoch", t, "MSE:", loss_P, "/// epoch time: {0} seconds".format(round(end_time - start_time, 2)))
##############################

In rare cases, the loss of both models starts at approximately 0.09 and ends at approximately 0.015.
In most cases, the Keras loss behaves the same way, but the PyTorch loss stays at about 0.08.

In other words, sometimes the PyTorch model trains and sometimes it does not.

I think I should initialize the PyTorch layers the same way as the Keras layers, but how?

The LSTM (GRU) initialization in Keras is as follows:

def __init__(units, activation='tanh', recurrent_activation='sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False, time_major=False, reset_after=True, **kwargs)

and the Dense (linear) layers:

def __init__(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None, **kwargs)

How do I initialize the layers in PyTorch?

You can use the torch.nn.init methods to initialize the parameters as shown in e.g. this post.
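For example, here is a minimal sketch for the model posted above, assuming you want to mimic the Keras defaults (kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros'); since num_layers=1, only the *_l0 parameters exist:

import torch.nn as nn

# Mimic the Keras default initializers on the PyTorch model defined above
for gru in (model.gru_1, model.gru_2):
    nn.init.xavier_uniform_(gru.weight_ih_l0)  # kernel_initializer='glorot_uniform'
    nn.init.orthogonal_(gru.weight_hh_l0)      # recurrent_initializer='orthogonal'
    nn.init.zeros_(gru.bias_ih_l0)             # bias_initializer='zeros'
    nn.init.zeros_(gru.bias_hh_l0)

for fc in (model.fc_1, model.fc_out):
    nn.init.xavier_uniform_(fc.weight)         # glorot_uniform
    nn.init.zeros_(fc.bias)                    # zeros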


Hi @ptrblck, as a small follow-up question on that: I was wondering how we could use an if statement to initialize the kernel and recurrent weights separately for an LSTM in PyTorch, since Keras uses orthogonal initialization for the recurrent weights and Glorot for the kernel.

Something like

def init_weights(model):
    if isinstance(model, nn.RNNBase):   # nn.LSTM, nn.GRU and nn.RNN all subclass nn.RNNBase
        # Do orthogonal
    elif isinstance(model, nn.Linear):
        # Do Glorot

rnn = nn.LSTM(10, 20, 2)
rnn.apply(init_weights) 

I was trying to understand the right conditions from torch.nn.modules.rnn — PyTorch 1.8.1 documentation but wasn’t able to figure it out.

If I understand the use case correctly, you would like to use different init methods for the internal nn.LSTM parameters. In that case you could use a similar approach as seen here:

lstm = nn.LSTM(1, 1)
torch.nn.init.xavier_uniform_(lstm.weight_ih_l0)
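
For the if-statement approach from your snippet, a possible sketch could look like this (nn.LSTM, nn.GRU, and nn.RNN all subclass nn.RNNBase, and their flat parameters are named weight_ih_l{k}, weight_hh_l{k}, bias_ih_l{k}, bias_hh_l{k}):

import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.RNNBase):          # nn.LSTM / nn.GRU / nn.RNN
        for name, param in module.named_parameters():
            if "weight_ih" in name:             # input-to-hidden ("kernel")
                nn.init.xavier_uniform_(param)
            elif "weight_hh" in name:           # hidden-to-hidden ("recurrent kernel")
                nn.init.orthogonal_(param)
            elif "bias" in name:
                nn.init.zeros_(param)
    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)  # glorot_uniform
        nn.init.zeros_(module.bias)

rnn = nn.LSTM(10, 20, 2)
rnn.apply(init_weights)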

Let me know, if this works for you.


Thanks, this works. I'm not sure about its impact on training yet, but the reference to .weight_ih_l0 is what I was looking for.

Initialization did not solve the problem either.
1- What is in the Keras model that is not in the PyTorch model?
2- Did I write the PyTorch model correctly, matching the Keras model?
3- What is your advice for solving this problem?
The loss decreases from 0.1 to 0.01 in the Keras model, but gets stuck around 0.08 in the PyTorch model (most of the time).

  1. I don’t know how exactly the Keras model is working, so I cannot give you a proper answer. E.g. could you explain what the difference in the outputs of the Keras GRU would be if return_sequences is set to True or False? Also, do the Dense layers automatically apply an activation function without you specifying it? If so, you might want to add this non-linearity to the PyTorch model as well.

  2. Same as number 1.

  3. To properly debug the difference you could store all parameters of the Keras model, load them into the PyTorch model, and compare the outputs of both. If the difference for the same input is larger than the expected error due to the floating point precision, you could then compare the outputs of each layer and narrow down the difference further, as sketched below.
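
As an illustration of point 3, here is a minimal sketch of that workflow for the last Dense/Linear layer only; keras_model and pt_model are placeholder names for the two models above. Note that a Keras Dense kernel is stored as (in_features, out_features) and therefore has to be transposed for nn.Linear, and the GRU weights would additionally need their gates reordered, which is not shown here:

import numpy as np
import torch

# Placeholder names: keras_model is the trained Keras model, pt_model the PyTorch GRU model
# Pull the trained weights of the last Keras Dense layer: kernel (95, 15), bias (15,)
k_kernel, k_bias = keras_model.layers[-1].get_weights()

# Copy them into the corresponding PyTorch layer; the kernel has to be transposed
pt_layer = pt_model.fc_out.cpu()  # compare on the CPU to keep the example simple
with torch.no_grad():
    pt_layer.weight.copy_(torch.tensor(k_kernel.T))
    pt_layer.bias.copy_(torch.tensor(k_bias))

# Feed the same random input through both layers and compare the outputs
x = np.random.randn(4, 95).astype(np.float32)
out_keras = keras_model.layers[-1](x).numpy()
out_torch = pt_layer(torch.tensor(x)).detach().numpy()
print(np.abs(out_keras - out_torch).max())  # should be on the order of float32 precision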