The loss in PyTorch decreases more slowly than in Keras

I’m training an LSTM model with the Adam optimizer. The input has shape (10000, 48) and the output has shape (10000, 16). Here is the original Keras model:

from keras.models import Sequential
from keras.layers import Embedding, CuDNNLSTM, Dropout, Dense

keras_model = Sequential()
keras_model.add(Embedding(16, 10, input_length=48))
keras_model.add(CuDNNLSTM(50))
keras_model.add(Dropout(0.1))
keras_model.add(Dense(16, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

train_history = keras_model.fit(X_train,
                                y_train,
                                epochs=20,
                                verbose=1,
                                shuffle=False,
                                batch_size=256)

The following is the PyTorch model:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

def hard_sigmoid(x):
    """
    Computes element-wise hard sigmoid of x, i.e. clips 0.2 * x + 0.5 to [0, 1].
    """
    x = (0.2 * x) + 0.5
    x = F.threshold(-x, -1, -1)  # upper clip at 1 (operates on the negated values)
    x = F.threshold(-x, 0, 0)    # lower clip at 0
    return x
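
# (Just for reference: the two thresholds above seem to amount to clamping
#  to [0, 1], so this alternative, not used below, should be equivalent.)
def hard_sigmoid_clamp(x):
    return torch.clamp(0.2 * x + 0.5, min=0.0, max=1.0)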

class LSTM(nn.Module):

    def __init__(self):
        super().__init__()

        self.embedding = nn.Embedding(16, 10)
        self.lstm = nn.LSTM(10, 50)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(50, 16)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x, None)  # discard the final (h_n, c_n) state
        x = x[:, -1, :]            # keep only the last timestep's output
        x = self.dropout(x)
        x = self.fc(x)
        x = hard_sigmoid(x)
        return x

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = LSTM().to(device)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
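
# (Roughly how I convert the arrays beforehand: nn.Embedding needs long
#  integer indices and BCELoss needs float targets.)
X_train, y_train = torch.as_tensor(X_train, dtype=torch.long), torch.as_tensor(y_train, dtype=torch.float32)
X_test, y_test = torch.as_tensor(X_test, dtype=torch.long), torch.as_tensor(y_test, dtype=torch.float32)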

training_set = TensorDataset(X_train, y_train)
train_loader = DataLoader(training_set, batch_size=256, shuffle=False)
test_set = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)
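
# quick sanity check on the batch shapes coming out of the loader
xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)  # expected: torch.Size([256, 48]) torch.Size([256, 16])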

def Train_LSTM(num_epochs):
    hist = np.zeros(num_epochs)
    for t in range(num_epochs):
        running_loss = 0.0
        model.train()
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            y_pred = model(inputs)
            loss = criterion(y_pred, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        epoch_loss = running_loss / len(train_loader)  # mean loss over the batches
        print("Epoch ", t+1, "Training loss: ", epoch_loss)
        hist[t] = epoch_loss
    return hist
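
I run the loop for 20 epochs to match the Keras setup:

hist = Train_LSTM(20)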

However, the training losses of the two models differ. Here is the result for the Keras model:

Epoch 1/20 loss: 0.3668 - acc: 0.8220
Epoch 2/20 loss: 0.3186 - acc: 0.8366
Epoch 3/20 loss: 0.2893 - acc: 0.8529
Epoch 4/20 loss: 0.2680 - acc: 0.8673
Epoch 5/20 loss: 0.2461 - acc: 0.8818
Epoch 6/20 loss: 0.2292 - acc: 0.8918
Epoch 7/20 loss: 0.2153 - acc: 0.8994
Epoch 8/20 loss: 0.2037 - acc: 0.9053
Epoch 9/20 loss: 0.1955 - acc: 0.9098
Epoch 10/20 loss: 0.1884 - acc: 0.9134
Epoch 11/20 loss: 0.1804 - acc: 0.9173
Epoch 12/20 loss: 0.1749 - acc: 0.9198
Epoch 13/20 loss: 0.1705 - acc: 0.9218
Epoch 14/20 loss: 0.1655 - acc: 0.9242
Epoch 15/20 loss: 0.1600 - acc: 0.9267
Epoch 16/20 loss: 0.1580 - acc: 0.9275
Epoch 17/20 loss: 0.1537 - acc: 0.9292
Epoch 18/20 loss: 0.1506 - acc: 0.9306
Epoch 19/20 loss: 0.1485 - acc: 0.9315
Epoch 20/20 loss: 0.1442 - acc: 0.9332

Here is the result for the PyTorch model:

Epoch  1 Training loss:  0.3904338072785331
Epoch  2 Training loss:  0.3710213891228142
Epoch  3 Training loss:  0.3628670514163459
Epoch  4 Training loss:  0.3596793907453947
Epoch  5 Training loss:  0.35954090678478445
Epoch  6 Training loss:  0.35725244792068706
Epoch  7 Training loss:  0.3549379930852929
Epoch  8 Training loss:  0.35340563144982623
Epoch  9 Training loss:  0.35231793117340265
Epoch  10 Training loss:  0.349751780214517
Epoch  11 Training loss:  0.3518899157452766
Epoch  12 Training loss:  0.3493136685065296
Epoch  13 Training loss:  0.3415819657656848
Epoch  14 Training loss:  0.33805217672034604
Epoch  15 Training loss:  0.32990270841609487
Epoch  16 Training loss:  0.3201482021595206
Epoch  17 Training loss:  0.3118191622483456
Epoch  18 Training loss:  0.305637656117949
Epoch  19 Training loss:  0.29569833011121094
Epoch  20 Training loss:  0.3000877904884346

I’m not sure how to fix this. Any suggestions would be appreciated.

Is it possible that the weight initialization is different? Why would that cause such a huge difference in performance?
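
In case it is the initialization, this is roughly what I would try in order to push the PyTorch layers towards what I believe are the Keras defaults (glorot_uniform kernels, orthogonal recurrent weights, zero biases); I haven't verified that these really are the Keras defaults:

def init_like_keras(m):
    # Assumed Keras-style defaults: glorot_uniform kernels,
    # orthogonal recurrent weights, zero biases.
    if isinstance(m, nn.LSTM):
        for name, param in m.named_parameters():
            if "weight_ih" in name:
                nn.init.xavier_uniform_(param)
            elif "weight_hh" in name:
                nn.init.orthogonal_(param)
            elif "bias" in name:
                nn.init.zeros_(param)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model.apply(init_like_keras)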