Loss does not decrease in PyTorch, but does in TensorFlow

PyTorch code:

def test_torch():
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    data = torch.randn(100000, 2)
    y = torch.sin(data[:, 0]*data[:, 1]).unsqueeze(1)
    print(y.shape)
    data = torch.cat([data, y], dim=1)

    model = nn.Sequential(
        nn.Linear(2, 100),
        nn.ReLU(),
        nn.Linear(100, 100), 
        nn.ReLU(),
        nn.Linear(100, 100), 
        nn.ReLU(),
        nn.Linear(100, 100), 
        nn.ReLU(),
        nn.Linear(100, 100), 
        nn.ReLU(),
        nn.Linear(100, 100), 
        nn.ReLU(),
        nn.Linear(100, 100), 
        nn.ReLU(),
        nn.Linear(100, 1)
    )

    for p in model.parameters():
        print(p.shape, p.requires_grad)

    criterion = torch.nn.L1Loss()
    opt = torch.optim.SGD(model.parameters(), 0.001)

    dl = DataLoader(data, batch_size=500)
    for epoch in range(100):
        for d in dl:
            pred = model(d[:, :2])
            loss = criterion(pred, d[:, 2])
            opt.zero_grad()
            loss.backward()
            opt.step()
            print("epoch:[{}], loss: {}".format(epoch, loss.item()))

TensorFlow version:

def test_tf():
    import tensorflow as tf
    import numpy as np
    from tensorflow import keras
    model = tf.keras.Sequential([
        keras.layers.Dense(units=10, activation='relu', input_shape=[2]),
        keras.layers.Dense(units=10, activation='relu'),
        keras.layers.Dense(units=10, activation='relu'),
        keras.layers.Dense(units=10, activation='relu'),
        keras.layers.Dense(units=10, activation='relu'),
        keras.layers.Dense(units=10, activation='relu'),
        keras.layers.Dense(units=10, activation='relu'),
        keras.layers.Dense(units=1),
    ]
    )
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mean_squared_error")
    xs = np.random.randn(100000, 2).astype(np.float32)
    ys = np.sin(xs[:,0] * xs[:, 1])
    model.fit(xs, ys, epochs=100, batch_size=500)

I think this is not a fair comparison. The model architectures are different (the number of units in the Keras Dense layers does not match the PyTorch Linear layers), and the optimizers are not the same either (SGD vs. Adam). Please fix these two issues; I hope you will then get the same result.
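For example, the PyTorch side could be aligned with the Keras recipe roughly as in the sketch below. This is only a suggestion, assuming the 100-unit width from your PyTorch model is kept and the Keras model is changed to units=100 as well, so that only the framework differs:

import torch
import torch.nn as nn

# Same depth/width as the PyTorch model above: 7 hidden layers of 100 units.
hidden = 100
layers = [nn.Linear(2, hidden), nn.ReLU()]
for _ in range(6):
    layers += [nn.Linear(hidden, hidden), nn.ReLU()]
layers.append(nn.Linear(hidden, 1))
model = nn.Sequential(*layers)

# Same optimizer and loss as the Keras version: Adam with lr=1e-3 and MSE.
criterion = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)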

Thank you for your answer!
It’s a pity you didn’t try to run the code. In fact, I have done more tests on the above code and it just does not converge, even when I reduce the problem to an identity mapping.
As for the unfairness you mentioned, it was just a small mistake that I didn’t notice.

Did you try to fix the issues and rerun the code?
As described before, the posted approaches use a different model, optimizer, and criterion, while only the data seems to be the same.

Got it!
There is a bug in my code:
loss = criterion(pred, d[:, 2])
Here pred has shape [batch_size, 1] while d[:, 2] has shape [batch_size], so L1Loss broadcasts them to a [batch_size, batch_size] matrix and averages every prediction against every target instead of matching them pairwise. The target should be d[:, 2:] (or the prediction squeezed to match). What’s more, SGD needs a larger learning rate than 0.001.
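For completeness, here is a condensed sketch of the corrected loop. The network is shortened only to keep the sketch brief, and the learning rate of 0.1 is just an illustrative value:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

data = torch.randn(100000, 2)
y = torch.sin(data[:, 0] * data[:, 1]).unsqueeze(1)
data = torch.cat([data, y], dim=1)

model = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 1))
criterion = nn.L1Loss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # larger than 0.001; 0.1 is illustrative

dl = DataLoader(data, batch_size=500)
for epoch in range(100):
    for d in dl:
        pred = model(d[:, :2])            # shape [batch, 1]
        loss = criterion(pred, d[:, 2:])  # d[:, 2:] keeps shape [batch, 1]; d[:, 2] would broadcast
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("epoch: [{}], loss: {}".format(epoch, loss.item()))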

Thanks, all!