Code supposed to be equivalent but gives different results

I have these two pieces of code that are supposed to give the same result.

import torch

torch.manual_seed(1)
D_in, D_out, H, N = 1000, 10, 100, 100

dtype = torch.float64
device = torch.device('cpu')

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)
model.double()  # cast the model's parameters to float64 to match x and y
loss_fn = torch.nn.MSELoss(reduction='sum')

l = 1e-6  # learning rate
for i in range(500):
    y_pred = model(x)

    loss = loss_fn(y_pred, y)
    print(i, loss.item())

    model.zero_grad()
    loss.backward()
    # grad_w_1 = 2.0*(y_pred-y).dot(w_2.T).T.dot(x).T
    # grad_w_1[grad_w_1 < 0] = 0
    with torch.no_grad():
        for param in model.parameters():
            param -= l * param.grad

and

import torch

torch.manual_seed(1)
D_in, D_out, H, N = 1000, 10, 100, 100

dtype = torch.float64
device = torch.device('cpu')

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w_1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w_2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

l = 1e-6  # learning rate
for i in range(500):
    h = x.mm(w_1) #NxH
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w_2) #NxD_out

    loss = (y_pred - y).pow(2).sum()
    print(i, loss.item())

    loss.backward()
    # grad_w_1 = 2.0*(y_pred-y).dot(w_2.T).T.dot(x).T
    # grad_w_1[grad_w_1 < 0] = 0
    with torch.no_grad():
        w_1 -= l * w_1.grad
        w_2 -= l * w_2.grad
        w_1.grad.zero_()
        w_2.grad.zero_()

The first one’s result is:

...
497 791.8494193212754
498 791.394974676931
499 790.9410212657938

while the second one’s is:

...
497 398.63481097164174
498 396.12781489790547
499 393.64120085253165

How do I solve this?

Thanks

Hi,

The difference I see between the two is that the one with Linear layers includes a bias in each layer, while the other one does not.
Also, when you say it works for float32: is that only for that particular random seed?
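For example (just a quick check, not from your post), printing the model's parameters should show the extra bias tensors:

for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# roughly:
# 0.weight (100, 1000)
# 0.bias (100,)
# 2.weight (10, 100)
# 2.bias (10,)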

Right, sorry about the float32. Let's ignore it; I have edited the question.

I have changed part of the code to:

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False)
)

But I still get a different result:

497 802.9330597078987
498 802.4759949553443
499 802.0193235987608

Is there anything else that I did wrong here? Thanks

Have you tried running both with multiple random seeds? Do you see the same differences for a single method with different seeds?
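For example, here is a rough sketch (manual version only, the function name is just for illustration) of how you could try a few seeds:

import torch

def train_manual(seed, steps=500, lr=1e-6):
    # Same setup as the manual script above, wrapped in a function so the
    # seed can be varied easily.
    torch.manual_seed(seed)
    D_in, D_out, H, N = 1000, 10, 100, 100
    dtype = torch.float64
    x = torch.randn(N, D_in, dtype=dtype)
    y = torch.randn(N, D_out, dtype=dtype)
    w_1 = torch.randn(D_in, H, dtype=dtype, requires_grad=True)
    w_2 = torch.randn(H, D_out, dtype=dtype, requires_grad=True)
    for _ in range(steps):
        y_pred = x.mm(w_1).clamp(min=0).mm(w_2)
        loss = (y_pred - y).pow(2).sum()
        loss.backward()
        with torch.no_grad():
            w_1 -= lr * w_1.grad
            w_2 -= lr * w_2.grad
            w_1.grad.zero_()
            w_2.grad.zero_()
    return loss.item()

for seed in (0, 1, 2):
    print(seed, train_manual(seed))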

I saw that both methods have a decreasing loss with different seeds. I think both are correct, right?

I think it is because the weights are initialized differently. There's no way to get similar results (or even close to it).

Yes, that's what I was expecting. Given that they do different things with different initializations, there is no reason they should converge to the same thing. And potentially some runs get to better local minima than others.
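If you want to check that the initialization really is the only remaining difference, a rough sketch (assuming the bias=False model above, already cast to double) is to copy the Linear weights into w_1 and w_2 before the manual loop:

# Sketch: reuse the Sequential model's initial weights for the manual version.
# nn.Linear stores its weight as (out_features, in_features), hence the
# transpose to match the x.mm(w_1) layout used above.
w_1 = model[0].weight.detach().t().clone().requires_grad_(True)
w_2 = model[2].weight.detach().t().clone().requires_grad_(True)

With the same starting weights (and no biases), the two loops should then produce essentially the same loss values.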