Logically identical code in torch vs. numpy converges very differently

I am running the PyTorch tutorial code given on this page.
It builds a basic ReLU network, trained with gradient descent, to fit a randomly generated output matrix.

The numpy code for the network is below:

import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

The PyTorch code is here:

# -*- coding: utf-8 -*-

import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

My issue is that when I run the logically identical code (the part from the for loop onward) on two identical sets of matrices, obtained by converting the matrices from torch to numpy and from numpy to torch as shown below:

x = x_copy.numpy()
y = y_copy.numpy()
w1 = w1_copy.numpy()
w2 = w2_copy.numpy()

or

x = torch.from_numpy(x_copy)
y = torch.from_numpy(y_copy)
w1 = torch.from_numpy(w1_copy)
w2 = torch.from_numpy(w2_copy)

respectively, the convergence behaviour differs wildly.
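
For reference, a quick way to sanity-check that converted copies like these start out identical (a minimal sketch with stand-in names, not the exact variables above):

import numpy as np
import torch

w1_np = np.random.randn(1000, 100)       # stand-in for one of the numpy copies
w1_pth = torch.from_numpy(w1_np.copy())  # stand-in for the corresponding torch copy

# torch.from_numpy keeps the numpy dtype (float64 here), so before any updates
# the two copies are exactly equal, element for element.
print(w1_np.dtype, w1_pth.dtype)              # float64 torch.float64
print(np.array_equal(w1_np, w1_pth.numpy()))  # True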

The numpy code gives the following loss progression:

index: 0 loss: 32461886.920378435
index: 1 loss: 27637965.695966113
index: 2 loss: 24769309.536160268
index: 3 loss: 20669833.99258788
index: 4 loss: 15293372.088940833
index: 5 loss: 10089927.685404746
index: 6 loss: 6205566.543079788
index: 7 loss: 3762003.192960376
index: 8 loss: 2356852.802093134
index: 9 loss: 1570218.63804998
index: 10 loss: 1119167.3165900838
index: 11 loss: 846504.7681372062
index: 12 loss: 670056.7573582053
index: 13 loss: 547736.9839199611
index: 14 loss: 457611.8523979621
index: 15 loss: 388126.11921249115
index: 16 loss: 332694.39484438574
index: 17 loss: 287504.80613440264
index: 18 loss: 250030.65232854194
index: 19 loss: 218545.42765090568
index: 20 loss: 191837.42046243855
...
index: 480 loss: 3.4945624488474176
index: 481 loss: 3.494177549413921
index: 482 loss: 3.4938240015619826
index: 483 loss: 3.4934986465915605
index: 484 loss: 3.493199029305159
index: 485 loss: 3.492922758173682
index: 486 loss: 3.4926678679706566
index: 487 loss: 3.492432506069835
index: 488 loss: 3.4922150942465526
index: 489 loss: 3.492014171723262
index: 490 loss: 3.4918284643947173
index: 491 loss: 3.4916568040109386
index: 492 loss: 3.4911714932805262
index: 493 loss: 3.49062017671815
index: 494 loss: 3.4902068107170057
index: 495 loss: 3.4882589525388568
index: 496 loss: 3.4863645270607586
index: 497 loss: 3.48474983737141
index: 498 loss: 3.4831788206912506
index: 499 loss: 3.4810279686457593

The torch code, on the other hand, ends with a vastly different loss:

index: 0 loss: 32461886.920378435
index: 1 loss: 27637893.213226713
index: 2 loss: 24769381.523720074
index: 3 loss: 20670463.973418493
index: 4 loss: 15293362.230150145
index: 5 loss: 10086647.535185618
index: 6 loss: 6201742.552795921
index: 7 loss: 3759975.53442384
index: 8 loss: 2355847.717950373
index: 9 loss: 1569720.3495294875
index: 10 loss: 1118918.822694601
index: 11 loss: 846395.336106484
index: 12 loss: 669992.5621650221
index: 13 loss: 547668.6769875139
index: 14 loss: 457526.1008550389
index: 15 loss: 388029.46034035634
index: 16 loss: 332591.1598039481
index: 17 loss: 287396.0471030898
index: 18 loss: 249925.20021737402
index: 19 loss: 218433.5044875259
index: 20 loss: 191719.57327774086
...
index: 480 loss: 9.941666040553277e-06
index: 481 loss: 9.534154972461822e-06
index: 482 loss: 9.14340997886859e-06
index: 483 loss: 8.768607802731639e-06
index: 484 loss: 8.409279449369022e-06
index: 485 loss: 8.06474956779924e-06
index: 486 loss: 7.734277961509694e-06
index: 487 loss: 7.4173103039458215e-06
index: 488 loss: 7.113410993466194e-06
index: 489 loss: 6.8219551974512366e-06
index: 490 loss: 6.542470533809243e-06
index: 491 loss: 6.274452711234617e-06
index: 492 loss: 6.017395187113872e-06
index: 493 loss: 5.770911167284456e-06
index: 494 loss: 5.534570706870234e-06
index: 495 loss: 5.308066213162415e-06
index: 496 loss: 5.090699571016041e-06
index: 497 loss: 4.882202727606714e-06
index: 498 loss: 4.682264822265037e-06
index: 499 loss: 4.490566406027399e-06

To compare: the loss after 500 iterations is 4.490566406027399e-06 using torch, while it is 3.4810279686457593 using the numpy code.

My question is why and how this happens, given that the code performs exactly the same operations. Does it have to do with torch's tensor representation, or with some difference in the operations themselves?

I cannot reproduce this issue.
If I clone the numpy arrays into PyTorch tensors, I get the same results:

import torch
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

x_copy = x.copy()
y_copy = y.copy()
w1_copy = w1.copy()
w2_copy = w2.copy()

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x_pth = torch.from_numpy(x_copy).clone()
y_pth = torch.from_numpy(y_copy).clone()

# Randomly initialize weights
w1_pth = torch.from_numpy(w1_copy).clone()
w2_pth = torch.from_numpy(w2_copy).clone()

learning_rate = 1e-6
for t in range(500):
    # numpy
    print('###########ITER{}#############'.format(t))
    
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    #print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    
    # PyTorch
    # Forward pass: compute predicted y
    h_pth = x_pth.mm(w1_pth)
    h_relu_pth = h_pth.clamp(min=0)
    y_pred_pth = h_relu_pth.mm(w2_pth)

    # Compute and print loss
    loss_pth = (y_pred_pth - y_pth).pow(2).sum().item()
    #print(t, loss_pth)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred_pth = 2.0 * (y_pred_pth - y_pth)
    grad_w2_pth = h_relu_pth.t().mm(grad_y_pred_pth)
    grad_h_relu_pth = grad_y_pred_pth.mm(w2_pth.t())
    grad_h_pth = grad_h_relu_pth.clone()
    grad_h_pth[h_pth < 0] = 0
    grad_w1_pth = x_pth.t().mm(grad_h_pth)

    # Update weights using gradient descent
    w1_pth -= learning_rate * grad_w1_pth
    w2_pth -= learning_rate * grad_w2_pth


    print('h diff {}'.format(np.abs((h - h_pth.numpy())).sum()))
    print('h_relu diff {}'.format(np.abs((h_relu - h_relu_pth.numpy())).sum()))
    print('y_pred diff {}'.format(np.abs((y_pred - y_pred_pth.numpy())).sum()))
    print('loss diff {}'.format(np.abs((loss - loss_pth)).sum()))
    print('grad_w1 diff {}'.format(np.abs((grad_w1 - grad_w1_pth.numpy())).sum()))
    print('grad_w2 diff {}'.format(np.abs((grad_w2 - grad_w2_pth.numpy())).sum()))
    print('w1 diff {}'.format(np.abs((w1 - w1_pth.numpy())).sum()))
    print('w2 diff {}'.format(np.abs((w2 - w2_pth.numpy())).sum()))
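
As a side note, if you would rather have a hard check than eyeballing the printed sums, something like the snippet below at the end of the loop body would stop at the first iteration where the two implementations drift apart. This is only a sketch reusing the variable names from the loop above, and the tolerance is illustrative:

    # Assert parity instead of printing raw difference sums; atol is illustrative.
    for name, a, b in [('h', h, h_pth), ('y_pred', y_pred, y_pred_pth),
                       ('grad_w1', grad_w1, grad_w1_pth), ('grad_w2', grad_w2, grad_w2_pth),
                       ('w1', w1, w1_pth), ('w2', w2, w2_pth)]:
        assert np.allclose(a, b.numpy(), atol=1e-6), '{} diverged at iteration {}'.format(name, t)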

This is odd. Running your code, I see you are right and there is barely any difference. But my code below still produces a significant divergence. If it were possible to share my notebook directly I would, to make it easier to identify the issue.

import numpy as np
import torch

# N is batch size; D_in is input dimension; H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6

x_pth = torch.from_numpy(x.copy())
y_pth = torch.from_numpy(y.copy())
w1_pth = torch.from_numpy(w1.copy())
w2_pth = torch.from_numpy(w2.copy())

for t in range(500):
    print('###########ITER{}#############'.format(t))

    # Numpy
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print("n::index: {} loss: {}".format(t, loss))
    
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 9
    grad_w1 = x.T.dot(grad_h)
    
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    
    # PyTorch
    # Forward pass: compute predicted y
    h_pth = x_pth.mm(w1_pth)
    h_relu_pth = h_pth.clamp(min=0)
    y_pred_pth = h_relu_pth.mm(w2_pth)

    # Compute and print loss
    loss_pth = (y_pred_pth - y_pth).pow(2).sum().item()
    print("t::index: {} loss: {} ".format(t, loss_pth))
    
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred_pth = 2.0 * (y_pred_pth - y_pth)
    grad_w2_pth = h_relu_pth.t().mm(grad_y_pred_pth)
    grad_h_relu_pth = grad_y_pred_pth.mm(w2_pth.t())
    grad_h_pth = grad_h_relu_pth.clone()
    grad_h_pth[h_pth < 0] = 0
    grad_w1_pth = x_pth.t().mm(grad_h_pth)

    # Update weights using gradient descent
    w1_pth -= learning_rate * grad_w1_pth
    w2_pth -= learning_rate * grad_w2_pth
    
    print('h diff {}'.format(np.abs((h - h_pth.numpy())).sum()))
    print('h_relu diff {}'.format(np.abs((h_relu - h_relu_pth.numpy())).sum()))
    print('y_pred diff {}'.format(np.abs((y_pred - y_pred_pth.numpy())).sum()))
    print('loss diff {}'.format(np.abs((loss - loss_pth)).sum()))
    print('grad_w1 diff {}'.format(np.abs((grad_w1 - grad_w1_pth.numpy())).sum()))
    print('grad_w2 diff {}'.format(np.abs((grad_w2 - grad_w2_pth.numpy())).sum()))
    print('w1 diff {}'.format(np.abs((w1 - w1_pth.numpy())).sum()))
    print('w2 diff {}'.format(np.abs((w2 - w2_pth.numpy())).sum()))

The output is below:

###########ITER0#############
n::index: 0 loss: 32921574.04117705
t::index: 0 loss: 32921574.041177053 
h diff 7.421582792765946e-11
h_relu diff 3.725270092402866e-11
y_pred diff 6.048528344848592e-11
loss diff 3.725290298461914e-09
grad_w1 diff 4065353.2335717604
grad_w2 diff 5.597303243121132e-08
w1 diff 4.06535323357177
w2 diff 5.757894161462218e-14
###########ITER1#############
n::index: 1 loss: 29717226.08150243
t::index: 1 loss: 29716603.051671255 
h diff 32.851621051144484
h_relu diff 3.949965047703455
y_pred diff 5.644082478416449
loss diff 623.0298311747611
grad_w1 diff 4097372.660340737
grad_w2 diff 6256.570590947957
w1 diff 8.141774648305152
w2 diff 0.006256570590955269
###########ITER2#############
n::index: 2 loss: 29378011.732096307
t::index: 2 loss: 29375152.464387543 
h diff 65.90636224617357
h_relu diff 7.70129499392989
y_pred diff 11.793809195840083
loss diff 2859.26770876348
grad_w1 diff 4632879.111421494
grad_w2 diff 21829.400199091036
w1 diff 12.629894861618949
w2 diff 0.017764305125805478
###########ITER3#############
n::index: 3 loss: 27489576.46335771
t::index: 3 loss: 27483470.981920168 
h diff 100.29585220241407
h_relu diff 12.499866559304008
y_pred diff 28.255401737714053
loss diff 6105.481437541544
grad_w1 diff 4154139.660371361
grad_w2 diff 50856.66975089095
w1 diff 16.683484950458865
w2 diff 0.03883599146424179
###########ITER4#############
n::index: 4 loss: 22296338.98998391
t::index: 4 loss: 22286859.04007204 
h diff 133.65578924936236
h_relu diff 15.580080755815128
y_pred diff 33.056447633272555
loss diff 9479.949911870062
grad_w1 diff 5447373.3723421255
grad_w2 diff 72172.6835485037
w1 diff 21.768199516712993
w2 diff 0.0525584130160989
###########ITER5#############
n::index: 5 loss: 15345585.398344453
t::index: 5 loss: 15335875.521069933 
h diff 169.71651626780434
h_relu diff 20.392333517423538
y_pred diff 47.42009801759393
loss diff 9709.877274520695
grad_w1 diff 5828125.888908837
grad_w2 diff 84795.76890928381
w1 diff 25.97968867985098
w2 diff 0.05902704816940646
###########ITER6#############
n::index: 6 loss: 9232704.192372538
t::index: 6 loss: 9225210.36973124 
h diff 204.20324061852884
h_relu diff 23.868457156467933
y_pred diff 44.66486896553138
loss diff 7493.822641298175
grad_w1 diff 6907505.512636057
grad_w2 diff 80146.10717293946
w1 diff 30.82451465524113
w2 diff 0.06368321225692193
###########ITER7#############
n::index: 7 loss: 5251849.887477319
t::index: 7 loss: 5246458.369704635 
h diff 240.90353388003086
h_relu diff 28.08783937229876
y_pred diff 46.3787750814962
loss diff 5391.517772683874
grad_w1 diff 5930602.96768639
grad_w2 diff 77033.03824488565
w1 diff 34.53565634272817
w2 diff 0.06530425787581359
###########ITER8#############
n::index: 8 loss: 3053754.8791371025
t::index: 8 loss: 3050640.394549883 
h diff 274.13865156123273
h_relu diff 30.923792709785612
y_pred diff 39.57013196666416
loss diff 3114.4845872195438
grad_w1 diff 5487288.835529647
grad_w2 diff 67542.40392124449
w1 diff 38.15194022822821
w2 diff 0.07083747419993928
###########ITER9#############
n::index: 9 loss: 1921673.2266638656
t::index: 9 loss: 1920219.8657676687 
h diff 306.4014924934821
h_relu diff 33.70117231707645
y_pred diff 33.75986295700867
loss diff 1453.3608961969148
grad_w1 diff 5442978.6257482115
grad_w2 diff 49869.4073642062
w1 diff 42.38886121287388
w2 diff 0.07484470701563689
###########ITER10#############
n::index: 10 loss: 1330123.295769027
t::index: 10 loss: 1329418.3228490907 
h diff 341.3403577921275
h_relu diff 37.25304560454123
y_pred diff 30.39811628047485
loss diff 704.9729199362919
grad_w1 diff 5132185.077664573
grad_w2 diff 38357.65959846997
w1 diff 46.381301115210306
w2 diff 0.08300534818959644
###########ITER11#############
n::index: 11 loss: 999753.3911197875
t::index: 11 loss: 999405.8787960429 
h diff 375.15028791470417
h_relu diff 40.83456131395981
y_pred diff 31.71397446635291
loss diff 347.51232374459505
grad_w1 diff 5932203.782850905
grad_w2 diff 32802.52909621535
w1 diff 50.60735816030332
w2 diff 0.09052511493740231
###########ITER12#############
n::index: 12 loss: 796585.7485236578
t::index: 12 loss: 796404.2176835143 
h diff 410.24142951889814
h_relu diff 44.2111043231081
y_pred diff 29.956350109079377
loss diff 181.53084014344495
grad_w1 diff 4957972.673597248
grad_w2 diff 27372.540624729692
w1 diff 54.69720739612023
w2 diff 0.1001565108479465
###########ITER13#############
n::index: 13 loss: 658537.3865504961
t::index: 13 loss: 658460.6496376303 
h diff 444.1385592404112
h_relu diff 47.863329135995954
y_pred diff 29.934917498716345
loss diff 76.73691286577377
grad_w1 diff 5486975.40273445
grad_w2 diff 21757.613765100577
w1 diff 58.934971317193614
w2 diff 0.10937096448058198
###########ITER14#############
n::index: 14 loss: 556840.7283765355
t::index: 14 loss: 556680.7891103958 
h diff 478.4496794463024
h_relu diff 51.842191473671456
y_pred diff 32.655434592517814
loss diff 159.9392661396414
grad_w1 diff 5119672.107542723
grad_w2 diff 21445.267503716026
w1 diff 63.274365666717586
w2 diff 0.12021751835337104
###########ITER15#############
n::index: 15 loss: 477590.97826642107
t::index: 15 loss: 477371.19777239265 
h diff 513.3013634136213
h_relu diff 56.07150943975372
y_pred diff 35.818128741063575
loss diff 219.78049402841134
grad_w1 diff 4907342.792520579
grad_w2 diff 21324.008491974244
w1 diff 67.65335537608634
w2 diff 0.13232362760276334
###########ITER16#############
n::index: 16 loss: 413443.9153701274
t::index: 16 loss: 413192.7212003227 
h diff 548.0530311010582
h_relu diff 60.197808665706134
y_pred diff 38.95497092636908
loss diff 251.1941698047449
grad_w1 diff 4654628.844691873
grad_w2 diff 23385.62361626742
w1 diff 72.02600290918097
w2 diff 0.14818970172861626
###########ITER17#############
n::index: 17 loss: 360279.78958790516
t::index: 17 loss: 360010.9080516224 
h diff 583.0356745530685
h_relu diff 64.5717355553377
y_pred diff 41.848889142989854
loss diff 268.8815362827736
grad_w1 diff 4765645.129138498
grad_w2 diff 25264.28836383774
w1 diff 76.41762143226374
w2 diff 0.16638939067001274
###########ITER18#############
n::index: 18 loss: 315567.6354686282
t::index: 18 loss: 315265.41033084283 
h diff 617.9945335436237
h_relu diff 68.75319663930898
y_pred diff 44.350411035057604
loss diff 302.22513778536813
grad_w1 diff 4862869.189882157
grad_w2 diff 26465.3378510822
w1 diff 80.7644788936617
w2 diff 0.18686854845254386
###########ITER19#############
n::index: 19 loss: 277543.3436425855
t::index: 19 loss: 277227.5782861173 
h diff 652.948931992868
h_relu diff 72.98809341513825
y_pred diff 46.15486699778341
loss diff 315.7653564682114
grad_w1 diff 4686624.649998466
grad_w2 diff 27283.65426742912
w1 diff 85.12747018997997
w2 diff 0.20892718789666725
###########ITER20#############
n::index: 20 loss: 244967.30981629543
t::index: 20 loss: 244655.07691135723 
h diff 688.1228156937676
h_relu diff 77.19169557088841
y_pred diff 47.969674100178324
loss diff 312.23290493819513
grad_w1 diff 4643268.257238863
grad_w2 diff 27748.782356941527
w1 diff 89.46858489587751
w2 diff 0.23298862429325531
###########ITER21#############
n::index: 21 loss: 216897.06271621998
t::index: 21 loss: 216589.3739995007 
h diff 722.906787017662
h_relu diff 81.31356089758239
y_pred diff 48.60732965668928
loss diff 307.6887167192763
grad_w1 diff 4820242.780492581
grad_w2 diff 27817.680072768773
w1 diff 93.8397455737954
w2 diff 0.2578288580862381
###########ITER22#############
n::index: 22 loss: 192627.71027329218
t::index: 22 loss: 192309.12557466573 
h diff 757.9195258109
h_relu diff 85.66614140762141
y_pred diff 51.198557889787494
loss diff 318.5846986264514
grad_w1 diff 4912817.826995057
grad_w2 diff 28732.261115007343
w1 diff 98.18441510044349
w2 diff 0.28314658796556913
###########ITER23#############
n::index: 23 loss: 171535.88160854025
t::index: 23 loss: 171213.42129400466 
h diff 792.6400648177547
h_relu diff 89.9436348907906
y_pred diff 52.82392572209477
loss diff 322.4603145355941
grad_w1 diff 4627441.410161908
grad_w2 diff 29559.396923227574
w1 diff 102.55506118194776
w2 diff 0.30882039294567754
###########ITER24#############
n::index: 24 loss: 153145.83468778513
t::index: 24 loss: 152810.21832797892 
h diff 827.694457089782
h_relu diff 94.09922591205346
y_pred diff 54.269409920305726
loss diff 335.61635980621213
grad_w1 diff 4654913.342507393
grad_w2 diff 30253.33329715385
w1 diff 106.90968944035255
w2 diff 0.33478304159946326
###########ITER25#############
n::index: 25 loss: 137041.24867368996
t::index: 25 loss: 136709.30441158268 
h diff 862.612059616673
h_relu diff 98.26498039169421
y_pred diff 56.09216617465978
loss diff 331.94426210728125
grad_w1 diff 4466299.838804316
grad_w2 diff 30980.585998806815
w1 diff 111.2686000467291
w2 diff 0.3609959049226128
###########ITER26#############
n::index: 26 loss: 122887.82791664226
t::index: 26 loss: 122564.23181223658 
h diff 897.8307820479785
h_relu diff 102.58553277381107
y_pred diff 57.16581187916799
loss diff 323.59610440567485
grad_w1 diff 4508435.311541969
grad_w2 diff 31010.56863494055
w1 diff 115.61616918199663
w2 diff 0.3875253906617886
###########ITER27#############
n::index: 27 loss: 110403.82213164843
t::index: 27 loss: 110094.29230002608 
h diff 932.9781494205951
h_relu diff 106.81635478578202
y_pred diff 57.67424162730414
loss diff 309.52983162234887
grad_w1 diff 4591035.296490643
grad_w2 diff 30910.061112695726
w1 diff 119.95216290970134
w2 diff 0.4140031473007964
###########ITER28#############
n::index: 28 loss: 99364.04851952268
t::index: 28 loss: 99067.9955479744 
h diff 968.0478953760976
h_relu diff 110.99817081607357
y_pred diff 58.18108993428267
loss diff 296.0529715482844
grad_w1 diff 4621122.590262412
grad_w2 diff 30611.30877873681
w1 diff 124.2909849903257
w2 diff 0.44044853505653503
###########ITER29#############
n::index: 29 loss: 89578.25957622746
t::index: 29 loss: 89301.92590716488 
h diff 1003.0600808302864
h_relu diff 115.05637833050388
y_pred diff 58.518257967613735
loss diff 276.3336690625729
grad_w1 diff 4605030.439778769
grad_w2 diff 30427.31537160011
w1 diff 128.61552776181054
w2 diff 0.4669243451916384
###########ITER30#############
n::index: 30 loss: 80896.54950536517
t::index: 30 loss: 80633.55903086296 
h diff 1038.1736792617212
h_relu diff 119.2508192720604
y_pred diff 58.677424841891444
loss diff 262.9904745022068
grad_w1 diff 4455846.168271583
grad_w2 diff 29854.997945030467
w1 diff 132.92900689508738
w2 diff 0.4931911487911357
...
###########ITER470#############
n::index: 470 loss: 3.022363071406211
t::index: 470 loss: 7.112493918446718e-07 
h diff 16491.101166782944
h_relu diff 1668.4926409985273
y_pred diff 34.71925804861987
loss diff 3.0223623601568192
grad_w1 diff 4265592.654670648
grad_w2 diff 9951.835536734867
w1 diff 2003.5104647125004
w2 diff 4.756869629692067
###########ITER471#############
n::index: 471 loss: 3.0222261373222743
t::index: 471 loss: 6.768216698630003e-07 
h diff 16526.298533806978
h_relu diff 1671.986782526966
y_pred diff 34.716706665620606
loss diff 3.0222254605006045
grad_w1 diff 4265592.485263649
grad_w2 diff 9950.797899380843
w1 diff 2007.7715988582781
w2 diff 4.766442127139085
###########ITER472#############
n::index: 472 loss: 3.0221319088811027
t::index: 472 loss: 6.440554654015919e-07 
h diff 16561.495907523487
h_relu diff 1675.4809294877555
y_pred diff 34.71469914658215
loss diff 3.0221312648256373
grad_w1 diff 4265592.388783819
grad_w2 diff 9949.928431028884
w1 diff 2012.032733370082
w2 diff 4.7760140361884815
###########ITER473#############
n::index: 473 loss: 3.0220717040707523
t::index: 473 loss: 6.128791516697395e-07 
h diff 16596.693285245223
h_relu diff 1678.9750791499991
y_pred diff 34.71305932341892
loss diff 3.0220710911916004
grad_w1 diff 4265592.310781509
grad_w2 diff 9949.309331201035
w1 diff 2016.2938845910696
w2 diff 4.785586826550382
###########ITER474#############
n::index: 474 loss: 3.0220369834185776
t::index: 474 loss: 5.832317293963657e-07 
h diff 16631.890664254337
h_relu diff 1682.469229437801
y_pred diff 34.71170108704407
loss diff 3.022036400186848
grad_w1 diff 4265592.222252181
grad_w2 diff 9948.84666444736
w1 diff 2020.5550543881354
w2 diff 4.795162064511838
###########ITER475#############
n::index: 475 loss: 3.022022565160633
t::index: 475 loss: 5.550130952631671e-07 
h diff 16667.088042648782
h_relu diff 1685.9633787920566
y_pred diff 34.71056151448348
loss diff 3.0220220101475377
grad_w1 diff 4265592.256154745
grad_w2 diff 9948.484705207793
w1 diff 2024.8162274202468
w2 diff 4.804737111770125
###########ITER476#############
n::index: 476 loss: 3.0220233912119587
t::index: 476 loss: 5.28169012018736e-07 
h diff 16702.2854200945
h_relu diff 1689.457526538753
y_pred diff 34.709590739389995
loss diff 3.0220228630429466
grad_w1 diff 4265592.233703637
grad_w2 diff 9948.194058815487
w1 diff 2029.0774099729977
w2 diff 4.814312024893566
###########ITER477#############
n::index: 477 loss: 3.0220361781319554
t::index: 477 loss: 5.026228223237656e-07 
h diff 16737.48279540207
h_relu diff 1692.9516718182113
y_pred diff 34.70875042099399
loss diff 3.022035675509133
grad_w1 diff 4265592.276639848
grad_w2 diff 9947.96096483012
w1 diff 2033.3386044838949
w2 diff 4.823886872813779
###########ITER478#############
n::index: 478 loss: 3.0220577983379364
t::index: 478 loss: 4.783104840579037e-07 
h diff 16772.680471266725
h_relu diff 1696.4461170498262
y_pred diff 34.70801077428108
loss diff 3.0220573200274523
grad_w1 diff 4265592.273126945
grad_w2 diff 9947.772967045117
w1 diff 2037.5998089437928
w2 diff 4.833461689708546
###########ITER479#############
n::index: 479 loss: 3.022086101736787
t::index: 479 loss: 4.5517939583558885e-07 
h diff 16807.87837106376
h_relu diff 1699.940786289084
y_pred diff 34.70734836674451
loss diff 3.022085646557391
grad_w1 diff 4265592.247229605
grad_w2 diff 9947.622189416825
w1 diff 2041.8610306970716
w2 diff 4.843036514480825
###########ITER480#############
n::index: 480 loss: 3.0221191009928776
t::index: 480 loss: 4.331668814298805e-07 
h diff 16843.076267586457
h_relu diff 1703.4354523001998
y_pred diff 34.706774453025496
loss diff 3.022118667825996
grad_w1 diff 4265592.2446066635
grad_w2 diff 9947.500806799762
w1 diff 2046.122281501238
w2 diff 4.85261136679861
###########ITER481#############
n::index: 481 loss: 3.0221553692418164
t::index: 481 loss: 4.122217264774399e-07 
h diff 16878.27416097925
h_relu diff 1706.9301152024445
y_pred diff 34.70639816090044
loss diff 3.02215495702009
grad_w1 diff 4265592.2275030995
grad_w2 diff 9947.403686936263
w1 diff 2050.3835351868106
w2 diff 4.862206456310048
###########ITER482#############
n::index: 482 loss: 3.022193627239918
t::index: 482 loss: 3.9229440008288906e-07 
h diff 16913.472705113483
h_relu diff 1710.425429001471
y_pred diff 34.706090864421505
loss diff 3.022193234945518
grad_w1 diff 4265592.2174663395
grad_w2 diff 9947.325670593353
w1 diff 2054.6448369719556
w2 diff 4.871810782141205
###########ITER483#############
n::index: 483 loss: 3.0222329300083812
t::index: 483 loss: 3.733308438697089e-07 
h diff 16948.67146109846
h_relu diff 1713.9209547530993
y_pred diff 34.70586639917501
loss diff 3.022232556677537
grad_w1 diff 4265592.221839529
grad_w2 diff 9947.263280806485
w1 diff 2058.9061703805646
w2 diff 4.881415149170699
###########ITER484#############
n::index: 484 loss: 3.022272455442439
t::index: 484 loss: 3.5528995538533217e-07 
h diff 16983.870214459857
h_relu diff 1717.4164778909956
y_pred diff 34.70565463594324
loss diff 3.0222721001524837
grad_w1 diff 4265592.170695864
grad_w2 diff 9947.213089838431
w1 diff 2063.16750387219
w2 diff 4.891019563476636
###########ITER485#############
n::index: 485 loss: 3.0223059617137142
t::index: 485 loss: 3.381339170322886e-07 
h diff 17019.068964440426
h_relu diff 1720.9119657445194
y_pred diff 34.70541586342374
loss diff 3.0223056235797974
grad_w1 diff 4266336.048809359
grad_w2 diff 9947.179075624534
w1 diff 2067.4291139814027
w2 diff 4.9006240386719435
###########ITER486#############
n::index: 486 loss: 3.0217452715496522
t::index: 486 loss: 3.2179533998656984e-07 
h diff 17054.278466376294
h_relu diff 1724.4068620637308
y_pred diff 34.70349875163919
loss diff 3.021744949754312
grad_w1 diff 4266336.701381656
grad_w2 diff 9946.192015658922
w1 diff 2071.690777490773
w2 diff 4.910227986153808
###########ITER487#############
n::index: 487 loss: 3.021232577002208
t::index: 487 loss: 3.0624980569438243e-07 
h diff 17089.488372443142
h_relu diff 1727.9021601429138
y_pred diff 34.70144684988913
loss diff 3.0212322707524026
grad_w1 diff 4266337.162835328
grad_w2 diff 9945.041742258158
w1 diff 2075.9524426784865
w2 diff 4.919831159396214
###########ITER488#############
n::index: 488 loss: 3.020759745875726
t::index: 488 loss: 2.9145489533951045e-07 
h diff 17124.69827032508
h_relu diff 1731.3974480008849
y_pred diff 34.699375661373274
loss diff 3.0207594544208307
grad_w1 diff 4266337.720225545
grad_w2 diff 9944.159446826508
w1 diff 2080.214116616071
w2 diff 4.929433646296081
###########ITER489#############
n::index: 489 loss: 3.0203961701973476
t::index: 489 loss: 2.773772545790244e-07 
h diff 17159.90816271236
h_relu diff 1734.8925679484873
y_pred diff 34.697230464755606
loss diff 3.020395892820093
grad_w1 diff 4266971.645936875
grad_w2 diff 9943.21408538241
w1 diff 2084.476072572035
w2 diff 4.93904096467468
###########ITER490#############
n::index: 490 loss: 3.0203352814728848
t::index: 490 loss: 2.639851676233487e-07 
h diff 17195.12712137095
h_relu diff 1738.386575627914
y_pred diff 34.691215073020146
loss diff 3.0203350174877173
grad_w1 diff 4266970.70248623
grad_w2 diff 9943.369750456674
w1 diff 2088.7380757872975
w2 diff 4.948653154031385
###########ITER491#############
n::index: 491 loss: 3.020208143123704
t::index: 491 loss: 2.5123754803738525e-07 
h diff 17230.345999994366
h_relu diff 1741.8805190014236
y_pred diff 34.68560330118568
loss diff 3.0202078918861557
grad_w1 diff 4266969.966068109
grad_w2 diff 9942.941997861932
w1 diff 2093.000102802636
w2 diff 4.958268817540814
###########ITER492#############
n::index: 492 loss: 3.0200332378923953
t::index: 492 loss: 2.391079188865544e-07 
h diff 17265.564816579405
h_relu diff 1745.3744118290774
y_pred diff 34.680171790026655
loss diff 3.0200329987844765
grad_w1 diff 4266969.36865697
grad_w2 diff 9942.648070528334
w1 diff 2097.2621515357146
w2 diff 4.9678852809625305
###########ITER493#############
n::index: 493 loss: 3.0198160299562256
t::index: 493 loss: 2.2756419867576346e-07 
h diff 17300.783580705556
h_relu diff 1748.8682620526115
y_pred diff 34.67558780861723
loss diff 3.019815802392027
grad_w1 diff 4266968.90390586
grad_w2 diff 9942.14231114303
w1 diff 2101.5242203310495
w2 diff 4.9775016877879015
###########ITER494#############
n::index: 494 loss: 3.019568326494648
t::index: 494 loss: 2.1658221591064664e-07 
h diff 17336.002302988243
h_relu diff 1752.3620780295664
y_pred diff 34.67169438636162
loss diff 3.019568109912432
grad_w1 diff 4266968.513673546
grad_w2 diff 9941.718278721019
w1 diff 2105.786292451738
w2 diff 4.987118027760508
###########ITER495#############
n::index: 495 loss: 3.0192950607919644
t::index: 495 loss: 2.0613025503144356e-07 
h diff 17371.221108448182
h_relu diff 1755.8559837239338
y_pred diff 34.66869142492034
loss diff 3.0192948546617093
grad_w1 diff 4266968.216752986
grad_w2 diff 9941.176262810892
w1 diff 2110.0483728047802
w2 diff 4.996734142859358
###########ITER496#############
n::index: 496 loss: 3.019003690906242
t::index: 496 loss: 1.9618916314978162e-07 
h diff 17406.439922990292
h_relu diff 1759.3499034950066
y_pred diff 34.66579128656829
loss diff 3.019003494717079
grad_w1 diff 4266968.028958089
grad_w2 diff 9940.712321733265
w1 diff 2114.310466500564
w2 diff 5.006350026427327
###########ITER497#############
n::index: 497 loss: 3.0186980644124373
t::index: 497 loss: 1.867232446496014e-07 
h diff 17441.658715606434
h_relu diff 1762.843805062581
y_pred diff 34.663285505960175
loss diff 3.0186978776891924
grad_w1 diff 4266967.850151654
grad_w2 diff 9940.275117484918
w1 diff 2118.572568546297
w2 diff 5.015965598604594
###########ITER498#############
n::index: 498 loss: 3.0183829336241104
t::index: 498 loss: 1.7771451159144105e-07 
h diff 17476.877495017812
h_relu diff 1766.3376967301201
y_pred diff 34.66123338898663
loss diff 3.018382755909599
grad_w1 diff 4266967.736922134
grad_w2 diff 9939.867408159073
w1 diff 2122.8346736338863
w2 diff 5.025580854815926
###########ITER499#############
n::index: 499 loss: 3.018061127748044
t::index: 499 loss: 1.6914097895699502e-07 
h diff 17512.097056226972
h_relu diff 1769.8323727554684
y_pred diff 34.65914189898263
loss diff 3.018060958607065
grad_w1 diff 4266967.653397342
grad_w2 diff 9939.453846773271
w1 diff 2127.096822140588
w2 diff 5.035195755500524

P.S. I’ve tried this on separate systems and the issue reproduces there, so running the code should give you similar results.

P.P.S.: It turned out to be a bug in my code where I mistyped a 9 in place of a 0 in grad_h[h < 0] = 9. Consider this post closed.
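
For anyone finding this later, the corrected line is simply the ReLU mask that both versions above already use:

grad_h[h < 0] = 0  # zero the gradient where the ReLU input was negative (not 9)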
