I am running the PyTorch tutorial code given on this page.
It creates a basic ReLU network trained with gradient descent to fit a randomly generated output matrix.
The equivalent NumPy code is below:
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
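(As a sanity check that the backprop formulas above are right, here is a small finite-difference gradient check I ran on a tiny version of the network. This is my own diagnostic sketch, not part of the tutorial code; the dimensions and seed are arbitrary.)

```python
import numpy as np

# Tiny dimensions so the finite-difference check is fast
np.random.seed(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

def loss_fn(w1, w2):
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradient of the loss w.r.t. w2, same formulas as the loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)

# Central finite difference on one entry of w2
eps = 1e-6
i, j = 1, 0
w2p, w2m = w2.copy(), w2.copy()
w2p[i, j] += eps
w2m[i, j] -= eps
numeric = (loss_fn(w1, w2p) - loss_fn(w1, w2m)) / (2 * eps)
print(abs(numeric - grad_w2[i, j]))  # difference should be tiny (finite precision only)
```

The analytic and numeric gradients agree, so the divergence is not a bug in the formulas themselves.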
while the PyTorch code is here:
# -*- coding: utf-8 -*-
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
My issue is that when I run the logically identical code (everything below the for loop) on two identical sets of matrices, obtained by converting the matrices from NumPy to torch and vice versa as shown below:
x = x_copy.numpy()
y = y_copy.numpy()
w1 = w1_copy.numpy()
w2 = w2_copy.numpy()
or
x = torch.from_numpy(x_copy)
y = torch.from_numpy(y_copy)
w1 = torch.from_numpy(w1_copy)
w2 = torch.from_numpy(w2_copy)
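(One thing I checked while comparing the two runs is precision: np.random.randn produces float64 arrays, while the tutorial's torch code uses torch.float, i.e. float32. The snippet below is a standalone diagnostic sketch with made-up names, illustrating which conversions preserve the dtype and the data bit-for-bit.)

```python
import numpy as np
import torch

# np.random.randn gives float64; the tutorial's torch code uses
# torch.float, which is float32 -- a mismatch worth ruling out.
x_np = np.random.randn(4, 3)
x_t = torch.from_numpy(x_np)           # shares memory, keeps float64
print(x_np.dtype, x_t.dtype)           # float64 torch.float64

# A tensor created with dtype=torch.float round-trips as float32:
x_t32 = torch.randn(4, 3, dtype=torch.float)
x_back = x_t32.numpy()
print(x_back.dtype)                    # float32

# The from_numpy view is bit-identical to the source array
print(np.array_equal(x_np, x_t.numpy()))  # True
```

So as long as the copies are made with torch.from_numpy / .numpy() on float64 data, both versions should start from bit-identical matrices.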
respectively, the convergence graph differs wildly.
The numpy code gives the following loss transition:
index: 0 loss: 32461886.920378435
index: 1 loss: 27637965.695966113
index: 2 loss: 24769309.536160268
index: 3 loss: 20669833.99258788
index: 4 loss: 15293372.088940833
index: 5 loss: 10089927.685404746
index: 6 loss: 6205566.543079788
index: 7 loss: 3762003.192960376
index: 8 loss: 2356852.802093134
index: 9 loss: 1570218.63804998
index: 10 loss: 1119167.3165900838
index: 11 loss: 846504.7681372062
index: 12 loss: 670056.7573582053
index: 13 loss: 547736.9839199611
index: 14 loss: 457611.8523979621
index: 15 loss: 388126.11921249115
index: 16 loss: 332694.39484438574
index: 17 loss: 287504.80613440264
index: 18 loss: 250030.65232854194
index: 19 loss: 218545.42765090568
index: 20 loss: 191837.42046243855
...
index: 480 loss: 3.4945624488474176
index: 481 loss: 3.494177549413921
index: 482 loss: 3.4938240015619826
index: 483 loss: 3.4934986465915605
index: 484 loss: 3.493199029305159
index: 485 loss: 3.492922758173682
index: 486 loss: 3.4926678679706566
index: 487 loss: 3.492432506069835
index: 488 loss: 3.4922150942465526
index: 489 loss: 3.492014171723262
index: 490 loss: 3.4918284643947173
index: 491 loss: 3.4916568040109386
index: 492 loss: 3.4911714932805262
index: 493 loss: 3.49062017671815
index: 494 loss: 3.4902068107170057
index: 495 loss: 3.4882589525388568
index: 496 loss: 3.4863645270607586
index: 497 loss: 3.48474983737141
index: 498 loss: 3.4831788206912506
index: 499 loss: 3.4810279686457593
while torch gives a vastly different end loss value:
index: 0 loss: 32461886.920378435
index: 1 loss: 27637893.213226713
index: 2 loss: 24769381.523720074
index: 3 loss: 20670463.973418493
index: 4 loss: 15293362.230150145
index: 5 loss: 10086647.535185618
index: 6 loss: 6201742.552795921
index: 7 loss: 3759975.53442384
index: 8 loss: 2355847.717950373
index: 9 loss: 1569720.3495294875
index: 10 loss: 1118918.822694601
index: 11 loss: 846395.336106484
index: 12 loss: 669992.5621650221
index: 13 loss: 547668.6769875139
index: 14 loss: 457526.1008550389
index: 15 loss: 388029.46034035634
index: 16 loss: 332591.1598039481
index: 17 loss: 287396.0471030898
index: 18 loss: 249925.20021737402
index: 19 loss: 218433.5044875259
index: 20 loss: 191719.57327774086
...
index: 480 loss: 9.941666040553277e-06
index: 481 loss: 9.534154972461822e-06
index: 482 loss: 9.14340997886859e-06
index: 483 loss: 8.768607802731639e-06
index: 484 loss: 8.409279449369022e-06
index: 485 loss: 8.06474956779924e-06
index: 486 loss: 7.734277961509694e-06
index: 487 loss: 7.4173103039458215e-06
index: 488 loss: 7.113410993466194e-06
index: 489 loss: 6.8219551974512366e-06
index: 490 loss: 6.542470533809243e-06
index: 491 loss: 6.274452711234617e-06
index: 492 loss: 6.017395187113872e-06
index: 493 loss: 5.770911167284456e-06
index: 494 loss: 5.534570706870234e-06
index: 495 loss: 5.308066213162415e-06
index: 496 loss: 5.090699571016041e-06
index: 497 loss: 4.882202727606714e-06
index: 498 loss: 4.682264822265037e-06
index: 499 loss: 4.490566406027399e-06
For comparison, the end value at the 500th iteration is 4.490566406027399e-06 using torch, while it is 3.4810279686457593 using the NumPy code.
My question is: why and how does this happen, given that the code performs the exact same operations? Does it have to do with the torch tensor representation, or with something different in the operations themselves?
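(To localize where the two implementations first disagree, I also tried running a single backward step side by side on the same float64 data and diffing the results. This is a diagnostic sketch with arbitrary small dimensions, not the tutorial code.)

```python
import numpy as np
import torch

np.random.seed(0)
N, D_in, H, D_out = 8, 10, 6, 4
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

# --- one NumPy backward step ---
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_h = np.where(h < 0, 0.0, grad_y_pred.dot(w2.T))
grad_w1_np = x.T.dot(grad_h)

# --- the same step in torch, on the same float64 data ---
xt, yt = torch.from_numpy(x), torch.from_numpy(y)
w1t, w2t = torch.from_numpy(w1.copy()), torch.from_numpy(w2.copy())
ht = xt.mm(w1t)
ht_relu = ht.clamp(min=0)
grad_y_pred_t = 2.0 * (ht_relu.mm(w2t) - yt)
grad_ht = grad_y_pred_t.mm(w2t.t()).clone()
grad_ht[ht < 0] = 0
grad_w1_t = xt.t().mm(grad_ht)

# On identical float64 inputs the gradients should agree to
# within floating-point rounding of the matrix multiplies.
print(np.abs(grad_w1_np - grad_w1_t.numpy()).max())
```

On a single step the difference is down at rounding level, so whatever diverges seems to accumulate over the 500 iterations.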