While working through the examples from the Learning PyTorch with Examples tutorial in a Google Colab notebook, I came across some strange execution times.

The NumPy version of the two-layer neural network took about 26.3 seconds to train for 50,000 epochs.

NumPy version

```
import time

import numpy as np

tic = time.time()
# 2-layer neural network
"""
N: Batch size
D_in: Input layer dimension
H: Hidden layer dimension
D_out: Output layer dimension
"""
N, D_in, H, D_out = 256, 32, 64, 5
X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Xavier initialization
W_1 = np.random.randn(D_in, H) / D_in
W_2 = np.random.randn(H, D_out) / H
# Learning rate
eta = 0.1
# Epochs
for i in range(50000):
    # Forward pass
    h = np.maximum(0, X @ W_1)  # element-wise maximum  # N x H
    y_hat = h @ W_2  # N x D_out
    # Compute loss
    loss = np.sum((y - y_hat) ** 2) / (2 * N)  # Scalar
    if (i + 1) % 5000 == 0:
        print("Epoch", i + 1, "|", "Loss", loss)
    # Backward pass
    grad_y_hat = -(1 / N) * (y - y_hat)  # N x D_out
    grad_W_2 = h.T @ grad_y_hat  # H x D_out
    grad_h_relu = grad_y_hat @ W_2.T  # N x H
    grad_h = grad_h_relu.copy()
    grad_h[h <= 0] = 0  # zero gradient where the ReLU was inactive  # N x H
    grad_W_1 = X.T @ grad_h  # D_in x H
    # Weights update
    W_1 -= eta * grad_W_1
    W_2 -= eta * grad_W_2
toc = time.time()
print("Time taken:", toc - tic, "sec")
```
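
For reference, the hand-written backward pass above implements the gradients of the loss L = (1/2N)·||y − ŷ||² with h = ReLU(X·W_1) and ŷ = h·W_2, which in LaTeX notation are:

```
\frac{\partial L}{\partial \hat{y}} = -\tfrac{1}{N}\,(y - \hat{y}), \qquad
\frac{\partial L}{\partial W_2} = h^{\top}\,\frac{\partial L}{\partial \hat{y}}, \qquad
\frac{\partial L}{\partial h} = \frac{\partial L}{\partial \hat{y}}\,W_2^{\top}, \qquad
\frac{\partial L}{\partial W_1} = X^{\top}\!\left(\frac{\partial L}{\partial h} \odot \mathbf{1}[h > 0]\right)
```

The masking of `grad_h` in the code corresponds to the element-wise multiplication by the ReLU indicator in the last expression.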

The PyTorch tensor implementation of the same two-layer network took about 47.1 seconds to train on the GPU for the same 50,000 epochs.

Tensor version

```
import time

import torch

tic = time.time()
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print("We have", torch.cuda.device_count(), torch.cuda.get_device_name(), "GPU")
    print(torch.cuda.get_device_properties(device=device))
else:
    device = torch.device("cpu")
dtype = torch.float
# 2-layer neural network
"""
N: Batch size
D_in: Input layer dimension
H: Hidden layer dimension
D_out: Output layer dimension
"""
N, D_in, H, D_out = 256, 32, 64, 5
X = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Xavier initialization
W_1 = torch.randn(D_in, H, device=device, dtype=dtype) / D_in
W_2 = torch.randn(H, D_out, device=device, dtype=dtype) / H
W_1.requires_grad_(True)
W_2.requires_grad_(True)
# Learning rate
eta = 0.1
# Epochs
for i in range(50000):
    # Forward pass
    y_hat = X.mm(W_1).clamp(min=0).mm(W_2)
    # Compute loss
    loss = (y - y_hat).pow(2).sum() / (2 * N)
    if (i + 1) % 5000 == 0:
        print("Epoch", i + 1, "|", "Loss", loss.item())
    # Backward pass
    loss.backward()
    # Weights update
    with torch.no_grad():
        W_1 -= eta * W_1.grad
        W_2 -= eta * W_2.grad
        W_1.grad.zero_()
        W_2.grad.zero_()
toc = time.time()
print("Time taken:", toc - tic, "sec")
```
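
In case the way I measure matters: CUDA kernel launches are asynchronous, so below is a minimal sketch of how the training loop alone could be timed, calling `torch.cuda.synchronize()` (which blocks until all queued GPU kernels finish) before reading the clock. It assumes the same shapes and hyperparameters as above.

```
import time

import torch

# Sketch: time only the training loop, synchronizing with the GPU so the
# clock is read only after all queued kernels have completed.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
N, D_in, H, D_out = 256, 32, 64, 5
X = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
W_1 = (torch.randn(D_in, H, device=device) / D_in).requires_grad_(True)
W_2 = (torch.randn(H, D_out, device=device) / H).requires_grad_(True)
eta = 0.1

if device.type == "cuda":
    torch.cuda.synchronize()  # finish setup work before starting the clock
tic = time.time()
for i in range(50000):
    y_hat = X.mm(W_1).clamp(min=0).mm(W_2)
    loss = (y - y_hat).pow(2).sum() / (2 * N)
    loss.backward()
    with torch.no_grad():
        W_1 -= eta * W_1.grad
        W_2 -= eta * W_2.grad
        W_1.grad.zero_()
        W_2.grad.zero_()
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for pending kernels before stopping the clock
toc = time.time()
print("Loop time:", toc - tic, "sec")
```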

Shouldn’t the PyTorch tensor version on the GPU have a lower execution time than the NumPy version? Why am I seeing this behavior?