PyTorch tensors are taking more time to execute than NumPy arrays

While working through the examples from Learning PyTorch with Examples in a Google Colab notebook, I came across some strange execution times.

The NumPy version of the two-layer neural network took 26.286247730255127 sec to train for 50000 epochs.

Numpy Version

import time
import numpy as np

tic = time.time()


# 2-layer neural network

"""
  N:      Batch size
  D_in:   Input layer dimension
  H:      Hidden layer dimension
  D_out:  Output layer dimension
"""

N, D_in, H, D_out = 256, 32, 64, 5

X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Xavier initialization
W_1 = np.random.randn(D_in, H) / D_in
W_2 = np.random.randn(H, D_out) / H

# Learning rate
eta = 0.1

# Epochs
for i in range(50000):
  # Forward pass
  h = np.maximum(0, X @ W_1) # element-wise maximum   # N x H
  y_hat = h @ W_2                                     # N x D_out
  
  # Compute loss
  loss = np.sum((y-y_hat)**2) / (2*N)               # Scalar
  if (i+1)%5000 == 0:
    print("Epoch", i+1, "|", "Loss", loss)
  
  # Backward pass
  grad_y_hat = -(1/N) * (y-y_hat)                     # N x D_out
  grad_W_2 = h.T @ grad_y_hat                         # H x D_out
  grad_h_relu = grad_y_hat @ W_2.T                    # N x H
  grad_h = grad_h_relu.copy()
  grad_h[h <= 0] = 0    # ReLU backward: zero the gradient where the activation was clipped   # N x H
  grad_W_1 = X.T @ grad_h                             # D_in x H
  
  # Weights update
  W_1 -= eta * grad_W_1
  W_2 -= eta * grad_W_2
  
  
toc = time.time()
print("Time taken:", toc-tic, "sec")

The PyTorch tensor implementation of the same two-layer network took 47.054221868515015 sec to train (on GPU) for the same number of epochs (50000).

Tensors Version

import time
import torch

tic = time.time()

if torch.cuda.is_available():
  device = torch.device("cuda:0")
  print("We have", torch.cuda.device_count(), torch.cuda.get_device_name(), "GPU")
  print(torch.cuda.get_device_properties(device=device))
else:
  device = torch.device("cpu")

dtype = torch.float

# 2-layer neural network

"""
  N:      Batch size
  D_in:   Input layer dimension
  H:      Hidden layer dimension
  D_out:  Output layer dimension
"""

N, D_in, H, D_out = 256, 32, 64, 5

X = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Xavier initialization
W_1 = torch.randn(D_in, H, device=device, dtype=dtype) / D_in
W_2 = torch.randn(H, D_out, device=device, dtype=dtype) / H
W_1.requires_grad_(True)
W_2.requires_grad_(True)

# Learning rate
eta = 0.1

# Epochs
for i in range(50000):
  # Forward pass
  y_hat = X.mm(W_1).clamp(min=0).mm(W_2)
  
  # Compute loss
  loss = (y-y_hat).pow(2).sum() / (2*N)                            
  if (i+1)%5000 == 0:
    print("Epoch", i+1, "|", "Loss", loss.item())
  
  # Backward pass
  loss.backward()
  
  # Weights update
  with torch.no_grad():
    W_1 -= eta * W_1.grad
    W_2 -= eta * W_2.grad
    
    W_1.grad.zero_()
    W_2.grad.zero_()
  
toc = time.time()
print("Time taken:", toc-tic, "sec")

Shouldn’t the PyTorch tensor version on the GPU have a lower execution time than the NumPy version? Why am I seeing this behavior?

First of all, you are measuring different things. CUDA needs time to initialize on the GPU, so you should measure only the execution time of the training loop itself. Secondly, you are using very small matrices. A GPU takes advantage of massive parallelization, but its individual cores are slower than a CPU's, so with very small dimensions you may get similar or even worse times.
Lastly, CUDA has its own profiling tools; time.time() only measures time on the CPU side.
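
To illustrate the first and third points, here is a minimal sketch (assuming a CUDA device is available and reusing the shapes from your example) that keeps the CUDA initialization outside the timed region and calls torch.cuda.synchronize() before reading the clock, since GPU kernels are launched asynchronously and time.time() could otherwise return before the work has actually finished:

import time
import torch

# Assumes CUDA is available; shapes match the example above
device = torch.device("cuda:0")
N, D_in, H = 256, 32, 64

# CUDA context creation and data setup happen outside the timed region
X = torch.randn(N, D_in, device=device)
W_1 = torch.randn(D_in, H, device=device) / D_in
torch.cuda.synchronize()            # finish setup before starting the clock

tic = time.time()
for _ in range(50000):
    h = X.mm(W_1).clamp(min=0)      # kernels are queued asynchronously
torch.cuda.synchronize()            # wait for the queued kernels to complete
toc = time.time()
print("GPU loop time:", toc - tic, "sec")

For finer-grained GPU measurements, torch.cuda.Event(enable_timing=True) can record timestamps on the device itself.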


Oh, now that makes sense. Thanks mate!