PyTorch Tensor performance vs Numpy array

I am comparing two examples from the PyTorch tutorials site (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#warm-up-numpy).

Both of them do exactly the same thing (training a NN via ‘explicit’ backpropagation, no autograd); one uses ‘numpy’, the other replaces the numpy arrays with Tensors. There is line-by-line equivalence between the two examples.

The interesting (and confusing) thing is that the PyTorch implementation runs significantly faster than the ‘numpy’ one (same machine, CPU only, many repeated tests, results always consistent). Initially I got an approx 3x speedup with PyTorch. I realized that one explanation could be the Tensor dtype: ‘numpy’ defaults to double precision, while I was using dtype = torch.FloatTensor. But even after changing to dtype = torch.DoubleTensor the performance difference is still significant, approx 1.5x in favor of PyTorch.
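For reference, the dtype mismatch is easy to check directly; a minimal sketch (independent of the tutorial code):

```python
import numpy as np
import torch

# numpy defaults to double precision (float64)
x_np = np.random.randn(4, 4)
print(x_np.dtype)  # float64

# torch.randn defaults to single precision; casting to DoubleTensor
# makes the comparison apples-to-apples
x_t = torch.randn(4, 4).type(torch.DoubleTensor)
print(x_t.type())  # torch.DoubleTensor
```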

Running things under a profiler produced the following results:

NUMPY VERSION:


Total time: 6.31922 s
File: <ipython-input-13-2060abc16fc5>
Function: as_numpy at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def as_numpy():
     4         1            3      3.0      0.0      start_time = timeit.default_timer()
     5        11            9      0.8      0.0      for counter in range(NREPS):
     6                                           
     7                                                   # Create random input and output data
     8        10        22161   2216.1      0.4          x = np.random.randn(N, D_in)
     9        10          336     33.6      0.0          y = np.random.randn(N, D_out)
    10                                           
    11                                                   # Randomly initialize weights
    12        10        34533   3453.3      0.5          w1 = np.random.randn(D_in, H)
    13        10          511     51.1      0.0          w2 = np.random.randn(H, D_out)
    14                                           
    15        10            9      0.9      0.0          learning_rate = 1e-6
    16      5010         6445      1.3      0.1          for t in range(500):
    17                                                       # Forward pass: compute predicted y
    18      5000      2283485    456.7     36.1              h = x.dot(w1)
    19      5000       223497     44.7      3.5              h_relu = np.maximum(h, 0)
    20      5000        92365     18.5      1.5              y_pred = h_relu.dot(w2)
    21                                           
    22                                                       # Compute and print loss
    23      5000       150941     30.2      2.4              loss = np.square(y_pred - y).sum()
    24                                           
    25                                                       # Backprop to compute gradients of w1 and w2 with respect to loss
    26      5000        28744      5.7      0.5              grad_y_pred = 2.0 * (y_pred - y)
    27      5000       112421     22.5      1.8              grad_w2 = h_relu.T.dot(grad_y_pred)
    28      5000        71577     14.3      1.1              grad_h_relu = grad_y_pred.dot(w2.T)
    29      5000        31708      6.3      0.5              grad_h = grad_h_relu.copy()
    30      5000       260032     52.0      4.1              grad_h[h < 0] = 0
    31      5000      1941999    388.4     30.7              grad_w1 = x.T.dot(grad_h)
    32                                           
    33                                                       # Update weights
    34      5000      1007904    201.6     15.9              w1 -= learning_rate * grad_w1
    35      5000        50394     10.1      0.8              w2 -= learning_rate * grad_w2
    36                                           
    37         1          142    142.0      0.0      print(timeit.default_timer() - start_time)

PYTORCH TENSOR VERSION:


Total time: 4.39713 s
File: <ipython-input-6-cdd37f3d1dd1>
Function: as_pytorch at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def as_pytorch():
     4         1            3      3.0      0.0      start_time = timeit.default_timer()
     5        11            7      0.6      0.0      for counter in range(NREPS):
     6                                           
     7                                                   # Create random input and output data
     8        10        28274   2827.4      0.6          x = torch.randn(N, D_in).type(dtype)
     9        10          457     45.7      0.0          y = torch.randn(N, D_out).type(dtype)
    10                                           
    11                                                   # Randomly initialize weights
    12        10        41973   4197.3      1.0          w1 = torch.randn(D_in, H).type(dtype)
    13        10          699     69.9      0.0          w2 = torch.randn(H, D_out).type(dtype)
    14                                           
    15        10           10      1.0      0.0          learning_rate = 1e-6
    16      5010         5371      1.1      0.1          for t in range(500):
    17                                                       # Forward pass: compute predicted y
    18      5000      1379133    275.8     31.4              h = x.mm(w1)
    19      5000        72269     14.5      1.6              h_relu = h.clamp(min=0)
    20      5000        77204     15.4      1.8              y_pred = h_relu.mm(w2)
    21                                           
    22                                                       # Compute and print loss
    23      5000       193164     38.6      4.4              loss = (y_pred - y).pow(2).sum()
    24                                           
    25                                                       # Backprop to compute gradients of w1 and w2 with respect to loss
    26      5000        44580      8.9      1.0              grad_y_pred = 2.0 * (y_pred - y)
    27      5000        78500     15.7      1.8              grad_w2 = h_relu.t().mm(grad_y_pred)
    28      5000        80033     16.0      1.8              grad_h_relu = grad_y_pred.mm(w2.t())
    29      5000        42751      8.6      1.0              grad_h = grad_h_relu.clone()
    30      5000       279782     56.0      6.4              grad_h[h < 0] = 0
    31      5000      1277310    255.5     29.0              grad_w1 = x.t().mm(grad_h)
    32                                           
    33                                                       # Update weights using gradient descent
    34      5000       743668    148.7     16.9              w1 -= learning_rate * grad_w1
    35      5000        51751     10.4      1.2              w2 -= learning_rate * grad_w2
    36                                           
    37         1          196    196.0      0.0      print(timeit.default_timer() - start_time)

It seems like the matrix multiplications are faster when carried out with Tensors than with numpy arrays.
Is this expected? I thought numpy was highly optimized on CPUs; is there an explanation for why the Tensor ops are so much faster?
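Since the profiles point at the two big matmuls (h = x.dot(w1) and grad_w1 = x.T.dot(grad_h)), the difference can be isolated with a microbenchmark. A sketch, assuming the tutorial's dimensions (N=64, D_in=1000, H=100); torch.from_numpy shares the underlying float64 buffer, so both libraries operate on identical data:

```python
import timeit
import numpy as np
import torch

# Dimensions assumed from the tutorial: batch 64, input 1000, hidden 100
N, D_in, H = 64, 1000, 100

x_np = np.random.randn(N, D_in)
w_np = np.random.randn(D_in, H)
x_t = torch.from_numpy(x_np)   # zero-copy view, same float64 data
w_t = torch.from_numpy(w_np)

t_np = timeit.timeit(lambda: x_np.dot(w_np), number=5000)
t_t = timeit.timeit(lambda: x_t.mm(w_t), number=5000)
print('numpy x.dot(w): %.3f s' % t_np)
print('torch x.mm(w) : %.3f s' % t_t)
```

If the gap shows up here too, it is purely the BLAS backend doing the GEMM, not anything about the training loop.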


If you install MKL the numbers are quite close (using Double in PyTorch):

$ python example_numpy.py
(with openblas) 3.18742799759 seconds used

$ python example_numpy.py
(with MKL) 2.46531510353 seconds used

$ python example_pytorch.py
2.75312805176 seconds used
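To see which BLAS library your numpy build is actually linked against (OpenBLAS, MKL, etc.), you can dump the build configuration; a quick check, assuming a reasonably recent PyTorch for the second call:

```python
import numpy as np
import torch

# Prints numpy's build configuration; look for 'mkl' or 'openblas'
# in the listed library names
np.show_config()

# Recent PyTorch versions expose their build configuration too
print(torch.__config__.show())
```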