# PyTorch Tensor performance vs Numpy array

I am comparing two examples from the PyTorch tutorials site (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#warm-up-numpy).

Both of them do exactly the same thing (training an NN via 'explicit' backpropagation, no autograd); one uses numpy, the other replaces the numpy arrays with Tensors. There is line-by-line equivalence between the two examples.

The interesting (and confusing) thing is that the PyTorch implementation runs significantly faster than the numpy one (same machine, CPU only, many repeated tests, consistent results). Initially I got an approx. 3x speedup with PyTorch. I realized one explanation could be the Tensor dtype: numpy uses double precision by default, and I was using `dtype = torch.FloatTensor`. But even after changing to `dtype = torch.DoubleTensor` the performance difference is still significant, approx. 1.5x in favor of PyTorch.
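For reference, the dtype mismatch is easy to verify: numpy's `randn` produces float64, while `torch.randn` produces float32 unless told otherwise. A minimal check (the `dtype=` keyword form is what recent PyTorch versions accept):

```python
import numpy as np
import torch

# NumPy defaults to double precision; torch.randn defaults to single.
x_np = np.random.randn(4, 4)
x_t = torch.randn(4, 4)
print(x_np.dtype)   # float64
print(x_t.dtype)    # torch.float32

# To compare like for like, create the tensor as float64 explicitly.
x_t64 = torch.randn(4, 4, dtype=torch.float64)
print(x_t64.dtype)  # torch.float64
```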

Running things under a profiler produced the following results:

NUMPY VERSION:

```
Total time: 6.31922 s
File: <ipython-input-13-2060abc16fc5>
Function: as_numpy at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
3                                           def as_numpy():
4         1            3      3.0      0.0      start_time = timeit.default_timer()
5        11            9      0.8      0.0      for counter in range(NREPS):
6
7                                                   # Create random input and output data
8        10        22161   2216.1      0.4          x = np.random.randn(N, D_in)
9        10          336     33.6      0.0          y = np.random.randn(N, D_out)
10
11                                                   # Randomly initialize weights
12        10        34533   3453.3      0.5          w1 = np.random.randn(D_in, H)
13        10          511     51.1      0.0          w2 = np.random.randn(H, D_out)
14
15        10            9      0.9      0.0          learning_rate = 1e-6
16      5010         6445      1.3      0.1          for t in range(500):
17                                                       # Forward pass: compute predicted y
18      5000      2283485    456.7     36.1              h = x.dot(w1)
19      5000       223497     44.7      3.5              h_relu = np.maximum(h, 0)
20      5000        92365     18.5      1.5              y_pred = h_relu.dot(w2)
21
22                                                       # Compute and print loss
23      5000       150941     30.2      2.4              loss = np.square(y_pred - y).sum()
24
25                                                       # Backprop to compute gradients of w1 and w2 with respect to loss
26      5000        28744      5.7      0.5              grad_y_pred = 2.0 * (y_pred - y)
30      5000       260032     52.0      4.1              grad_h[h < 0] = 0
32
33                                                       # Update weights
34      5000      1007904    201.6     15.9              w1 -= learning_rate * grad_w1
35      5000        50394     10.1      0.8              w2 -= learning_rate * grad_w2
36
37         1          142    142.0      0.0      print(timeit.default_timer() - start_time)
```

PYTORCH TENSOR VERSION

```
Total time: 4.39713 s
File: <ipython-input-6-cdd37f3d1dd1>
Function: as_pytorch at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
3                                           def as_pytorch():
4         1            3      3.0      0.0      start_time = timeit.default_timer()
5        11            7      0.6      0.0      for counter in range(NREPS):
6
7                                                   # Create random input and output data
8        10        28274   2827.4      0.6          x = torch.randn(N, D_in).type(dtype)
9        10          457     45.7      0.0          y = torch.randn(N, D_out).type(dtype)
10
11                                                   # Randomly initialize weights
12        10        41973   4197.3      1.0          w1 = torch.randn(D_in, H).type(dtype)
13        10          699     69.9      0.0          w2 = torch.randn(H, D_out).type(dtype)
14
15        10           10      1.0      0.0          learning_rate = 1e-6
16      5010         5371      1.1      0.1          for t in range(500):
17                                                       # Forward pass: compute predicted y
18      5000      1379133    275.8     31.4              h = x.mm(w1)
19      5000        72269     14.5      1.6              h_relu = h.clamp(min=0)
20      5000        77204     15.4      1.8              y_pred = h_relu.mm(w2)
21
22                                                       # Compute and print loss
23      5000       193164     38.6      4.4              loss = (y_pred - y).pow(2).sum()
24
25                                                       # Backprop to compute gradients of w1 and w2 with respect to loss
26      5000        44580      8.9      1.0              grad_y_pred = 2.0 * (y_pred - y)
30      5000       279782     56.0      6.4              grad_h[h < 0] = 0
32
33                                                       # Update weights using gradient descent
34      5000       743668    148.7     16.9              w1 -= learning_rate * grad_w1
35      5000        51751     10.4      1.2              w2 -= learning_rate * grad_w2
36
37         1          196    196.0      0.0      print(timeit.default_timer() - start_time)
```

It seems the matrix multiplications are faster when carried out with Tensors than with numpy arrays.
Is this expected? I thought numpy was highly optimized on CPUs. Is there an explanation for why the Tensor ops are so much faster?
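For what it's worth, the hotspot from the profiles above (the first matmul) can be timed in isolation, outside the training loop. A sketch, with sizes assumed to match the tutorial (N=64, D_in=1000, H=100) and both operands in float64 so the comparison is fair:

```python
import timeit
import numpy as np
import torch

N, D_in, H = 64, 1000, 100
x_np = np.random.randn(N, D_in)
w_np = np.random.randn(D_in, H)

# from_numpy shares the buffer, so the tensors stay float64.
x_t = torch.from_numpy(x_np)
w_t = torch.from_numpy(w_np)

print("numpy:", timeit.timeit(lambda: x_np.dot(w_np), number=5000))
print("torch:", timeit.timeit(lambda: x_t.mm(w_t), number=5000))
```

Both calls compute the same product, so any timing gap comes from the underlying BLAS, not from the math.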


If you install MKL it’s quite close (using Double in pytorch):

```
$ python example_numpy.py
(with openblas) 3.18742799759 seconds used

$ python example_numpy.py
(with MKL) 2.46531510353 seconds used

$ python example_pytorch.py
2.75312805176 seconds used
```