PyTorch Tensor slow on small matrix operations (large overhead?)

For small matrix operations on the CPU (e.g., on 3x3 matrices), NumPy seems to be approximately 15x faster than PyTorch. This is relevant for operations in the data loader or for (geometric) transformations, where I could achieve great speedups by using NumPy arrays instead of tensors.

Here is a small performance test which just adds a constant to a 3x3 matrix:

>python -mtimeit -s"import numpy as np; test = np.random.rand(3,3)" "test + 1"
500000 loops, best of 5: 765 nsec per loop

> python -mtimeit -s "import torch; test = torch.rand((3, 3))" "test + 1"
20000 loops, best of 5: 11.2 usec per loop

I assume that the per-call overhead for PyTorch tensors is much higher than for NumPy arrays. With larger arrays and tensors, the overhead becomes less relevant, and I have seen contributions in this forum and on GitHub where the performance of the two seems to be comparable.
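
Not from my original test, just a quick sketch to see where the per-call overhead stops mattering (the sizes are chosen arbitrarily):

import timeit

import numpy as np
import torch

# Per-call time of "x + 1" for growing square matrices: the fixed per-call
# overhead dominates for tiny inputs and washes out for large ones.
for n in (3, 30, 300, 1000):
    array = np.random.rand(n, n)
    tensor = torch.rand((n, n))
    t_np = timeit.timeit(lambda: array + 1, number=1000) / 1000
    t_pt = timeit.timeit(lambda: tensor + 1, number=1000) / 1000
    print('n={0:4d}: NumPy {1:9.2f} us, PyTorch {2:9.2f} us'.format(
        n, t_np * 1E6, t_pt * 1E6))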

I am working with PyTorch 1.3, NumPy 1.16.4 and Python 3.7.4 on a 64-bit Windows machine.

(Btw: I found some articles in this forum comparing the speed of specific operators and discussing the processing speed of large matrices, but none about the overhead of the tensor and array classes themselves. If I have missed an article, please let me know!)

I would assume PyTorch might have a larger overhead, but I’m getting similar numbers for your test:

%timeit "import numpy as np; test = np.random.rand(3,3)" "test + 1"
> 6.31 ns ± 0.188 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

%timeit "import torch; test = torch.rand((3, 3))" "test + 1"
> 6.12 ns ± 0.0184 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

for PyTorch 1.5.0.dev20200122.

Did you get your results on a Linux/macOS platform?

We tested the nightly Windows build 1.5.0.dev20200128, but the PyTorch tensor version still seems to be much slower than the NumPy array version. Could this be a platform-dependent problem?

Is there anyone who could run the small performance tests on a Windows platform as well?

I used Ubuntu 18.04. Sorry for not mentioning it.

I’m using PyTorch 1.4.0, NumPy 1.16.3 and Python 3.7.3 on Windows 10 64-bit.

>python -mtimeit -s"import numpy as np; test = np.random.rand(3,3)" "test + 1"
500000 loops, best of 5: 713 nsec per loop
>python -mtimeit -s "import torch; test = torch.rand((3, 3))" "test + 1"
50000 loops, best of 5: 5.14 usec per loop

Results on Windows 10, Python 3.7.4 (Anaconda), NumPy 1.16.5, PyTorch 1.4.0:

python -mtimeit -s "import torch; test = torch.rand((3, 3))" "test + 1"
20000 loops, best of 5: 9.86 usec per loop

python -mtimeit -s "import numpy as np; test = np.random.rand(3,3)" "test + 1"
500000 loops, best of 5: 834 nsec per loop

I guess the reason is that some SIMD code paths are not enabled when building with MSVC.
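
As a quick sanity check rather than a definitive diagnosis, recent PyTorch versions can print their build configuration, which lists the compiler and the detected CPU capability:

import torch

# Prints compiler, build flags and CPU capability for this PyTorch build
print(torch.__config__.show())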

I’m not sure why, but IPython’s %timeit command returns the very low timings above; most likely it only timed the two adjacent string literals rather than the statements inside them, which would explain the nanosecond results. Running the python -mtimeit command directly in the terminal returns:

python -mtimeit -s"import numpy as np; test = np.random.rand(3,3)" "test + 1"
500000 loops, best of 5: 894 nsec per loop

python -mtimeit -s "import torch; test = torch.rand((3, 3))" "test + 1"
50000 loops, best of 5: 4.9 usec per loop

so comparable to what others are seeing.
I’m not sure which command to trust more, and maybe someone knows what’s the best way to time operations on the CPU.
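
For what it’s worth, newer PyTorch releases ship torch.utils.benchmark, which handles warmup and reports robust statistics; a minimal sketch, assuming a version that includes the module:

import torch.utils.benchmark as benchmark

# Timer runs the statement repeatedly and reports timing statistics
timer = benchmark.Timer(
    stmt='test + 1',
    setup='import torch; test = torch.rand((3, 3))',
)
print(timer.timeit(100000))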

I would trust the terminal python -mtimeit results, since they reflect my experience in my training framework: by replacing small PyTorch matrix operations with NumPy operations, I could achieve a great speed-up. (This mainly affects data loading and augmentation, which is not vectorizable in my case; the large training vector operations are of course carried out on the GPU.)
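
For illustration, a minimal sketch of that pattern (the dataset and the transforms are hypothetical): keep the many small per-sample operations in NumPy and convert to a tensor only once, at the end.

import numpy as np
import torch
from torch.utils.data import Dataset

class AugmentedDataset(Dataset):
    # Hypothetical dataset applying small 3x3 geometric transforms per sample
    def __init__(self, points, transforms):
        self.points = points          # list of (3,) NumPy vectors
        self.transforms = transforms  # list of (3, 3) NumPy matrices

    def __len__(self):
        return len(self.points)

    def __getitem__(self, idx):
        point = self.points[idx]
        # Many tiny matrix ops: cheaper as NumPy calls than as tensor calls
        for transform in self.transforms:
            point = transform @ point
        # Pay the tensor-creation overhead only once per sample
        return torch.from_numpy(np.ascontiguousarray(point))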

@peterjc123: I do not think that this has anything to do with SIMD instructions, since both of the above tests run on a single CPU core, as my task monitor indicates. Additionally, re-running the tests with a 1x1 matrix yields approximately the same results.
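
For example, the 1x1 variants of the commands above:

>python -mtimeit -s "import numpy as np; test = np.random.rand(1, 1)" "test + 1"
>python -mtimeit -s "import torch; test = torch.rand((1, 1))" "test + 1"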

Here is an alternative timing measurement that delivers similar results:

import time

import torch

iterations = 10000

# Build a 1x1 tensor and an identical NumPy array from it
test_tensor = torch.zeros((1, 1))
test_array = test_tensor.numpy().copy()

# Time the tensor addition (perf_counter is preferred over time.time
# for interval measurements, especially on Windows)
time_start = time.perf_counter()
for i in range(iterations):
    test_tensor = test_tensor + 1
t_torch = (time.perf_counter() - time_start) / iterations

# Time the equivalent NumPy array addition
time_start = time.perf_counter()
for i in range(iterations):
    test_array = test_array + 1
t_numpy = (time.perf_counter() - time_start) / iterations

print('Time elapsed (PyTorch tensor): {0:.3f} µs'.format(t_torch * 1E6))
print('Time elapsed (NumPy array): {0:.3f} µs'.format(t_numpy * 1E6))

Time elapsed (PyTorch tensor): 9.798 µs
Time elapsed (NumPy array): 0.800 µs
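
If the per-call overhead itself cannot be reduced, one common workaround (a sketch, assuming your small matrices can be stacked) is to batch them into a single tensor, so the dispatch overhead is paid once per batch instead of once per matrix:

import time

import torch

iterations = 10000

# Stack 10000 3x3 matrices into a single (10000, 3, 3) tensor ...
batch = torch.rand((iterations, 3, 3))

# ... so that one vectorized call replaces 10000 tiny ones
time_start = time.perf_counter()
batch = batch + 1
t_batched = (time.perf_counter() - time_start) / iterations
print('Amortized time per 3x3 matrix: {0:.3f} µs'.format(t_batched * 1E6))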