PyTorch much slower than NumPy for simple arithmetic

I was wondering if there is a way to limit the overhead of arithmetic in PyTorch, since that would let us code everything in PyTorch instead of only reaching for PyTorch when the gradient of a calculation is needed.

This would give much cleaner code in reinforcement learning, where we often need small calculations on arrays that don’t need to be backpropagated.

import numpy as np
import torch
a = np.array([1.,2.,3.,4.,5.])
%timeit 2*a    #703 ns ± 34.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = torch.from_numpy(a)
%timeit 2*a    #7.99 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
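
To make the use case concrete, here is a hypothetical sketch (the values and names are made up for illustration) of the kind of small, gradient-free calculation I mean, which could either drop to NumPy or stay in PyTorch:

rewards = torch.tensor([1.0, 0.0, 0.5])

# Option 1: drop to NumPy for the small calculation (fast, but needs conversions
# back and forth around the surrounding tensor code).
advantage_np = rewards.numpy() - rewards.numpy().mean()

# Option 2: stay in PyTorch (cleaner next to the autograd code, but every tiny
# op pays the framework's per-call overhead).
advantage = rewards - rewards.mean()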

First, a baseline on my system:

an = np.linspace(1, 5, 5, dtype=np.float32)
%timeit 2*an  # 1.27 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

a = torch.linspace(1, 5, 5)
%timeit 2*a  # 9.25 µs ± 49.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Maybe the problem is that broadcasting is slow in PyTorch. If I do this instead:

b = torch.ones_like(a) * 2
%timeit b*a  # 3.74 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But then if I do:

bn = np.ones_like(an) * 2
%timeit bn*an  # 609 ns ± 5.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So… NumPy benefits from getting rid of broadcasting too. I’m not sure what to conclude here.

What if we don’t allocate new ndarrays/tensors in the inner loop?

c = torch.zeros_like(a)
%timeit torch.mul(b, a, out=c)  # 2.23 µs ± 224 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cn = np.zeros_like(an)
%timeit np.multiply(bn, an, out=cn)  # 866 ns ± 36.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So allocation can’t be the main source of the PyTorch overhead either: even when writing into a preallocated tensor, it still isn’t as fast as naive NumPy.
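
One way to see where the remaining time goes is PyTorch’s built-in autograd profiler. This is just a sketch of how one could look at it (I did not use it for the numbers above):

# Profile many repetitions of the same tiny op; almost all of the reported
# time is per-call dispatch rather than the multiplication itself.
with torch.autograd.profiler.profile() as prof:
    for _ in range(1000):
        torch.mul(b, a, out=c)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))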

For reference, my benchmarks were run on a 2-core MacBook Pro, with Python 3.7.4 installed via Homebrew, numpy 1.17.2 installed via pip (and thus built against OpenBLAS), and torch 1.3.1 installed via pip.

Hi,

These benchmarks are a very good example of the per-call overhead of our framework. All of these operations require essentially no compute, so what you are measuring here is just how long it takes to go all the way down the stack and back up.
Unfortunately, because we support more hardware, more layouts, and features such as autograd, it is very hard to be as lean as NumPy.
Hopefully, for large enough operations, this overhead is small enough not to be a problem in practical neural network workloads.
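
As a rough sketch of that last point (illustrative only, sizes made up): for a large enough tensor, the fixed per-call cost is dwarfed by the arithmetic itself, so the gap with NumPy should mostly disappear.

import timeit
import numpy as np
import torch

for n in (5, 1_000_000):
    an = np.linspace(1, 5, n, dtype=np.float32)
    at = torch.linspace(1, 5, n)
    t_np = timeit.timeit(lambda: 2 * an, number=1000)
    t_torch = timeit.timeit(lambda: 2 * at, number=1000)
    # At n=5 the ratio is dominated by per-call overhead; at n=1_000_000 both
    # libraries spend their time on the multiply itself, so the ratio should be
    # much closer to 1.
    print(f"n={n}: numpy {t_np:.4f}s, torch {t_torch:.4f}s per 1000 calls")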