PyTorch poor performance when multiplying matrices in comparison with NumPy, numexpr and Numba

I’m benchmarking PyTorch on the GPU (using openblas) against NumPy on CPU, numexpr on CPU, Numba on CPU and Numba on GPU. When comparing a*b I get bad performance with PyTorch.

import numpy as np
import numexpr as ne
import torch
from numba import vectorize

size_combinations = [
    (100, 100),
    (1000, 1000),
    (10000, 10000),
    (100000, 10000)
]

def factors_int(s1=100, s2=100):
    a = np.random.randint(1, 5, (s1, s2), dtype=np.int16)
    b = np.random.randint(1, 10, (s1, s2), dtype=np.int16)
    return a, b

def multiply(a, b):            # NumPy, CPU
    return a * b

def ne_multiply(a, b):         # numexpr, CPU
    return ne.evaluate("a*b")

@vectorize(["int16(int16, int16)"], target="cpu")
def multicpu(a, b):            # Numba, CPU
    return a * b

@vectorize(["int16(int16, int16)"], target="cuda")
def multicuda(a, b):           # Numba, CUDA
    return a * b

def pt_multiply(a, b):         # PyTorch, CUDA
    at = torch.as_tensor([a]).cuda()
    bt = torch.as_tensor([b]).cuda()
    return at * bt

for s1, s2 in size_combinations:
    a, b = factors_int(s1, s2)
    r1 = %timeit -o multiply(a,b)
    r2 = %timeit -o ne_multiply(a,b)
    r3 = %timeit -o multicpu(a,b)
    r4 = %timeit -o multicuda(a,b)
    r5 = %timeit -o pt_multiply(a,b)

These are the results. For each size, the five timings are, in order: multiply (NumPy), ne_multiply (numexpr), multicpu (Numba CPU), multicuda (Numba CUDA) and pt_multiply (PyTorch):

(100, 100):
2.09 µs ± 8.28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
456 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.92 µs ± 4.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.46 ms ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.02 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(1000, 1000):
215 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
580 µs ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
216 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.45 ms ± 78.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
167 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(10000, 10000):
89 ms ± 443 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.5 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
87.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
136 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.4 s ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(100000, 10000):
896 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
176 ms ± 8.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
917 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.36 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2min 56s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PyTorch is consistently worse by a large margin, and the numbers are far from Numba even on the GPU. Is this behavior expected? Is there something wrong in the code?

Thanks


Your benchmarking is wrong:

  • You’re including the time it takes to copy the data into a tensor
  • You’re including the time it takes to copy the tensor to the GPU
  • Many CUDA calls are asynchronous and you’re not taking that into account
  • You’re using int16, which isn’t a common use case. I’m not sure how that affects perf, but it’s not a common target in PyTorch.

EDIT: It took me a while to realize that “multiplying matrices” meant element-wise multiplication, not matrix multiplication.

In particular, torch.as_tensor([a]) forces a slow copy because you wrap the NumPy array in a Python list.

But in general, you’re almost entirely measuring copying time here.
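For reference, a minimal sketch (not from the original post) of how one might time just the multiply: copy the data to the GPU once, outside the timed function, and call torch.cuda.synchronize() so the asynchronous kernel is finished before the clock stops:

import numpy as np
import torch

a = np.random.randint(1, 5, (1000, 1000), dtype=np.int16)
b = np.random.randint(1, 10, (1000, 1000), dtype=np.int16)

# Copy once, outside the timed region; note: no wrapping list.
at = torch.as_tensor(a).cuda()
bt = torch.as_tensor(b).cuda()

def pt_multiply_gpu(at, bt):
    r = at * bt
    torch.cuda.synchronize()  # wait for the asynchronous CUDA kernel to finish
    return r

%timeit -o pt_multiply_gpu(at, bt)

With the copies moved out of the timed function and an explicit synchronize, the measurement reflects the kernel rather than the transfers.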

Thanks for the quick answer 🙂

  1. You’re including the time it takes to copy the data into a tensor

I wanted to avoid this. The documentation says: “If you have a numpy array and want to avoid a copy, use torch.as_tensor()”. So what is wrong here?

  2. You’re including the time it takes to copy the tensor to the GPU

This is expected; with Numba on the GPU I’m also measuring that time.

  3. Many CUDA calls are asynchronous and you’re not taking that into account

Can you elaborate? How is that not happening in Numba? Why are these calls needed?

  4. You’re using int16, which isn’t a common use case. I’m not sure how that affects perf, but it’s not a common target in PyTorch.

Fair enough; I’m also running experiments with float32 and other operations (multiple matrices, exponentials, etc.). The code (not finished) can be found here. I posted the partial results because I was seeing a massive difference between Numba GPU and PyTorch, and I was surprised.

I wanted to avoid this. The documentation says: “If you have a numpy array and want to avoid a copy, use torch.as_tensor()”. So what is wrong here?

As I wrote above, torch.as_tensor([a]) forces a slow copy because you wrap the NumPy array in a Python list. It’s not the same as torch.as_tensor(a): type(a) is a NumPy ndarray; type([a]) is a Python list.
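For illustration, a small hypothetical check (not from the thread) of the difference:

import numpy as np
import torch

a = np.ones((3, 3), dtype=np.int16)
t1 = torch.as_tensor(a)    # no copy: shares memory with the ndarray
t2 = torch.as_tensor([a])  # list input: forces a copy and adds a leading dim
print(t2.shape)                          # torch.Size([1, 3, 3])
print(t1.data_ptr() == a.ctypes.data)    # True: same underlying buffer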

I’m not sure what your use case is. If you’re doing an element-wise multiplication of two arrays only once, it never makes sense to copy them to the GPU and back. Modern CPUs can multiply integers and floating-point numbers faster than they can copy them to and from RAM (or the GPU), so you’re going to be primarily measuring the time it takes to copy. If you want to measure copy time, just measure copy time.
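If copy time is what you want, a sketch along these lines (hypothetical helper name; synchronizing so the transfer is fully counted) would isolate just the host-to-device copy:

def copy_to_gpu(a):
    t = torch.as_tensor(a).cuda()  # host-to-device transfer only
    torch.cuda.synchronize()       # make sure the transfer has completed
    return t

%timeit -o copy_to_gpu(a)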

Regarding asynchronous execution, see:

  • CUDA semantics — PyTorch 2.1 documentation
  • 1. Preface — CUDA C++ Best Practices Guide 12.3 documentation