I’m benchmarking PyTorch on the GPU (built against OpenBLAS) against NumPy (CPU), numexpr (CPU), Numba (CPU) and Numba (CUDA). When comparing an element-wise a*b I get poor performance from PyTorch.
import numpy as np
import numexpr as ne
import torch
from numba import vectorize

size_combinations = [
    (100, 100),
    (1000, 1000),
    (10000, 10000),
    (100000, 10000),
]

def factors_int(s1=100, s2=100):
    a = np.random.randint(1, 5, (s1, s2), dtype=np.int16)
    b = np.random.randint(1, 10, (s1, s2), dtype=np.int16)
    return a, b

def multiply(a, b):  # NumPy on the CPU
    return a * b

def ne_multiply(a, b):  # numexpr on the CPU
    return ne.evaluate("a*b")

@vectorize(["int16(int16, int16)"], target="cpu")
def multicpu(a, b):  # Numba ufunc on the CPU
    return a * b

@vectorize(["int16(int16, int16)"], target="cuda")
def multicuda(a, b):  # Numba ufunc on the GPU
    return a * b

def pt_multiply(a, b):  # PyTorch on the GPU
    # note: [a] wraps the array in a list, adding a leading dimension and
    # forcing a copy; the host-to-device transfer happens inside the timed call
    at = torch.as_tensor([a]).cuda()
    bt = torch.as_tensor([b]).cuda()
    return at * bt

for s1, s2 in size_combinations:
    a, b = factors_int(s1, s2)
    r1 = %timeit -o multiply(a, b)
    r2 = %timeit -o ne_multiply(a, b)
    r3 = %timeit -o multicpu(a, b)
    r4 = %timeit -o multicuda(a, b)
    r5 = %timeit -o pt_multiply(a, b)
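A guess about the PyTorch timing: pt_multiply copies both arrays to the GPU on every call and returns without synchronizing, so %timeit may be dominated by transfer and launch overhead rather than the multiply itself. A minimal sketch of a variant that copies once and synchronizes (pt_multiply_device, at and bt are names I made up for illustration):

def pt_multiply_device(at, bt):
    # at and bt are assumed to already live on the GPU
    ct = at * bt
    torch.cuda.synchronize()  # CUDA kernels launch asynchronously; wait for completion
    return ct

at = torch.as_tensor(a).cuda()  # one-time host-to-device copy, outside the timed call
bt = torch.as_tensor(b).cuda()
r5b = %timeit -o pt_multiply_device(at, bt)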
These are the results (each cell is mean ± std. dev. of 7 runs, as reported by %timeit; loop counts omitted):

size              numpy (r1)          numexpr (r2)        numba CPU (r3)      numba CUDA (r4)     pytorch GPU (r5)
(100, 100)        2.09 µs ± 8.28 ns   456 µs ± 10.4 µs    1.92 µs ± 4.24 ns   1.46 ms ± 27.7 µs   2.02 ms ± 38 µs
(1000, 1000)      215 µs ± 427 ns     580 µs ± 13.9 µs    216 µs ± 2.03 µs    4.45 ms ± 78.8 µs   167 ms ± 1.31 ms
(10000, 10000)    89 ms ± 443 µs      21.5 ms ± 283 µs    87.2 ms ± 412 µs    136 ms ± 3.2 ms     16.4 s ± 21.2 ms
(100000, 10000)   896 ms ± 3.76 ms    176 ms ± 8.92 ms    917 ms ± 4.45 ms    1.36 s ± 37.9 ms    2min 56s ± 384 ms
PyTorch is consistently slower by a large margin, and its numbers are also far from Numba's on the GPU. Is this behavior expected? Is there something wrong in the code?
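In case the measurement itself is the problem, here is how I would time just the CUDA kernel with PyTorch's events (an untested sketch; shapes and names are only for illustration):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
at = torch.randint(1, 5, (10000, 10000), dtype=torch.int16, device="cuda")
bt = torch.randint(1, 10, (10000, 10000), dtype=torch.int16, device="cuda")
start.record()            # mark the start on the current CUDA stream
ct = at * bt
end.record()              # mark the end on the same stream
torch.cuda.synchronize()  # wait until both events have completed
print(start.elapsed_time(end), "ms")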
Thanks