I installed PyTorch on several machines, both from source and from conda, and I am getting different execution times for matrix multiplication. All installs use Anaconda 4.3.0 and Python 3.6. However, I can't figure out whether PyTorch is using MKL, OpenBLAS, or some other backend. Right now the macOS install is the fastest, despite that machine having the slowest CPU.
The reason I ran these tests is that I noticed a severe slowdown (~10x) of a multiprocessing RL algorithm I am working on when executed on the Linux machines. On the Linux machines torch seems to use only a single thread for the multiplication, as opposed to macOS, even though torch.get_num_threads() returns the correct number of threads on each system.
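A quick way to sanity-check the threading behaviour is to time the same multiplication with the thread count forced to different values. This is only a minimal sketch using torch.get_num_threads() / torch.set_num_threads(); the matrix sizes are arbitrary:

import time
import torch

print("torch reports", torch.get_num_threads(), "threads")

a = torch.rand(5000, 5000)
b = torch.rand(5000, 5000)

# Time the same matmul single-threaded and with all reported threads;
# if both timings come out the same, the BLAS backend is ignoring the setting.
for n in (1, torch.get_num_threads()):
    torch.set_num_threads(n)
    start = time.time()
    a.mm(b)
    print("%d thread(s): %.3f s" % (n, time.time() - start))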
### Results (all times in seconds)
macOS Sierra, CPU: Intel i7-4870HQ (8) @ 2.50GHz, 16GB RAM, GeForce GT 750M. Installed from source.
Allocation: 5.921
Torch Blas: 7.277
Numpy Blas: 7.841
Torch cuBlas: 0.205
Ubuntu 16.10, CPU: Intel i7-4720HQ (8) @ 3.60GHz, 16GB RAM, GeForce GTX 960M. Installed from source.
Allocation: 4.030
Torch Blas: 21.112
Numpy Blas: 21.82
Torch cuBlas: 0.121
CentOS 7.2, CPU: Intel Xeon E5-2640 v4 (40) @ 2.40GHz, 16GB RAM, Titan X. Installed both from source and with conda. Also ran the test with Python 3.5 and PyTorch built from source.
Allocation: 4.557
Torch Blas: 19.646
Numpy Blas: 20.155
Torch cuBlas: 0.155
Finally, this is the output of np.__config__.show() on all the machines:
openblas_lapack_info:
    NOT AVAILABLE
lapack_opt_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda3/include']
    library_dirs = ['/opt/anaconda3/lib']
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
blas_mkl_info:
    ...
blas_opt_info:
    ...
lapack_mkl_info:
    ...
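Note that numpy's config says nothing about what torch itself links against. On the Linux boxes, one way to check is to run ldd over torch's compiled extension. A rough, Linux-only sketch; it assumes the extension module is torch._C and that the BLAS is dynamically linked (if it is statically linked, nothing will show up):

import subprocess
import torch

# List the shared libraries the compiled torch extension depends on;
# MKL / OpenBLAS / OpenMP should appear here if dynamically linked.
ext = torch._C.__file__
out = subprocess.check_output(["ldd", ext]).decode()
for line in out.splitlines():
    if any(s in line for s in ("mkl", "blas", "omp")):
        print(line.strip())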
The code I am using:
import time
import torch
import numpy

torch.set_default_tensor_type("torch.FloatTensor")

w = 5000
h = 40000
is_cuda = torch.cuda.is_available()

start = time.time()
a = torch.rand(w, h)
b = torch.rand(h, w)
a_np = a.numpy()  # zero-copy views of the torch tensors
b_np = b.numpy()
if is_cuda:
    a_cu = a.cuda()
    b_cu = b.cuda()
allocation = time.time()
print("Allocation ", allocation - start)

c = a.mm(b)
th_blas = time.time()
print("Torch Blas ", th_blas - allocation)

c = a_np.dot(b_np)
np_blas = time.time()
print("Numpy Blas ", np_blas - th_blas)

if is_cuda:
    # Note: CUDA kernels launch asynchronously, so without a
    # torch.cuda.synchronize() this mostly measures launch overhead
    # rather than the multiplication itself.
    c = a_cu.mm(b_cu)
    cu_blas = time.time()
    print("Torch cuBlas ", cu_blas - np_blas)

print("Final", time.time() - start)
edit: For comparison, here are the results of the same script under Lua Torch on the last machine from above:
Allocation: 4.426
Torch Blas: 2.777
Torch cuBlas: 0.097
At this point I am more inclined to believe my Linux PyTorch installs are using an unoptimized BLAS fallback. Hoping this isn't Python's overhead…
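One more diagnostic that would settle the MKL question: MKL can log every BLAS call it services when MKL_VERBOSE=1 is set in the environment (a standard MKL feature, nothing PyTorch-specific). The variable has to be set before MKL is loaded, so either export it in the shell or set it before importing torch:

import os
os.environ["MKL_VERBOSE"] = "1"  # must be set before MKL initializes

import torch

a = torch.rand(2000, 2000)
b = torch.rand(2000, 2000)
# If torch routes through MKL, this prints an "MKL_VERBOSE SGEMM(...)"
# line with the call parameters and timing; silence means no MKL.
a.mm(b)

The same trick works for the numpy path, so it can also tell whether numpy and torch are actually hitting the same library.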