I have installed PyTorch on several machines, both from source and from conda, and I am getting very different execution times for matrix multiplication. All installs use Python 3.6. However, I can't figure out whether PyTorch is using MKL, OpenBLAS, or some other backend. Right now the macOS install is the fastest, despite that machine having the slowest CPU.
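The closest I've come to checking is listing the shared libraries that the compiled torch extension links against. This is only a sketch (Linux only, since it shells out to ldd; on macOS `otool -L` plays the same role), and the filter keywords are just my guess at what to look for:

```python
# Sketch: print the BLAS/LAPACK-related libraries that torch's compiled
# _C extension links against. Linux only (uses ldd).
import subprocess
import torch._C

so_path = torch._C.__file__  # path to the compiled extension module
out = subprocess.check_output(["ldd", so_path]).decode()
for line in out.splitlines():
    if any(key in line.lower() for key in ("mkl", "blas", "lapack")):
        print(line.strip())
```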
The reason I ran these tests is that I noticed a severe slowdown (~10x) of a multiprocessing RL algorithm I am working on when it runs on the Linux machines.
On the Linux machines torch seems to use only a single thread when doing the multiplication, as opposed to macOS, even though torch.get_num_threads() returns the correct number of threads on each system.
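For completeness, this is how I query and pin the thread count. Whether `OMP_NUM_THREADS` / `MKL_NUM_THREADS` actually apply depends on the build linking against OpenMP/MKL, which is exactly the part I can't verify:

```python
# Sketch: report and pin torch's intra-op thread count. With an MKL/OpenMP
# build, OMP_NUM_THREADS / MKL_NUM_THREADS should be set before the process
# starts (or before the first BLAS call) to reliably take effect.
import torch

print("torch threads:", torch.get_num_threads())
torch.set_num_threads(8)  # request 8 threads explicitly
print("torch threads now:", torch.get_num_threads())
```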
macOS Sierra, CPU: Intel i7-4870HQ (8) @ 2.50GHz, 16GB RAM, GeForce GT 750M. Installed from source.
Allocation: 5.921 Torch Blas: 7.277 Numpy Blas: 7.841 Torch cuBlas: 0.205
Ubuntu 16.10, CPU: Intel i7-4720HQ (8) @ 3.60GHz, 16GB RAM, GeForce GTX 960M. Installed from source.
Allocation: 4.030 Torch Blas: 21.112 Numpy Blas: 21.82 Torch cuBlas: 0.121
CentOS 7.2, CPU: Intel Xeon E5-2640v4 (40) @ 2.40GHz, 16GB RAM, Titan X. Installed both from source and with conda. Also ran the test with Python 3.5 and PyTorch built from source.
Allocation: 4.557 Torch Blas: 19.646 Numpy Blas: 20.155 Torch cuBlas: 0.155
Finally, this is the output of np.__config__.show() on all the machines:
```
openblas_lapack_info:
  NOT AVAILABLE
lapack_opt_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda3/include']
    library_dirs = ['/opt/anaconda3/lib']
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
blas_mkl_info:
    ...
blas_opt_info:
    ...
lapack_mkl_info:
    ...
```
The code I am using:

```python
import time

import numpy
import torch

torch.set_default_tensor_type("torch.FloatTensor")

w = 5000
h = 40000
is_cuda = torch.cuda.is_available()

# Allocate the operands on CPU (and GPU, if available).
start = time.time()
a = torch.rand(w, h)
b = torch.rand(h, w)
a_np = a.numpy()
b_np = b.numpy()
if is_cuda:
    a_cu = a.cuda()
    b_cu = b.cuda()
allocation = time.time()
print("Allocation ", allocation - start)

# CPU matmul through torch's BLAS backend.
c = a.mm(b)
th_blas = time.time()
print("Torch Blas ", th_blas - allocation)

# Same multiply through numpy (MKL, according to np.__config__.show()).
c = a_np.dot(b_np)
np_blas = time.time()
print("Numpy Blas ", np_blas - th_blas)

# GPU matmul through cuBLAS.
if is_cuda:
    c = a_cu.mm(b_cu)
    cu_blas = time.time()
    print("Torch cuBlas ", cu_blas - np_blas)

print("Final", time.time() - start)
```
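One caveat about my own numbers: CUDA kernels launch asynchronously, so timing the GPU multiply with time.time() alone may mostly measure the launch rather than the multiply itself. A sketch of a synchronized measurement, reusing `a_cu`/`b_cu` from the script above (the torch.cuda.synchronize() calls are the only change):

```python
# Sketch: synchronize before and after the matmul so time.time() measures
# the kernel itself rather than just the asynchronous launch.
if is_cuda:
    torch.cuda.synchronize()
    t0 = time.time()
    c = a_cu.mm(b_cu)
    torch.cuda.synchronize()
    print("Torch cuBlas (synced)", time.time() - t0)
```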
edit: For comparison, here are the results of the same script in Lua Torch on the last machine above:
Allocation: 4.426 Torch Blas: 2.777 Torch cuBlas: 0.097
At this point I am more inclined to believe my Linux PyTorch installs are using an unoptimized BLAS fallback. Hoping this isn't Python's overhead…
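If it helps anyone reproduce this, a quick way to test the single-thread hypothesis is to time the same multiply at several thread counts; if the Linux builds really are stuck on a single-threaded fallback BLAS, the timings should barely change:

```python
# Sketch: time one matmul at several intra-op thread counts. An MKL-backed
# build should speed up noticeably with more threads; a single-threaded
# fallback BLAS will not.
import time
import torch

a = torch.rand(5000, 5000)
b = torch.rand(5000, 5000)

for n in (1, 2, 4, 8):
    torch.set_num_threads(n)
    start = time.time()
    a.mm(b)
    print("%d threads: %.3f s" % (n, time.time() - start))
```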