Two conda installs with vastly different pytorch performance

I have two Anaconda Python installs, and the older one runs my network 2-3x faster than the newer one. PyTorch is compiled from source with identical options in both. Both installs run on the same machine and have seemingly the same packages, with slightly different versions. I can't figure out why one of the conda installs is so much more CPU intensive: the CPU runs at 600-700% utilization on the slower install versus 160% on the faster one. I could include the package lists and compile logs, but they are very large.

Does anyone have any ideas why it could behave like this?

I am curious what the reason for the slowdown could be.

Try ldd and see which BLAS they bind to.
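
For example, something along these lines should work (a rough sketch; the exact library names under torch/lib vary between builds):

# locate the torch package, then check what its shared libraries link against
TORCH_DIR=$(python -c "import torch, os; print(os.path.dirname(torch.__file__))")
ldd "$TORCH_DIR"/lib/*.so | grep -iE 'mkl|openblas|libblas'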


Thank you, this helped a lot! The faster version was using OpenBLAS while the slower one used MKL. The new conda uses MKL by default. Once I ran "conda install nomkl" on the MKL version to remove MKL, performance increased 2x and CPU utilization dropped 5x. I wonder why the default MKL version in conda would be so much slower than OpenBLAS?
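
For reference, you can see which BLAS a given conda environment pulled in with something like this (the grep pattern is just an illustration):

conda list | grep -iE 'mkl|openblas|nomkl'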

Do you happen to be using AMD CPUs, by any chance?

Actually, I used a variety of Intel CPUs: three or four Intel i7s (6th, 7th, and 8th gen), with CUDA 8.0, 9.0, and 9.1, cuDNN 7.0 and 7.1, and NVIDIA GTX 1080 Ti GPUs. In all cases the GPUs are severely underutilized when MKL is used. If I switch to OpenBLAS, performance improves.

I know that MKL is supposed to be faster than OpenBLAS, but in every instance I get much slower performance with PyTorch: CPU usage spikes 3-5x and GPU utilization drops 2-3x. This is counterintuitive, and I have spent a lot of time trying to figure it out. Could it be a bug?

Try adjusting the environment variable OMP_NUM_THREADS, which controls the number of threads used by MKL (and usually OpenBLAS) for parallel workloads. The default is the number of virtual CPUs, which is often too high on multi-core, multi-socket systems. Try setting it to half the number of vCPUs, or even fewer.

For example:

OMP_NUM_THREADS=8 python myscript.py


Thank you! After some experimenting, I now get almost the same speeds with MKL. OMP_NUM_THREADS=1 or 2 works best for me; the default setting slows things down a lot. I wonder why OMP_NUM_THREADS is set so high by default? My setup may be specific to my research, but it seems like this could be slowing down other jobs too. I guess that with CUDA and cuDNN, MKL only gets very small chunks of data, and if you split them across many threads, the thread setup overhead kills the job.
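
For anyone tuning this themselves, a simple way to compare settings is a loop along these lines (train_script.py is just a placeholder for your own training script):

for n in 1 2 4 8; do
    OMP_NUM_THREADS=$n python train_script.py
done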

Hi, how do you use ldd to check which BLAS they bind to? What does the command look like?