I have installed PyTorch on several different machines, and I am consistently seeing CPU-only PyTorch on OS X outperform PyTorch on the other platforms, even though the other machines have better CPUs.
I was using this post as a point of reference, as the symptoms sound similar. However, I ultimately found that while installing with the iomp5 flag (which now appears to be integrated into the master branch) does indeed produce faster matrix multiplication on the server, overall performance still seems to be slower.
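To make the matrix-multiplication comparison concrete, here is roughly the kind of standalone timing check I used (a minimal sketch; the matrix size and iteration count are arbitrary choices, not from any benchmark suite):

```python
import time
import torch

# Time a large float32 matmul to compare raw MKL/BLAS throughput
# across machines, independent of the full application.
a = torch.randn(2000, 2000)
b = torch.randn(2000, 2000)

torch.mm(a, b)  # warm-up so one-time initialization isn't timed

start = time.time()
for _ in range(10):
    torch.mm(a, b)
elapsed = time.time() - start
print("10 matmuls of 2000x2000: {:.3f}s".format(elapsed))
```

Running this on both machines is a quick way to see whether the gap shows up already at the BLAS level or only in the full training loop.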
My target application is kind of a monster, so my first step in isolating the issue was to try some of the PyTorch examples. For example, I ran the mnist example on two different machines, and it finished 10 epochs almost twice as fast on my MacBook Pro as on the server. Here are the specifications for each machine (very similar to the post linked above):
If you have a Xeon server with 48 hyperthreads (2 sockets, 24 physical cores), you will almost surely have worse perf by default than i7 with 8 cores.
The reason here is that Intel MKL and Intel OpenMP, which PyTorch uses to parallelize its CPU code, try to use all available cores by default.
However, using all cores is not great in many cases, and it's especially not great when you have 24 cores split across 2 sockets and oversubscribed with hyperthreads.
Try to run your python program like this on your Linux machine:
OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 python foo.py
Try different values from 1 to 12, not just 8. Going above 12 will surely hurt performance unless you have very large workloads.
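If you'd rather not restart the process for every value, you can sweep thread counts in-process with `torch.set_num_threads` (a minimal sketch; the matrix size and the candidate thread counts are just illustrative choices):

```python
import time
import torch

# Sweep intra-op thread counts and time a fixed workload at each one.
# torch.set_num_threads controls the number of threads PyTorch uses
# for CPU ops; the sweet spot depends on the machine and workload.
a = torch.randn(1000, 1000)
b = torch.randn(1000, 1000)

for n in (1, 2, 4, 8, 12):
    torch.set_num_threads(n)
    torch.mm(a, b)  # warm-up at this thread count
    start = time.time()
    for _ in range(20):
        torch.mm(a, b)
    print("{:2d} threads: {:.3f}s".format(n, time.time() - start))
```

Note that this only covers PyTorch's own thread pool; setting `OMP_NUM_THREADS`/`MKL_NUM_THREADS` on the command line as above remains the more thorough way to constrain everything.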
Edit: Never mind - MKL_DYNAMIC and OMP_DYNAMIC would not help, so I removed them from my post.
I was wondering about PyTorch internals: if I am using a server with a larger number of cores (64), would you recommend setting the OMP_NESTED flag to true or false?
@smth Thank you for your suggested fix. I had the same problem as the OP, and this largely improved inference speed on the Xeon server that I'm using. However, it is still significantly slower than on a Mac. Are there any other methods you can think of that could narrow the gap? Thanks!