I have installed PyTorch on several different machines, and I am consistently seeing CPU-only PyTorch on OS X outperform PyTorch on the other platforms, even though the other machines have better CPUs.
I was using this post as a point of reference, as the symptoms sound similar. However, I ultimately found that while installing with the iomp5 flag (which now appears to be integrated into the master branch) does indeed produce faster matrix multiplication on the server, overall performance still seems to be slower.
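To make the matrix-multiplication comparison concrete, here is roughly the kind of standalone timing check I used (a minimal sketch; the matrix size and iteration count are arbitrary choices, not from any benchmark suite):

```python
import time
import torch

# Time a large float32 matmul to compare raw MKL/BLAS throughput
# across machines, independent of the full application.
a = torch.randn(2000, 2000)
b = torch.randn(2000, 2000)

torch.mm(a, b)  # warm-up so one-time initialization isn't timed

start = time.time()
for _ in range(10):
    torch.mm(a, b)
elapsed = time.time() - start
print("10 matmuls of 2000x2000: {:.3f}s".format(elapsed))
```

Running this on both machines is a quick way to see whether the gap shows up already at the BLAS level or only in the full training loop.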
My target application is kind of a monster, so my first step in isolating the issue was to try some of the PyTorch examples. For example, I ran the mnist example on two different machines, and it finished 10 epochs almost twice as fast on my MacBook Pro as on the server. Here are the specifications for each machine (very similar to the post linked above):
If you have a Xeon server with 48 hyperthreads (2 sockets, 24 physical cores), you will almost surely have worse perf by default than i7 with 8 cores.
The reason here is that Intel MKL and Intel OpenMP, which PyTorch uses to parallelize its CPU code, try to use all available cores by default.
However, using all cores is not great in many cases, and it's especially not great when you have 24 cores split across 2 sockets and oversubscribed with hyperthreads.
Try to run your python program like this on your Linux machine:
OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 python foo.py
Try different values from 1 to 12, not just 8. Going above 12 will surely hurt performance unless you have very large workloads.
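If you'd rather not restart the process for every value, you can sweep thread counts in-process with `torch.set_num_threads` (a minimal sketch; the matrix size and the candidate thread counts are just illustrative choices):

```python
import time
import torch

# Sweep intra-op thread counts and time a fixed workload at each one.
# torch.set_num_threads controls the number of threads PyTorch uses
# for CPU ops; the sweet spot depends on the machine and workload.
a = torch.randn(1000, 1000)
b = torch.randn(1000, 1000)

for n in (1, 2, 4, 8, 12):
    torch.set_num_threads(n)
    torch.mm(a, b)  # warm-up at this thread count
    start = time.time()
    for _ in range(20):
        torch.mm(a, b)
    print("{:2d} threads: {:.3f}s".format(n, time.time() - start))
```

Note that this only covers PyTorch's own thread pool; setting `OMP_NUM_THREADS`/`MKL_NUM_THREADS` on the command line as above remains the more thorough way to constrain everything.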
Edit: Never mind - MKL_DYNAMIC and OMP_DYNAMIC would not help, so I removed them from my post.
I was wondering about PyTorch internals: if I am using a server with a larger number of cores (64), would you recommend setting the OMP_NESTED flag to true or false?
@smth Thank you for your suggested fix. I had the same problem as the OP, and this largely improved inference speed on the Xeon server that I'm using. However, it is still significantly slower than on a Mac. Are there any other methods you can think of that could narrow the gap? Thanks!