Bad inference performance on some CPUs

I measured prediction performance on several CPUs and got huge differences in prediction times that I don’t really understand.

I am using this Residual Network with 12 hidden layers for prediction:

With PyTorch 1.0 (precompiled, no builds from source) a single prediction takes on average:

  - 0.022 s on an Intel Core i7-4770K @ 4.2 GHz (Windows 10, no VM)
  - 0.1 s on the same i7-4770K (Ubuntu 18.04 VM)
  - 0.038 s on an Intel Xeon X5680 @ 3.33 GHz (Windows Server 2016 Datacenter, no VM)
  - 6.85 s on an AMD Opteron 6136 (Ubuntu 18.04 VM)

I also got similarly slow times on a Xeon X5355 (Ubuntu 18.04 VM).
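For context, per-prediction averages like these can be measured with a loop around the forward pass. This is only a minimal sketch: the tiny `Sequential` model and input shape below are placeholders, not the actual 12-layer residual network from the post.

```python
import time
import torch

# Placeholder model standing in for the 12-layer residual network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)
model.eval()

x = torch.randn(1, 3, 64, 64)  # single input, batch size 1

with torch.no_grad():
    model(x)  # warm-up run so one-time initialization is not timed
    start = time.perf_counter()
    n = 100
    for _ in range(n):
        model(x)
    avg = (time.perf_counter() - start) / n

print(f"average prediction time: {avg:.4f}s")
```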

Now I am trying to figure out the reason for such bad performance in comparison to the Intel i7.
Is it because SSE4.1 or SSE4.2 is not supported?

PyTorch uses MKL-DNN for CPU convolutions. It’s optimized for Haswell and newer architectures (circa 2013+). I’ve never tried it on much older processors.
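One quick way to see which SIMD instruction sets a given CPU advertises is to read the flags from `/proc/cpuinfo` (Linux only; this is a sketch, not anything MKL-DNN exposes itself). The absence of AVX/AVX2 (Sandy Bridge/Haswell era) is a strong hint that the optimized convolution paths aren’t available on that machine.

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported by the kernel (Linux)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
# Print which SIMD generations this CPU supports.
for isa in ("sse4_1", "sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```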

Two suggestions:

  1. Run your program under perf top or perf record to see where the time is spent
  2. Try adjusting OMP_NUM_THREADS: set it to 1 or to the number of unused cores (and values in between). Oversubscription (too many threads) can sometimes be a problem.
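For the second suggestion, the environment variable has to be set before the process starts (e.g. `OMP_NUM_THREADS=1 python predict.py`), but PyTorch also lets you change the intra-op thread count at runtime, which makes it easy to experiment with different values in one script:

```python
import torch

# Runtime equivalent of OMP_NUM_THREADS=1: limit intra-op parallelism
# to a single thread to rule out oversubscription.
torch.set_num_threads(1)
print(torch.get_num_threads())  # -> 1
```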

Hey @colesbury, thanks for your quick tips. They were very helpful. It makes sense to me that some of those CPUs are too old for the optimizations. And OMP_NUM_THREADS=1 improved my program’s performance significantly. Cheers!