Very low performance on 64-bit ARM servers compared to Xeon servers (CPU-only)

I recently ran the MLPerf benchmark on a few 64-bit ARM servers (AWS Graviton3, AWS Graviton2, Ampere Altra) and on a Xeon server from AWS (a c6id.8xlarge instance). In particular, I ran the Image Classification benchmark from MLPerf version 2.0 with the ssd-mobilenet model on the COCO dataset with 300x300 images. The benchmark reports its results in queries per second (qps) and supports TensorFlow, PyTorch, and ONNX. Surprisingly, PyTorch shows very low qps on ARM, while TensorFlow performs as expected. Graviton3 is the most advanced of the ARM servers I tested, so here are its results (qps) compared to Xeon:

   Framework (version)     Graviton3 (c7g.8xlarge)    Xeon (c6id.8xlarge)
   PyTorch (1.13.0a0)        0.9                       34.7
   TensorFlow (2.10.0)      79.9                       83.9

I ran the benchmark in Docker (version 20.10.23) on Ubuntu 20.04.5 LTS. Inside the container, I use Python 3.7.13 and PyTorch 1.13.0a0.

Steps to reproduce:

  • Extract the COCO dataset archive and move it to the data directory:

tar xf coco-300.tar.bz2
mkdir -p /home/$USER/data
mv coco-300 /home/$USER/data/

  • Prepare the MLPerf Docker image and start a container:

git clone https://github.com/mlcommons/inference.git
cd inference
git checkout v2.0
git apply < mlperf-arm64.patch
docker build -f Dockerfile.cpu -t mlperf-cpu .
docker run -v /home/$USER/data/coco-300:/data -it mlperf-cpu

  • Then, in Docker, run:

cd /tmp/inference/vision/classification_and_detection
export DATA_DIR=/data/coco-300
./run_local.sh pytorch ssd-mobilenet cpu

TestScenario.SingleStream qps=0.79, mean=1.2648, time=1295.256, queries=1024, tiles=50.0:1.2092,80.0:1.2159,90.0:1.4371,95.0:1.6403,99.0:1.9828,99.9:2.2888
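
As a quick sanity check on the result line above: in the SingleStream scenario queries are issued one at a time, so the reported qps should be close to both queries/time and 1/mean. A minimal sketch, assuming the line has been copied into a shell variable (the variable names are only illustrative):

```shell
# Result line copied from the run above (tiles field omitted for brevity).
line='TestScenario.SingleStream qps=0.79, mean=1.2648, time=1295.256, queries=1024'

# Pull the individual fields out of the line with sed.
mean=$(printf '%s\n' "$line" | sed -n 's/.*mean=\([0-9.]*\).*/\1/p')
time_s=$(printf '%s\n' "$line" | sed -n 's/.* time=\([0-9.]*\).*/\1/p')
queries=$(printf '%s\n' "$line" | sed -n 's/.*queries=\([0-9]*\).*/\1/p')

# Recompute throughput two ways; LC_ALL=C keeps "." as the decimal separator.
summary=$(LC_ALL=C awk -v q="$queries" -v t="$time_s" -v m="$mean" \
    'BEGIN { printf "queries/time = %.2f qps, 1/mean = %.2f qps", q / t, 1 / m }')
echo "$summary"
```

Both values come out at about 0.79 qps, which matches the reported figure: each inference takes roughly 1.26 s.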

I profiled the execution with perf (the Linux profiling tool), and it seems that PyTorch spends 99% of the time in omp_get_num_procs(), which is really strange. Does anyone have any idea why this is happening?
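
For anyone who wants to collect a comparable profile, here is a minimal sketch. It assumes perf is installed and usable inside the container; profile_benchmark is a hypothetical helper name, and these are not necessarily the exact perf options I used:

```shell
# Hypothetical helper: record a call-graph profile of a command, then print
# the start of a per-symbol report.
profile_benchmark() {
    # $1: output file for the perf data; remaining args: command to profile
    data_file=$1
    shift
    perf record -g -o "$data_file" -- "$@" &&
        perf report -i "$data_file" --stdio --sort symbol | head -n 40
}
```

It could be called as `profile_benchmark /tmp/perf.data ./run_local.sh pytorch ssd-mobilenet cpu` from the classification_and_detection directory.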

Thank you.

Did you try playing around with, e.g., OMP_NUM_THREADS to change the OpenMP threading behavior?

Yes, I tried OMP_NUM_THREADS set to 8, 16, and 32 (and also left it unset). The results are the same: low throughput (below 1 qps).
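
A sweep like this can be made repeatable with a small loop. This is only a sketch: sweep_threads is a hypothetical helper, and it assumes the benchmark prints a result line containing `qps=...` as in the output above:

```shell
# Hypothetical helper: rerun a benchmark command under several
# OMP_NUM_THREADS values and print the last reported qps for each.
sweep_threads() {
    # $1: benchmark command as a single quoted string; remaining args: thread counts
    bench_cmd=$1
    shift
    for n in "$@"; do
        # Run with the given thread count and keep only the last "qps=..." match.
        qps=$(OMP_NUM_THREADS=$n sh -c "$bench_cmd" 2>&1 | grep -o 'qps=[0-9.]*' | tail -n 1)
        echo "OMP_NUM_THREADS=$n ${qps:-qps=<not found>}"
    done
}
```

Inside the container, it could be invoked as `sweep_threads './run_local.sh pytorch ssd-mobilenet cpu' 8 16 32`.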