I recently ran the MLPerf benchmark on a few 64-bit ARM servers (e.g., AWS Graviton3, AWS Graviton2, Ampere Altra) and on a Xeon server from AWS (a c6id.8xlarge instance). In particular, I ran the Image Classification benchmark from MLPerf version 2.0 with the ssd-mobilenet model on the COCO dataset with 300x300 images. The benchmark reports its results in “queries per second” (qps) and supports TensorFlow, PyTorch, and ONNX. Surprisingly, PyTorch exhibits very low qps on ARM, while TensorFlow performs as expected. Graviton3 is the most advanced of these ARM-based servers, so I show its results (qps) compared to Xeon:
| Framework | Graviton3 (c7g.8xlarge) | Xeon (c6id.8xlarge) |
|---|---|---|
| PyTorch (1.13.0a0) | 0.9 | 34.7 |
| TensorFlow (2.10.0) | 79.9 | 83.9 |
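To make the gap concrete, here is a small sketch that computes the Xeon-to-Graviton3 throughput ratio for each framework. The numbers are copied from the results above; the dictionary layout and function name are only illustrative.

```python
# Throughput in queries per second (qps), copied from the results above.
qps = {
    "PyTorch 1.13.0a0": {"graviton3": 0.9, "xeon": 34.7},
    "TensorFlow 2.10.0": {"graviton3": 79.9, "xeon": 83.9},
}


def xeon_to_graviton3_ratio(row):
    """Return how many times higher the Xeon qps is than the Graviton3 qps."""
    return row["xeon"] / row["graviton3"]


ratios = {name: xeon_to_graviton3_ratio(row) for name, row in qps.items()}

for name, ratio in ratios.items():
    print(f"{name}: Xeon / Graviton3 = {ratio:.2f}x")
```

With TensorFlow the two machines are within about 5% of each other, while with PyTorch the Xeon is roughly 39 times faster, which is why I suspect a PyTorch-specific problem on ARM rather than a hardware limitation.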
I ran the benchmark in Docker (version 20.10.23) on Ubuntu (20.04.5 LTS). In Docker, I use Python 3.7.13 and PyTorch 1.13.0a0.
Steps to reproduce:
- Download the patch for MLPerf from here: https://github.com/dloghin/arm-cloud-bench/blob/main/mlperf/mlperf-arm64.patch
- Download the pre-processed COCO dataset (please message me for the link, or follow the MLPerf tutorial to generate it). Once you have the archive, run:
tar xf coco-300.tar.bz2
mkdir -p /home/$USER/data
mv coco-300 /home/$USER/data/
- Prepare MLPerf Docker image and run the benchmark:
git clone https://github.com/mlcommons/inference.git
cd inference
git checkout v2.0
git apply < mlperf-arm64.patch
docker build -f Dockerfile.cpu -t mlperf-cpu .
docker run -v /home/$USER/data/coco-300:/data -it mlperf-cpu
- Then, in Docker, run:
./run_local.sh pytorch ssd-mobilenet cpu
TestScenario.SingleStream qps=0.79, mean=1.2648, time=1295.256, queries=1024, tiles=50.0:1.2092,80.0:1.2159,90.0:1.4371,95.0:1.6403,99.0:1.9828,99.9:2.2888
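As a sanity check on the output line above, the following sketch parses it and recomputes the throughput from the query count and elapsed time. The parser is my own helper, not part of MLPerf, and it assumes that `time` is reported in seconds and that each `tiles` entry is a `percentile:latency` pair.

```python
import re

# Result line copied verbatim from the benchmark output above.
line = (
    "TestScenario.SingleStream qps=0.79, mean=1.2648, time=1295.256, "
    "queries=1024, tiles=50.0:1.2092,80.0:1.2159,90.0:1.4371,"
    "95.0:1.6403,99.0:1.9828,99.9:2.2888"
)


def parse_result(text):
    """Split a result line into scenario name, scalar fields, and percentiles."""
    scenario, rest = text.split(" ", 1)
    head, tiles = rest.split(", tiles=")
    # Scalar fields such as qps=0.79 or queries=1024.
    fields = {key: float(val) for key, val in re.findall(r"(\w+)=([\d.]+)", head)}
    # Percentile entries such as 99.0:1.9828 (assumed percentile:latency).
    percentiles = {}
    for item in tiles.split(","):
        pct, latency = item.split(":")
        percentiles[float(pct)] = float(latency)
    return scenario, fields, percentiles


scenario, fields, percentiles = parse_result(line)

# Assumption: "time" is the elapsed wall-clock time in seconds.
derived_qps = fields["queries"] / fields["time"]
print(f"{scenario}: reported qps={fields['qps']}, derived qps={derived_qps:.2f}")
```

The derived value matches the reported qps, so each query really does take more than a second on average in this run.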
I profiled the execution with perf (the Linux tool), and it seems that PyTorch spends 99% of the time in omp_get_num_procs(), which is really strange. Does anyone have any idea why this is happening?
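In case it helps anyone reproduce this, here is a minimal diagnostic sketch that prints the CPU and thread settings visible inside the container. It does not explain the behaviour; it only checks whether the thread configuration looks sane. The `threading_report` helper is my own, and the PyTorch part is skipped if `torch` is not importable.

```python
import os


def threading_report():
    """Collect CPU and thread-count settings visible to this process."""
    report = {
        "os.cpu_count": os.cpu_count(),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        report["usable_cpus"] = len(os.sched_getaffinity(0))
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        report["torch"] = None
    else:
        report["torch"] = torch.__version__
        report["torch.get_num_threads"] = torch.get_num_threads()
        report["torch.get_num_interop_threads"] = torch.get_num_interop_threads()
    return report


for key, value in threading_report().items():
    print(f"{key}: {value}")
```

Running this on both the Graviton3 and the Xeon instance and comparing the output would show whether the two environments see a different number of CPUs or threads.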