Running PyTorch / Caffe2 with multiple cores on mobile

I’ve got a simple model consisting only of convolutions (no activations between them) and I wanted to benchmark it in Caffe2 on my target Android device. I followed https://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
but when I run
./speed_benchmark --init_net=model_for_inference-simplified-init-net.pb --net=model_for_inference-simplified-predict-net.pb --iter=1
it runs on a single thread, so the time is:
milliseconds per iter: 16674.4
compared to 2438.7 ms per iter in TensorFlow, which is able to use 8 cores.

Speed benchmark was built using:
scripts/build_android.sh -DANDROID_ABI=arm64-v8a -DANDROID_TOOLCHAIN=clang -DBUILD_BINARY=ON

On x86, setting OMP_NUM_THREADS=8 helps, but not on ARM. A minimal sketch of how I pass the variable when launching the benchmark from a Python wrapper follows (same binary and model files as above; the wrapper itself is just illustrative).
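
import os
import subprocess

# Pass OMP_NUM_THREADS through the environment so OpenMP-backed
# operators can use 8 threads (this helps on x86, but not on ARM).
env = dict(os.environ, OMP_NUM_THREADS="8")
subprocess.run(
    ["./speed_benchmark",
     "--init_net=model_for_inference-simplified-init-net.pb",
     "--net=model_for_inference-simplified-predict-net.pb",
     "--iter=1"],
    env=env,
    check=True,
)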

As far as I know, the convolutions come from NNPACK on ARM, so you might check their docs.
If you find a good trick, I’d be very interested in it for my libtorch Android port…

Best regards

Thomas

But I have the impression that NNPACK doesn’t get called when I run speed_benchmark. At least that’s what my debug prints and investigation with a debugger on x86 show.

That’s strange. If you compile it yourself, maybe you need to request it explicitly? I haven’t looked at it in detail, but for me (libtorch) the performance without NNPACK was terrible on Android.

Best regards

Thomas

I didn’t know that I needed to specify the engine in the predict net definition.
After running:

for op in predict_net.op:
    if op.type == 'Conv':
        op.engine = 'NNPACK'  # route Conv ops to the NNPACK engine

more cores are used (although CPU usage doesn’t cross 400% on an 8-core device).
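
In case it helps anyone else, the full flow is roughly the following (a sketch from memory; the output file name here is made up, and I load the predict net from the protobuf that the export produced):

from caffe2.proto import caffe2_pb2

# Load the predict net produced by the ONNX -> Caffe2 export.
predict_net = caffe2_pb2.NetDef()
with open("model_for_inference-simplified-predict-net.pb", "rb") as f:
    predict_net.ParseFromString(f.read())

# Route every Conv op to the NNPACK engine, which has a
# multi-threaded implementation on ARM.
for op in predict_net.op:
    if op.type == "Conv":
        op.engine = "NNPACK"

# Save the modified net for speed_benchmark to consume.
with open("predict-net-nnpack.pb", "wb") as f:
    f.write(predict_net.SerializeToString())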