I’ve got a simple model consisting only of convolutions (no activations between them, even) and I wanted to benchmark it in Caffe2 on my target Android device. I followed https://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
but when I run
./speed_benchmark --init_net=model_for_inference-simplified-init-net.pb --net=model_for_inference-simplified-predict-net.pb --iter=1
it runs on a single thread, so the time is:
milliseconds per iter: 16674.4
compared to 2438.7 in TensorFlow, which is able to use 8 cores.
The speed benchmark was built with:
scripts/build_android.sh -DANDROID_ABI=arm64-v8a -DANDROID_TOOLCHAIN=clang -DBUILD_BINARY=ON
On x86, setting OMP_NUM_THREADS=8 helps, but not on ARM.
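For reference, a typical invocation with the thread count pinned (the binary and .pb paths are the ones from the commands above; OMP_NUM_THREADS is the standard OpenMP environment variable, which only takes effect if the conv path actually goes through OpenMP):

```shell
OMP_NUM_THREADS=8 ./speed_benchmark \
    --init_net=model_for_inference-simplified-init-net.pb \
    --net=model_for_inference-simplified-predict-net.pb \
    --iter=10
```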
As far as I know, the convolutions come from NNPACK on ARM, so you might check its docs.
If you find a good trick, I’d be very interested for my libtorch Android port…
But I have the impression that NNPACK doesn’t get called when I run speed_benchmark. At least that’s what my debug prints and a debugger session on x86 show.
That’s strange; if you compile it yourself, maybe you need to request it? I haven’t looked at it in detail, but for me (libtorch) the performance without NNPACK was terrible on Android.
I didn’t know that I needed to set the engine in the predict net definition.
After running:
for op in predict_net.op:
    if op.type == 'Conv':
        op.engine = 'NNPACK'
more cores are used (although usage doesn’t go above 400% on an 8-core device).
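For anyone else landing here, the rewrite above can be packaged as a small helper. This is only a sketch of the logic: the OperatorDef/NetDef classes below are stand-ins for the real caffe2_pb2 protobuf types (with Caffe2 installed you would parse the actual predict-net .pb into a NetDef instead), and prefer_nnpack is a hypothetical name, not a Caffe2 API.

```python
from dataclasses import dataclass, field

@dataclass
class OperatorDef:          # stand-in for caffe2_pb2.OperatorDef
    type: str
    engine: str = ""

@dataclass
class NetDef:               # stand-in for caffe2_pb2.NetDef
    op: list = field(default_factory=list)

def prefer_nnpack(net: NetDef) -> int:
    """Route every Conv op to the NNPACK engine; return how many were changed."""
    changed = 0
    for op in net.op:
        if op.type == "Conv":
            op.engine = "NNPACK"
            changed += 1
    return changed

# Toy net: two convs around a ReLU. Only the Conv ops are touched.
predict_net = NetDef(op=[OperatorDef("Conv"), OperatorDef("Relu"), OperatorDef("Conv")])
print(prefer_nnpack(predict_net))               # → 2
print([op.engine for op in predict_net.op])     # → ['NNPACK', '', 'NNPACK']
```

With the real protobuf objects the loop body is identical; you would then serialize the modified net back to the .pb file before handing it to speed_benchmark.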