Hardware-accelerated NNAPI tests

I went through the speed benchmark for Android page to test whether our hardware could accelerate the PyTorch models we're working on.
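
For context, the conversion flow from that recipe looks roughly like this for the quantized-core ("quant_core") models benchmarked below. This is a sketch, not my exact script; the NNAPI converter lives in a prototype namespace (`torch.backends._nnapi`), so names may differ between PyTorch versions.

```python
# Rough sketch of the NNAPI conversion recipe for the quant_core models.
import torch
import torch.utils.bundled_inputs
from torch.backends._nnapi.prepare import convert_model_to_nnapi
from torchvision.models.quantization import mobilenet_v2

model = mobilenet_v2(pretrained=True, quantize=True)
model.eval()

# Trace only the quantized core: pull out the quant/dequant stubs and use the
# quantizer to build a quantized example input.
quantizer = model.quant
model.quant = torch.nn.Identity()
model.dequant = torch.nn.Identity()
input_tensor = quantizer(torch.zeros(1, 3, 224, 224))

# NNAPI backends generally prefer NHWC, so convert the example input to
# channels-last and mark it for the converter.
input_tensor = input_tensor.contiguous(memory_format=torch.channels_last)
input_tensor.nnapi_nhwc = True

with torch.no_grad():
    traced = torch.jit.trace(model, input_tensor)
nnapi_model = convert_model_to_nnapi(traced, input_tensor)


# Wrap the NNAPI model in a plain module before bundling inputs, as the recipe does.
class BundleWrapper(torch.nn.Module):
    def __init__(self, mod):
        super().__init__()
        self.mod = mod

    def forward(self, arg):
        return self.mod(arg)


nnapi_model = torch.jit.script(BundleWrapper(nnapi_model))

# Bundle the example input so speed_benchmark_torch can use --use_bundled_input=0,
# then save both variants under the names used in the commands below.
for mod, name in [(traced, "cpu"), (nnapi_model, "nnapi")]:
    torch.utils.bundled_inputs.augment_model_with_bundled_inputs(
        mod, [(torch.utils.bundled_inputs.bundle_large_tensor(input_tensor),)])
    mod.save("mobilenetv2-quant_core-{}.pt".format(name))
```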

The actual result is that the NNAPI model is much slower:

```
./speed_benchmark_torch --pthreadpool_size=1 --model=mobilenetv2-quant_core-cpu.pt --use_bundled_input=0 --warmup=5 --iter=50
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 97858.2. Iters per second: 10.2189

./speed_benchmark_torch --pthreadpool_size=1 --model=mobilenetv2-quant_core-nnapi.pt --use_bundled_input=0 --warmup=5
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 560976. Iters per second: 1.78261
```

Quick sanity check… did I run this test correctly to get the expected speedup from the NNAPI model?
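
In case it's relevant, the bundled input that `--use_bundled_input=0` refers to can be sanity-checked on the host like this (rough sketch; the file name is the one from the commands above):

```python
# Rough sketch: confirm the saved CPU model carries the bundled input that
# --use_bundled_input=0 refers to, and that it runs on the host.
import torch

m = torch.jit.load("mobilenetv2-quant_core-cpu.pt")
bundled = m.get_all_bundled_inputs()   # method added by augment_model_with_bundled_inputs
print("bundled inputs:", len(bundled))
out = m(*bundled[0])                   # run bundled input 0 on the host CPU
print("output shape:", tuple(out.shape))
```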

I compiled PyTorch from source, using a relatively recent origin/master. My platform is Android 11 and the hardware is an NXP i.MX 8M, which claims to support hardware acceleration for NNAPI (it has a built-in NPU).

NXP, however, only provides an acceleration demo for TensorFlow. When I run that demo, I see a significant speedup from the NPU.

So my goal is to find out which part of the stack is failing. Can someone suggest a known-working phone or Android device that demonstrates PyTorch being successfully hardware accelerated? If I can get the expected results there, I can then move on to figuring out whether our hardware is misbehaving and get support from NXP.

Any other debugging advice appreciated.

Thanks for your work on a cool project!


Hi @dennism, try turning on verbose NNAPI logging (viewed via adb logcat) and checking whether the ops are actually running on the accelerator:

https://developer.android.com/ndk/guides/neuralnetworks#nnapi_logs

It sounds like some part of the graph is running on the CPU via the NNAPI reference implementation, which can be slower than regular PyTorch code.
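
A rough sketch of that workflow from a host machine, assuming adb access and that the benchmark binary and model were pushed to /data/local/tmp (the `debug.nn.vlog` property is the one described on the page above; the paths here are just examples):

```python
# Rough sketch: enable verbose NNAPI logging, re-run the benchmark on the
# device, then dump logcat to see which device/driver each operation was
# assigned to (the NNAPI CPU fallback typically shows up as nnapi-reference).
import subprocess

def adb(*args):
    subprocess.run(["adb", *args], check=True)

adb("shell", "setprop", "debug.nn.vlog", "all")   # enable verbose NNAPI logging
adb("logcat", "-c")                               # clear old log output

# Re-run the benchmark on the device (paths are examples).
adb("shell", "/data/local/tmp/speed_benchmark_torch",
    "--pthreadpool_size=1",
    "--model=/data/local/tmp/mobilenetv2-quant_core-nnapi.pt",
    "--use_bundled_input=0", "--warmup=5", "--iter=50")

# Dump the captured log and inspect the NNAPI messages.
adb("logcat", "-d")
```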