Speed benchmarking on Android?

I am interested in knowing how fast some of my models run on the CPUs of a Pixel 3 phone. I am a moderately experienced PyTorch programmer and Linux user, but I have zero experience with Android. I am not looking to build an app right now; I just want to know how fast my model runs on this particular phone.

The TensorFlow repo has a barebones Android benchmark tool for timing the latency of a neural net of your choice on an Android phone: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/benchmark/android/README.md

Has anyone made anything similar for PyTorch?

We have a binary to do this that can run on your Android phone using adb.

To build,

./scripts/build_android.sh \
-DBUILD_BINARY=ON \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')

To run the binary, push it to the device using adb and run the following command:
./speed_benchmark_torch --model=model.pt --input_dims="1,3,224,224" --input_type=float --warmup=10 --iter 10 --report_pep true

@supriyar do you know if FAI-PEP will integrate with PyTorch Mobile?

It should be possible; we already output the total network latency in a format that FAI-PEP accepts.
The flow should be similar to the existing Caffe2 mobile flow, but using the speed_benchmark_torch binary instead.

Thanks! Is there a certain NDK version that is preferred? I know in TensorFlow, they like using old NDK versions for some reason.

Also, do we need the Android SDK to be visible to PyTorch anywhere?

The instructions for speed_benchmark_torch worked for me on the first try!

If anyone else wants to try this on a Pixel 3 Android phone, here is the setup that worked for me:

# in bash shell
cd pytorch  # where I have my `git clone` of pytorch

export ANDROID_ABI=arm64-v8a
export ANDROID_NDK=/path/to/Android/Sdk/ndk/21.0.6113669/

./scripts/build_android.sh \
-DBUILD_BINARY=ON \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')

# speed_benchmark_torch appears in pytorch/build_android/install/bin/speed_benchmark_torch

Next, I followed these instructions to export a resnet18 TorchScript model:

# in python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save("resnet18.pt")
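
As an optional sanity check (my own addition, not required by the benchmark), you can confirm that the traced module reproduces the eager model's output before pushing it to the phone:

# still in python
with torch.no_grad():
    eager_out = model(example)
    traced_out = traced_script_module(example)

# expect True; tiny numerical differences are possible, hence the tolerance
print(torch.allclose(eager_out, traced_out, atol=1e-5))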

Then, I put the files onto the Android device:

# in bash shell on linux host computer that's plugged into Pixel 3 phone
adb shell mkdir /data/local/tmp/pt
adb push build_android/install/bin/speed_benchmark_torch /data/local/tmp/pt
adb push resnet18.pt /data/local/tmp/pt

And finally, I ran the benchmark on the Android device:

# in bash shell on linux host computer that's plugged into Pixel 3 phone
adb shell /data/local/tmp/pt/speed_benchmark_torch \
--model /data/local/tmp/pt/resnet18.pt --input_dims="1,3,224,224" \
--input_type=float --warmup=5 --iter 20

It prints:

Starting benchmark.
Running warmup runs.

Main runs.
Main run finished. Milliseconds per iter: 188.382. Iters per second: 5.30836

Pretty good! I believe resnet18 is about 4 gflop (that is, 2 gmac) per frame, so (4 gflop) / (188 ms) ≈ 21 gflop/s. Not bad for ARM CPUs! (At least I assume it's executing on the ARM CPUs and not any GPUs or other accelerators.)
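
Spelling out that arithmetic as a tiny script (the ~4 gflop figure is my rough ballpark for resnet18 at 224x224, not an exact count):

# in python: back-of-the-envelope throughput from the numbers above
flops_per_frame = 4e9            # rough estimate: ~4 gflop (~2 gmac) per frame
ms_per_iter = 188.382            # reported by speed_benchmark_torch
gflops_per_sec = flops_per_frame / (ms_per_iter / 1000) / 1e9
print(round(gflops_per_sec, 1))  # ~21.2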

Also, this whole process took me about 25 minutes, and everything worked on the first try. I use PyTorch day-to-day, but I have very little experience with Android, and this was also my first time using TorchScript, so I'm surprised and impressed that it was so straightforward.

This thread is very useful and I'm trying to get this working. I can't get past the step where build_android.sh is run without a bunch of errors. You can view my CMakeError.log here. Does anyone know what's going on here? Alternatively, if someone could link me their speed_benchmark_torch executable, that might also work.

@nufsty2, could I see your build command, please?

And, what operating system do you have on the computer where you are compiling the binary?

@solvingPuzzles I'm on Ubuntu 18.04. I actually just got it working. I had to run
git submodule update --init --recursive
within the pytorch clone, as well as run the

./scripts/build_android.sh \
-DBUILD_BINARY=ON \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')

command with sudo -E. I also had to run the Python commands with sudo so it could actually write the .pt file.

@nufsty2 Way to go!

That's a bit odd that sudo was needed to write the .pt file. I haven't had to use sudo for that. Perhaps you're saving the .pt in a write-protected directory?

@solvingPuzzles hmm… maybe? Thanks for the help though!

If anyone runs into this error:

abort_message: assertion "terminating with uncaught exception of type c10::Error: PytorchStreamReader failed locating file bytecode.pkl: file not found ()

just save the model using _save_for_lite_interpreter like this:
traced_script_module._save_for_lite_interpreter("resnet18.pt")

It worked for me ^^
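
For completeness, here is the earlier resnet18 export rewritten with the lite-interpreter save (a sketch; the .ptl extension is just a common convention, not a requirement):

# in python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)

# _save_for_lite_interpreter writes the bytecode.pkl section that
# lite-interpreter builds of speed_benchmark_torch look for
traced_script_module._save_for_lite_interpreter("resnet18.ptl")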

Hi, I am trying to run a quantized model trained via QAT with the qconfig set to qnnpack, but this seems to give an error.

terminating with uncaught exception of type c10::NotImplementedError: Could not run 'quantized::linear' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear' is only available for these backends: [QuantizedCPU, BackendSelect, Functionalize, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta].

However, if I run a quantized model without the qnnpack setting, it runs fine. How is this happening, given that mobile chips are arm64, which requires the qnnpack configuration?
P.S.: I am using a Pixel 7 Pro device with a Tensor G2 processor.

This error means that the input tensor to the op is somehow coming from unquantized ops. You will need to provide a repro.

cc @jerryzh168
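
For anyone hitting the same error, here is a minimal eager-mode QAT sketch (my own illustration; the module, its sizes, and the file name are made up) of the usual way to keep the input to quantized::linear quantized: wrap the model in QuantStub/DeQuantStub and select the qnnpack engine before converting.

# in python
import torch
import torch.ao.quantization as tq  # torch.quantization on older releases

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # quantizes the float input
        self.linear = torch.nn.Linear(64, 10)
        self.dequant = tq.DeQuantStub()  # dequantizes the output back to float

    def forward(self, x):
        x = self.quant(x)  # without this, quantized::linear would receive a plain CPU float tensor
        x = self.linear(x)
        return self.dequant(x)

torch.backends.quantized.engine = "qnnpack"  # match the backend used on the arm64 device

model = TinyModel().train()
model.qconfig = tq.get_default_qat_qconfig("qnnpack")
tq.prepare_qat(model, inplace=True)

# ... QAT training loop goes here ...

model.eval()
quantized = tq.convert(model)

traced = torch.jit.trace(quantized, torch.rand(1, 64))
traced._save_for_lite_interpreter("qat_model.ptl")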

Hi @kimishpatel @jerryzh168, I created a new thread related to this: Why qnnpack configurations? It has a detailed description.