Slow prediction time for PyTorch built from source on Raspberry Pi 3


I just built PyTorch v1.0.1 from source using the following commands (I did it twice, once on a Raspberry Pi 3+ and once in QEMU emulating an ARMv7, yielding the same results):

export NO_CUDA=1
export NO_MKLDNN=1 
export CFLAGS="-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard  -O3"

python3 setup.py bdist_wheel

I did get a valid wheel, but after running a test script that compares prediction performance against Keras+TF, I found that the PyTorch predictions took roughly twice as long as the Keras+TF ones.

What is weird is that when I run the same test (multiple times) on an x86 processor (using the existing PyTorch wheels), the PyTorch predictions are slightly faster than the Keras+TF ones.
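For reference, a comparison like the one described above could be timed with something along these lines. This is only a minimal sketch, not the actual test script: `benchmark` and `dummy_predict` are hypothetical names, and `dummy_predict` is a placeholder that stands in for the real PyTorch or Keras+TF prediction call. It discards a few warm-up calls and reports the median, so that one-off start-up costs do not distort the comparison.

```python
import statistics
import time


def benchmark(predict, sample, n_warmup=5, n_runs=30):
    """Time predict(sample) and return summary statistics in milliseconds."""
    # Warm-up calls are not timed: the first calls often pay one-off costs
    # (lazy initialisation, caches) that would skew the result.
    for _ in range(n_warmup):
        predict(sample)

    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
        "runs": n_runs,
    }


def dummy_predict(values):
    # Hypothetical placeholder workload; replace with the real model call.
    return sum(v * v for v in values)


if __name__ == "__main__":
    print(benchmark(dummy_predict, list(range(10_000))))
```

Passing each framework's prediction function to the same `benchmark` helper, with the same input, keeps the measurement procedure identical on both sides.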

Is there anything I need to do to make the built PyTorch run predictions faster?
I did tweak the compile flags and the runtime flags (like NUM_CPUS=4 && OMP_NUM_THREADS=4 && MKL_NUM_THREADS=4), but the best result I got was still “twice as slow” in predictions.
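One thing worth ruling out with the runtime flags: variables assigned in a shell without `export` are not passed on to child processes, so the Python process may never see them. A minimal sketch of one way to make sure they reach the benchmark is to build the environment explicitly and launch the script with it. Here `thread_env` is a hypothetical helper and `benchmark.py` is a hypothetical script name; whether a given build honours each variable is an assumption that depends on how it was compiled.

```python
import os


def thread_env(num_threads, base_env=None):
    """Return a copy of the environment with the thread-count variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable names are the ones mentioned in the post above.
    for name in ("NUM_CPUS", "OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        env[name] = str(num_threads)
    return env


if __name__ == "__main__":
    env = thread_env(4)
    print({k: env[k] for k in ("NUM_CPUS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")})
    # To run the (hypothetical) benchmark script with these settings:
    #   import subprocess, sys
    #   subprocess.run([sys.executable, "benchmark.py"], env=env, check=True)
```

Setting the variables in the child's environment before it starts matters because threading runtimes typically read them once, at initialisation.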

(Needless to say, the model used for testing TF and PyTorch was the same, with the same number of parameters and the same input. The PyTorch model was exported with JIT.)

One thing to have on ARM is NNPACK. It made a huge difference for me on Android ARM.

Best regards


But isn’t NNPACK built along with PyTorch when I don’t set NO_NNPACK (with the infamous “Brace yourself, we are building NNPACK” message)?

Here is my cmake command:

cmake /home/pi/pytorch \
  -DPYTHON_EXECUTABLE=/usr/bin/python3 \
  -DPYTHON_LIBRARY=/usr/lib/ \
  -DPYTHON_INCLUDE_DIR=/usr/include/python3.5m \
  -DUSE_NUMPY=ON \
  -DNUMPY_INCLUDE_DIR=/usr/local/lib/python3.5/dist-packages/numpy/core/include \
  -DUSE_SYSTEM_NCCL=OFF \
  -DNCCL_INCLUDE_DIR= \
  -DNCCL_ROOT_DIR= \
  -DNCCL_EXTERNAL=0 \
  -DCMAKE_INSTALL_PREFIX=/home/pi/pytorch/torch/lib/tmp_install \
  '-DCMAKE_C_FLAGS= -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard -O3' \
  '-DCMAKE_CXX_FLAGS= -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard -O3' \
  '-DCMAKE_EXE_LINKER_FLAGS= -Wl,-rpath,$ORIGIN -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard -O3' \
  '-DCMAKE_SHARED_LINKER_FLAGS= -Wl,-rpath,$ORIGIN -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard -O3' \
  -DTHD_SO_VERSION=1 \
  -DCMAKE_PREFIX_PATH=/usr/lib/python3/dist-packages

Hello tom,
I think NNPACK is already built alongside PyTorch, as seen in my previous post.

I don’t have the hardware to measure this, so I cannot really comment. It would also depend on the architecture of your net.

Best regards