Mobile Thread number for inference

dschnell · January 27, 2022, 6:14pm

Hi, when I run inference with my QNNPACK-quantized models on Android, tracing shows that the maximum CPU usage is only ~50%. I tried to increase the number of threads via set_num_threads(4), but that didn’t help. Is there any way to increase inference speed by utilizing more CPU cores?

beback4u · January 27, 2022, 6:48pm

@kimishpatel Could you please take a look at this?

kimishpatel · January 31, 2022, 7:32pm

I think increasing the number of threads helps only up to the number of cores available to run them. In particular, modern phone architectures are heterogeneous, with some mix of small, medium, and large cores. What phone are you using? Also, what is the model? If a large part of the runtime is spent outside of, say, QNNPACK-backed operators, then utilization might also be low.

I would say simpleperf would be a good tool to get a handle on this. Are you familiar with it?
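To illustrate the point about time spent outside multi-threaded operators, here is a minimal sketch based on Amdahl's law. The function name and the 50% figure are assumptions chosen for illustration, not measurements from this thread, and the formula assumes identical cores, so it is optimistic for phones with a mix of fast and slow cores.

```python
def amdahl_speedup(parallel_fraction: float, num_threads: int) -> float:
    """Upper bound on speedup when only part of the runtime is multi-threaded.

    parallel_fraction: share of single-threaded runtime spent in operators
                       that actually scale with threads (0.0 to 1.0).
    num_threads:       number of threads, assumed to run on identical cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_threads)


# Hypothetical profile: half of the runtime is in threaded operators.
for n in (1, 2, 4, 8):
    print(n, "threads ->", round(amdahl_speedup(0.5, n), 2), "x at best")
```

With a profile like this, adding threads quickly stops paying off, which is one possible reason a trace would show low overall CPU utilization.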

dschnell · February 2, 2022, 2:48pm

Yeah, I used the Android profiler, i.e. Perfetto, to trace CPU utilization. My device is a Pixel 6 with 8 cores. I use the Java API, and because I couldn’t find an appropriate call there for setting the number of CPUs, I manually edited the TorchScript code to call set_num_threads(4) or set_num_threads(8).

What is the “official” way to set the number of threads used for inference, from the point of view of Java development?

dschnell · February 2, 2022, 2:50pm

The model I am using is a FastSpeech2 acoustic model in combination with a MelGAN vocoder model. The acoustic model inference is relatively fast. The vocoder model is the real bottleneck.

kimishpatel · February 2, 2022, 4:19pm

I am going to suggest using 2 threads. The octa-core CPU on the Pixel 6 has 2 X1, 2 A76, and 4 A55 cores. The 4 A55 cores are pretty slow, so you may get a better speedup with 2 threads. A few follow-up questions:

Is your model quantized?
If not quantized, are you using optimize_for_mobile to transform your model?
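The two-thread suggestion above can be sketched as a simple heuristic: count only the cores whose maximum frequency is close to that of the fastest core. The helper names, the 0.9 threshold, and the frequency values below are assumptions for illustration (the values only approximate a 2 + 2 + 4 layout like the one described). The sysfs path is the standard Linux cpufreq location, but it may not be readable on every device.

```python
from pathlib import Path
from typing import List


def read_max_freqs_khz(cpu_root: str = "/sys/devices/system/cpu") -> List[int]:
    """Read each core's maximum frequency in kHz from sysfs, skipping unreadable entries."""
    freqs = []
    for path in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/cpuinfo_max_freq")):
        try:
            freqs.append(int(path.read_text().strip()))
        except (OSError, ValueError):
            continue
    return freqs


def count_fast_cores(max_freqs_khz: List[int], ratio: float = 0.9) -> int:
    """Count cores whose max frequency is at least `ratio` times the fastest core's."""
    if not max_freqs_khz:
        return 1  # fall back to a single thread when nothing is known
    top = max(max_freqs_khz)
    return sum(1 for freq in max_freqs_khz if freq >= ratio * top)


# Illustrative values only: 4 small, 2 medium, and 2 large cores.
EXAMPLE_FREQS_KHZ = [1_800_000] * 4 + [2_250_000] * 2 + [2_800_000] * 2

print("suggested threads:", count_fast_cores(EXAMPLE_FREQS_KHZ))
```

Lowering the ratio pulls the medium cores into the count as well. Whether that helps latency depends on the device, so the result is better treated as a starting point for measurement than as a rule.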

dschnell · February 2, 2022, 4:36pm

The acoustic model is quantized; the vocoder model is not, but both are processed with optimize_for_mobile. As for the thread count: you are probably right that the slow cores only get in the way. But then again: do we have any influence over which cores inference runs on? Or does the scheduler always pick the fastest cores anyway? Or does PyTorch Mobile automatically set CPU affinity?

kimishpatel · February 2, 2022, 4:54pm

Generally we don’t have control over which cores threads get assigned to. XNNPACK, for fp32 compute, is smart enough to pick a different implementation depending on which core a thread is mapped to, but that still won’t address the issue we see here. In terms of latency, I think sticking to 2 threads will be best. In terms of utilization, however, it is not the best. Addressing this is slightly more complex.
I have another set of follow-ups:

Is your end goal maximizing resource utilization in such a way that it reduces latency?
Can you open an issue on the PyTorch GitHub and assign it to me?

I do think this is an interesting problem that we may need to look into.
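One way to check the latency claim on a particular device is to time inference at several thread counts and compare medians. The sketch below is a generic harness under stated assumptions: `set_threads` and `run_once` are hypothetical callables supplied by the caller, and nothing here is a PyTorch Mobile API.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def benchmark_thread_counts(
    set_threads: Callable[[int], None],
    run_once: Callable[[], object],
    thread_counts: Iterable[int],
    warmup: int = 2,
    repeats: int = 5,
) -> Dict[int, float]:
    """Return the median latency in seconds of run_once() for each thread count."""
    results: Dict[int, float] = {}
    for n in thread_counts:
        set_threads(n)
        for _ in range(warmup):  # warm-up runs are not timed
            run_once()
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_once()
            samples.append(time.perf_counter() - start)
        results[n] = statistics.median(samples)
    return results


def best_thread_count(results: Dict[int, float]) -> int:
    """Pick the thread count with the lowest median latency."""
    return min(results, key=results.get)
```

In a Python setup, `set_threads` could be `torch.set_num_threads` and `run_once` a lambda that calls the model on a fixed input. On Android the same loop would have to be written against whatever thread-setting call the app uses.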

dschnell · February 2, 2022, 5:11pm

Yes, that’s exactly what we are trying to do. In our case of TTS models, the user is quite sensitive to latency.
I understand that this depends heavily on the type of phone. On some phones we can use the GPU, or even a TPU, as on the Pixel 6 with its Tensor chip. On others, we’d need the CPUs. We probably even need to provide different models depending on the phone.

I wanted to know how fast our models would be on the CPUs of a current, up-to-date phone. Using the GPU or TPU on that phone would probably decrease the latency dramatically.

I will open an issue and assign it to you.

dschnell · February 2, 2022, 5:20pm

Btw, is there a specification of exactly which TorchScript capabilities PyTorch Mobile currently supports? We ran into some issues when converting our models from PyTorch. Also, when we tried to convert them to NNAPI, there were some non-obvious issues. I don’t want to hijack this thread for a different subject, but maybe you can give me some pointers so I can educate myself better?

kimishpatel · February 2, 2022, 5:51pm

If you can split your model into quantized and fp32 parts, you can try the NNAPI workflow to lower the quantized part (and maybe even the fp32 part). Have you tried that? See the tutorial “(Beta) Convert MobileNetV2 to NNAPI” (PyTorch Tutorials 1.10.1+cu102 documentation).

dschnell · February 3, 2022, 9:57am

I have opened an issue on GitHub: https://github.com/pytorch/pytorch/issues/72252. I cannot assign it, though.