How to use more than one CPU to run PyTorch inference on the aarch64 architecture

I am running a Transformers BertForTokenClassification model on an aarch64 PC. In the inference stage, I found that the speed is very slow.
Most of the time is spent in the following call:

   outputs = self.model(input_ids, input_mask, segment_ids)

Then I used the watch command and found that only 1 CPU was used during inference, although this machine has 8 CPUs.
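For reference, this is a minimal way to check the thread count from within PyTorch (torch.get_num_threads() reports the intra-op thread pool size):

    import torch

    # Number of threads PyTorch uses for intra-op parallelism (e.g. GEMMs)
    print(torch.get_num_threads())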

Generally speaking, PyTorch on the x86 architecture uses all CPU cores by default, so on the aarch64 architecture, what additional settings do I need to use all the CPUs?

I don’t know how you’ve installed PyTorch, but I would assume you’ve built it from source (or did you find pip wheels for aarch64?).
If so, does your CPU support NEON, and was PyTorch detecting it during the build? That would enable vectorized operations and could yield a speedup. Also, OpenBLAS should be able to use all your cores.
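If it helps, here is a rough sketch of what I would check: the compile-time configuration (which lists the BLAS backend and detected CPU capabilities) and the parallelization settings. Setting the thread count to 8 is an assumption based on your 8-CPU machine:

    import torch

    # Compile-time configuration: BLAS backend, CPU capability flags, etc.
    print(torch.__config__.show())

    # Parallelization settings: backend and intra-/inter-op thread counts
    print(torch.__config__.parallel_info())

    # If only 1 thread is reported, raise it explicitly (8 cores assumed)
    torch.set_num_threads(8)

You could also try setting OMP_NUM_THREADS or OPENBLAS_NUM_THREADS in the environment before launching the script, since OpenBLAS picks these up at startup.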

Hi, thank you for your reply. I found the cause of the problem: after I upgraded the PyTorch version from 1.4.0 to 1.8.2, the problem was solved. :grin:
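In case anyone else hits this, a quick sanity check after the upgrade (a minimal sketch):

    import torch

    print(torch.__version__)        # 1.8.2 after the upgrade
    print(torch.get_num_threads())  # should now match the core count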