How to use more than one CPU to run PyTorch inference on the aarch64 architecture

I am running a Transformers BertForTokenClassification model on an aarch64 PC. In the inference stage, I found that the speed is very slow.
Most of the time is spent in the following call:

   outputs = self.model(input_ids, input_mask, segment_ids)

Then I used the watch command and found that only 1 CPU was used during inference, although this machine has 8 CPUs.
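For reference, this is a minimal way to check the thread count from within PyTorch (torch.get_num_threads() reports the intra-op thread pool size):

    import torch

    # Number of threads PyTorch uses for intra-op parallelism (e.g. GEMMs)
    print(torch.get_num_threads())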

Generally speaking, PyTorch on the x86 architecture uses all CPU cores by default, so on the aarch64 architecture, what additional settings do I need to use all the CPUs?

I don’t know how you’ve installed PyTorch, but I would assume you’ve built it from source (or did you find pip wheels for aarch64?).
If so, does your CPU support NEON, and was PyTorch detecting it during the build? That would enable vectorized operations and could yield a speedup. Also, OpenBLAS should be able to use all your cores.
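If it helps, here is a rough sketch of what I would check: the compile-time configuration (which lists the BLAS backend and detected CPU capabilities) and the parallelization settings. Setting the thread count to 8 is an assumption based on your 8-CPU machine:

    import torch

    # Compile-time configuration: BLAS backend, CPU capability flags, etc.
    print(torch.__config__.show())

    # Parallelization settings: backend and intra-/inter-op thread counts
    print(torch.__config__.parallel_info())

    # If only 1 thread is reported, raise it explicitly (8 cores assumed)
    torch.set_num_threads(8)

You could also try setting OMP_NUM_THREADS or OPENBLAS_NUM_THREADS in the environment before launching the script, since OpenBLAS picks these up at startup.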

Hi, thank you for your reply. I found the cause of the problem: after I upgraded the PyTorch version from 1.4.0 to 1.8.2, the problem was solved. :grin:
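In case anyone else hits this, a quick sanity check after the upgrade (a minimal sketch):

    import torch

    print(torch.__version__)        # 1.8.2 after the upgrade
    print(torch.get_num_threads())  # should now match the core count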