I am running a Transformers BertForTokenClassification model on an aarch64 machine. During inference I found that it runs very slowly.
Most of the time is spent in the following call:
outputs = self.model(input_ids, input_mask, segment_ids)
Using the watch command I found that only 1 CPU core is used during inference, even though the machine has 8 cores.
Generally speaking, PyTorch on x86 uses all CPU cores by default, so on aarch64, what additional settings do I need so that all cores are used?
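For context, here is a minimal sketch of how the intra-op thread count can be inspected and set in PyTorch. The core count of 8 is taken from the question; whether this actually fixes the slowdown on aarch64 depends on how the installed PyTorch build was compiled (e.g. with or without OpenMP support), so this is a diagnostic starting point, not a guaranteed fix:

```python
import os

# OpenMP reads OMP_NUM_THREADS at load time, so set it before importing torch.
os.environ.setdefault("OMP_NUM_THREADS", "8")

import torch

# Show how many intra-op threads PyTorch currently plans to use.
# If this prints 1, PyTorch is indeed restricted to a single core.
print("intra-op threads:", torch.get_num_threads())

# Explicitly request all 8 cores for intra-op parallelism.
torch.set_num_threads(8)
print("intra-op threads after set:", torch.get_num_threads())
```

If `torch.get_num_threads()` already reports 8 but only one core is busy, the bottleneck is more likely in the build itself (a PyTorch wheel compiled without a parallel backend for aarch64), in which case building from source or using a vendor-provided wheel may be necessary.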