I’m trying to use a RoBERTa model for CPU inference in a production environment. The model is trained in Python and then exported to TorchScript for inference in Java (using the libtorch library).
During inference I see that multiple threads and cores are being used. On a 40-core machine, htop shows 20 threads running and CPU utilization at around 1700%. On a 16-core machine I get around 450% CPU utilization.
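To put those htop figures in terms of cores, here is a rough back-of-the-envelope sketch. It assumes htop's per-process CPU% counts 100% as one fully busy core; the helper names `busy_cores` and `machine_utilization` are made up for illustration.

```python
def busy_cores(htop_cpu_percent: float) -> float:
    """Convert an htop CPU% reading into an equivalent number of busy cores.

    Assumption: 100% corresponds to one fully busy core.
    """
    return htop_cpu_percent / 100.0


def machine_utilization(htop_cpu_percent: float, total_cores: int) -> float:
    """Fraction of the whole machine in use (0.0 to 1.0)."""
    return busy_cores(htop_cpu_percent) / total_cores


# Figures quoted above: (htop CPU %, cores on the machine)
for percent, cores in [(1700, 40), (450, 16)]:
    print(
        f"{cores}-core machine: ~{busy_cores(percent):.1f} cores busy, "
        f"fraction of machine = {machine_utilization(percent, cores):.3f}"
    )
```

In both cases the process keeps well under half of the machine busy, which is the gap I am asking about.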
According to the "CPU threading and TorchScript inference" page of the PyTorch 1.12 documentation, the default number of threads PyTorch uses for intra-op parallelism is the number of CPU cores.
Since I don’t observe these numbers in htop (or in other monitoring tools), I wonder whether scripted models use a different default configuration than the one described in the doc.
Can someone please shed light on this for me? Also, any suggestions on how to speed up CPU inference under libtorch would be greatly appreciated!