Low CPU Utilization and Slow Inference with PyTorch and KServe

Hello everyone,

I’m currently facing an issue with low CPU utilization and slow inference while using PyTorch and KServe. Here are the details of my setup:

  • I’m running my application on a c5.4xlarge Amazon EC2 instance.
  • KServe is being used to initialize and scale the machines.
  • I have a minimum of 5 instances allocated for serving.
  • Each pod is allocated 14 cores and 26GB of RAM.
  • I’m running a Hugging Face LLM (large language model).
  • The configuration file I’m using has the following settings:
vmargs=-XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=50.0

Despite hammering the nodes with a load test, the CPU utilization reported by Grafana stays low, between 15% and 30%. Each inference request also takes approximately 0.9 seconds to process and return to the queue, and I would expect CPU usage to be high during that time.

I have tried adjusting the maxWorkers parameter in the model_snapshot configuration, varying it from 1 to 8. Surprisingly, this did not result in any performance enhancements; in fact, it seemed to negatively impact performance.
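
One possible explanation we have considered is thread oversubscription. The rough sketch below only illustrates the arithmetic: it assumes each worker process starts its own compute thread pool sized to the pod's core count, which we have not verified for this setup. The helper name `oversubscription` is hypothetical.

```python
def oversubscription(cores: int, workers: int, threads_per_worker: int) -> float:
    """Ratio of compute threads to available cores (1.0 means fully subscribed)."""
    if cores <= 0 or workers <= 0 or threads_per_worker <= 0:
        raise ValueError("all arguments must be positive")
    return workers * threads_per_worker / cores


if __name__ == "__main__":
    cores = 14  # cores allocated to each pod, from the setup above
    for workers in (1, 2, 4, 8):
        # Assumption: every worker defaults to one thread per core.
        ratio = oversubscription(cores, workers, threads_per_worker=cores)
        print(f"maxWorkers={workers}: {workers * cores} threads, {ratio:.1f}x the cores")
```

If that assumption holds, raising maxWorkers without lowering the per-worker thread count would make the workers compete for the same 14 cores, which might explain why more workers made things worse.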

I would like to understand why the CPU utilization remains low despite the load, and whether calling torch.set_num_threads is necessary for proper scaling. Any insights or suggestions on improving performance and CPU utilization would be greatly appreciated. I have read through the documentation, and after reviewing it with teammates we are still unsure which steps to take to fully utilize our hardware.
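
For reference, this is a minimal sketch of how we understand the thread count could be set per worker. The helper names `threads_per_worker` and `limit_torch_threads` are hypothetical, and the even split of cores across workers is our assumption, not a documented recommendation. `torch.set_num_threads` and `torch.get_num_threads` are the actual PyTorch calls.

```python
import os


def threads_per_worker(cores: int, workers: int) -> int:
    """Split the available cores evenly across workers, with at least one thread each."""
    if cores <= 0 or workers <= 0:
        raise ValueError("cores and workers must be positive")
    return max(1, cores // workers)


def limit_torch_threads(num_threads: int) -> int:
    """Cap PyTorch's intra-op thread pool and return the value now in effect."""
    import torch  # imported lazily so the helper above works without PyTorch

    torch.set_num_threads(num_threads)
    return torch.get_num_threads()


if __name__ == "__main__":
    # os.cpu_count() may report the host's cores rather than the pod's limit.
    cores = os.cpu_count() or 1
    n = threads_per_worker(cores, workers=4)
    print(f"{cores} visible cores, 4 workers -> {n} threads per worker")
    try:
        print("torch now uses", limit_torch_threads(n), "intra-op threads")
    except ImportError:
        print("PyTorch is not installed; skipping set_num_threads")
```

We are not sure whether a call like this belongs in the model handler's initialization or whether the serving layer is expected to manage it, which is part of what we are asking.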

Thank you in advance for your assistance!

Best regards,