Hello everyone,
I’m currently facing an issue with low CPU utilization and slow inference while using PyTorch and KServe. Here are the details of my setup:
- I’m running my application on a c5.4xlarge Amazon EC2 instance.
- KServe is being used to initialize and scale the machines.
- I have a minimum of 5 instances allocated for serving.
- Each pod is allocated 14 cores and 26GB of RAM.
- I’m running a Hugging Face large language model (LLM).
- The configuration file I’m using has the following settings:
```
vmargs=-XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=50.0
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
enable_envvars_config=true
service_envelope=kservev2
model_store=/mnt/models/model-store
install_py_dep_per_model=true
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"t5xl-llm":{"1.0":{"defaultVersion":true,"marName":"model.mar","minWorkers":1,"maxWorkers":2,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":300}}}}
```
Despite hammering the nodes with a load test, the CPU utilization reported by Grafana remains low, ranging between 15% and 30%. Additionally, each inference request takes approximately 0.9 seconds to process and return, during which time I would expect CPU usage to be high.
I have tried adjusting the maxWorkers parameter in the model_snapshot configuration, varying it from 1 to 8. Surprisingly, this did not improve performance; if anything, it made things worse.
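My current mental model (which may well be wrong, so please correct me) is that each TorchServe worker is a separate process with its own intra-op thread pool, so raising maxWorkers without capping threads per worker could oversubscribe the 14 cores. Here is a sketch of the sizing logic I’m considering adding to my custom handler; the 14 and 2 simply mirror my pod allocation and snapshot settings, and I’m not sure this is the right place to set it:

```python
import os
import torch

# Assumption: split the pod's cores evenly across TorchServe workers so
# each worker's OpenMP/intra-op thread pool doesn't fight the others.
CORES_PER_POD = 14   # CPU cores allocated to the pod
MAX_WORKERS = 2      # maxWorkers from model_snapshot

threads_per_worker = max(1, CORES_PER_POD // MAX_WORKERS)

# Cap PyTorch's intra-op parallelism for this worker process.
torch.set_num_threads(threads_per_worker)

# Also cap OpenMP for any native libraries loaded later in this process.
os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)
```

With maxWorkers=2 this would give each worker 7 threads; with maxWorkers=8 it would drop to 1, which might explain why higher worker counts hurt rather than helped.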
I would like to understand why the CPU utilization remains low despite the load, and whether calling torch.set_num_threads is necessary for proper scaling. Any insights or suggestions on improving CPU performance and utilization would be greatly appreciated. I have read through the documentation at these links, and after reviewing it with teammates we’re still unsure how to fully utilize our hardware:
- 20. Serving large models with Torchserve — PyTorch/Serve master documentation
- Grokking PyTorch Intel CPU performance from first principles — PyTorch Tutorials 2.0.1+cu117 documentation
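One thing from the second link that we have not tried yet is TorchServe’s built-in CPU launcher, which (if I’ve understood the tutorial correctly) handles core pinning and thread affinity automatically. My reading is that it would be enabled in the same config.properties shown above, roughly like this, though I’d appreciate confirmation that these are the right settings for a c5 instance:

```
cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core
```

Would this be expected to interact with, or replace, manually calling torch.set_num_threads?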
Thank you in advance for your assistance!
Best regards,
Brooks