Hello everyone,
I’m currently facing an issue with low CPU utilization and slow inference while using PyTorch and KServe. Here are the details of my setup:
- I’m running my application on a c5.4xlarge Amazon EC2 instance.
- KServe is being used to initialize and scale the machines.
- I have a minimum of 5 instances allocated for serving.
- Each pod is allocated 14 cores and 26GB of RAM.
- I’m running a Hugging Face large language model (LLM).
- The configuration file I’m using has the following settings:
```
vmargs=-XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=50.0
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
enable_envvars_config=true
service_envelope=kservev2
model_store=/mnt/models/model-store
install_py_dep_per_model=true
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"t5xl-llm":{"1.0":{"defaultVersion":true,"marName":"model.mar","minWorkers":1,"maxWorkers":2,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":300}}}}
```
Despite hammering the nodes with a load test, the CPU utilization reported by Grafana remains low, ranging between 15% and 30%. Additionally, each inference request takes approximately 0.9 seconds to process and return, during which time I would expect CPU usage to be high.
I have tried adjusting the maxWorkers parameter in the model_snapshot configuration, varying it from 1 to 8. Surprisingly, this did not improve performance; if anything, it made things worse.
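My current mental model (which may well be wrong, so please correct me) is that each TorchServe worker is a separate process with its own intra-op thread pool, so raising maxWorkers without capping threads per worker could oversubscribe the 14 cores. Here is a sketch of the sizing logic I’m considering adding to my custom handler; the 14 and 2 simply mirror my pod allocation and snapshot settings, and I’m not sure this is the right place to set it:

```python
import os
import torch

# Assumption: split the pod's cores evenly across TorchServe workers so
# each worker's OpenMP/intra-op thread pool doesn't fight the others.
CORES_PER_POD = 14   # CPU cores allocated to the pod
MAX_WORKERS = 2      # maxWorkers from model_snapshot

threads_per_worker = max(1, CORES_PER_POD // MAX_WORKERS)

# Cap PyTorch's intra-op parallelism for this worker process.
torch.set_num_threads(threads_per_worker)

# Also cap OpenMP for any native libraries loaded later in this process.
os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)
```

With maxWorkers=2 this would give each worker 7 threads; with maxWorkers=8 it would drop to 1, which might explain why higher worker counts hurt rather than helped.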
I would like to understand why the CPU utilization remains low despite the load, and whether calling torch.set_num_threads is necessary for proper scaling. Any insights or suggestions on improving CPU performance and utilization would be greatly appreciated. I have read through the documentation at these links, and after reviewing it with teammates we’re still unsure how to fully utilize our hardware:
- 20. Serving large models with Torchserve — PyTorch/Serve master documentation
- Grokking PyTorch Intel CPU performance from first principles — PyTorch Tutorials 2.0.1+cu117 documentation
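One thing from the second link that we have not tried yet is TorchServe’s built-in CPU launcher, which (if I’ve understood the tutorial correctly) handles core pinning and thread affinity automatically. My reading is that it would be enabled in the same config.properties shown above, roughly like this, though I’d appreciate confirmation that these are the right settings for a c5 instance:

```
cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core
```

Would this be expected to interact with, or replace, manually calling torch.set_num_threads?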
Thank you in advance for your assistance!
Best regards,
Brooks