Why is there still high CPU usage on the parameter server after it finishes self.optimizer.step() and self.optimizer.zero_grad() (CPU only) and is doing nothing?

I built a parameter server (PS) architecture to train VGG16 using torch RPC. I ran one PS and one worker on the same host and used only the CPU to train and update the model.

During training I measured the CPU usage of both the worker and the PS every 0.015 s. I noticed something odd: after the PS finishes self.optimizer.step() and self.optimizer.zero_grad(), its CPU usage stays as high as before (around 1900%, i.e., about 19 vCPU cores busy) for a non-negligible time.
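
For context, the measurement loop works roughly like the sketch below. It is a minimal sketch, not my exact code: the use of psutil and reading the PS pid from argv are assumptions, but any per-process sampler would produce equivalent numbers.

```python
import logging
import sys
import time

import psutil  # assumption: a per-process sampler such as psutil

logging.basicConfig(format="%(asctime)s - %(name)s - %(message)s", level=logging.INFO)
logger = logging.getLogger("job0")

proc = psutil.Process(int(sys.argv[1]))  # pid of the PS process (hypothetical CLI arg)
proc.cpu_percent()  # prime the counter; the first call always returns 0.0

while True:
    time.sleep(0.015)  # the 0.015 s sampling interval used in the logs below
    # cpu_percent() sums across cores, so 1900% means ~19 vCPUs busy.
    # proc.num_threads() additionally reports the process's thread count.
    logger.info("%.1f", proc.cpu_percent())
```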

The logs of the PS’s CPU usage are below.

TIME - JOB NAME - CPU USAGE
2022-11-17 14:29:27,728 - job0 - 197.8
 ### the start of self.optimizer.step() and self.optimizer.zero_grad()
2022-11-17 14:29:27,744 - job0 - 2170.2
 ### the end of self.optimizer.step() and self.optimizer.zero_grad()
 ### the start of time.sleep(1)
2022-11-17 14:29:27,759 - job0 - 2039.9
2022-11-17 14:29:27,775 - job0 - 1972.5
2022-11-17 14:29:27,791 - job0 - 1965.6
2022-11-17 14:29:27,806 - job0 - 2035.0
2022-11-17 14:29:27,822 - job0 - 2036.5
2022-11-17 14:29:27,838 - job0 - 2020.4
2022-11-17 14:29:27,853 - job0 - 2032.5
2022-11-17 14:29:27,869 - job0 - 1909.2
2022-11-17 14:29:27,884 - job0 - 1911.5
2022-11-17 14:29:27,900 - job0 - 1975.1
2022-11-17 14:29:27,915 - job0 - 1974.4
2022-11-17 14:29:27,930 - job0 - 1973.2
2022-11-17 14:29:27,946 - job0 - 1443.4
2022-11-17 14:29:27,961 - job0 - 66.1
2022-11-17 14:29:27,977 - job0 - 0.0
2022-11-17 14:29:27,993 - job0 - 0.0
...
 ### the end of time.sleep(1)
...

After self.optimizer.step() and self.optimizer.zero_grad(), there is still high CPU usage on the PS, even though I call time.sleep(1). I checked and found around 20 threads running, but I have no idea what they are doing.

I want to know why there is still high CPU usage after the parameter update even though I let the PS sleep for a while. Is it because the CPU tensor implementation performs some asynchronous or pipelined operations?

Hi @Zeyu-ZEYU, you observe about 20 worker threads because every RPC rank has a threadpool that holds a number of background threads to handle incoming requests. The default is 16, and it can be configured via the num_worker_threads argument (see Distributed RPC Framework — PyTorch master documentation). Regarding the high CPU utilization, there could be many reasons (the threads could be performing some work), so we would need a reproducible example to investigate. Operations can be asynchronous for RPC.
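
For example, configuring it looks like this (a minimal sketch; the rank name, world size, and address are placeholders):

```python
import os

import torch.distributed.rpc as rpc

# placeholders for a single-host setup
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

# num_worker_threads sizes the background threadpool that serves
# incoming RPC requests on this rank (default: 16).
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=32)

rpc.init_rpc(
    name="ps",      # placeholder rank name
    rank=0,
    world_size=2,   # placeholder: one PS + one worker
    rpc_backend_options=options,
)
rpc.shutdown()
```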

Hi @H-Huang , thank you very much for the reply.

I performed more measurements and found that RPC is not the reason for the high CPU usage. After the worker finishes receiving all updated parameters sent from the PS through synchronous RPC, the PS's CPU usage stays as high as before even though the PS does nothing.

Since I use the CPU for the parameter update, I suspect OpenMP threads are doing something to improve CPU performance. I would expect that when optimizer.step() and optimizer.zero_grad() are done, the computation should also stop. But in fact, even if I add time.sleep() right after optimizer.zero_grad(), CPU usage stays high. The problem can be reproduced in single-machine training. Here is a reproducible example with my results: GitHub - Zeyu-ZEYU/pytoch-cpu-usage-example: A reproducible example of an issue of PyTorch's CPU usage.
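
The gist of the single-machine reproduction is along these lines (a condensed sketch of the idea, not the exact repo code; batch size and hyperparameters are arbitrary):

```python
import time

import torch
import torchvision

model = torchvision.models.vgg16()  # CPU-only, no .cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# one training step on random data
inputs = torch.randn(4, 3, 224, 224)
targets = torch.randint(0, 1000, (4,))
loss = criterion(model(inputs), targets)
loss.backward()

t0 = time.time()
optimizer.step()
optimizer.zero_grad()
print(f"update took {time.time() - t0:.3f}s")

# the process now does nothing, yet an external monitor (e.g. top/htop)
# still shows high multi-core CPU usage during this sleep
time.sleep(1)
```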

The OpenMP threads do parallel work on CPU tensors during the parameter update. But why do the threads keep running for a while (about 8x the update time) after the update is done? I have no idea. Do you have any ideas or suggestions? Thanks a lot.
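
One way to test the OpenMP hypothesis might be to cap PyTorch's intra-op thread pool before any tensor work and see whether the trailing CPU usage scales with the thread count (a sketch; torch.set_num_threads controls the intra-op parallelism used by CPU ops such as the optimizer math):

```python
import torch

# Cap the intra-op (OpenMP) thread pool before creating any tensors.
# If the trailing CPU usage disappears with a single thread, the busy
# threads after optimizer.step() likely belong to this pool.
torch.set_num_threads(1)
print(torch.get_num_threads())           # verify the setting took effect
print(torch.__config__.parallel_info())  # shows the OpenMP/ATen threading config
```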