I built a parameter server (PS) architecture to train VGG16 using torch RPC (torch.distributed.rpc). I ran one PS and one worker on the same host, using only the CPU to train and update the model.
During training I measured the CPU usage of both the worker and the PS every 0.015 s. I found something odd: after the PS finishes self.optimizer.step() and self.optimizer.zero_grad(), its CPU usage stays as high as before (around 1900%, i.e. roughly 19 vCPU cores busy) for a non-negligible time.
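For reference, the measurement works roughly like the stdlib sketch below (the actual sampler monitors the PS process externally; this hypothetical version samples the current process to show the idea). It reports process CPU time, summed over all threads, as a percentage of wall time, so a reading of 1900.0 means about 19 cores busy:

```python
import time

def cpu_percent(prev_cpu, prev_wall):
    """Return (percent, cpu_now, wall_now). The percentage is process CPU
    time across all threads since the previous sample, divided by wall
    time, so 1900.0 means roughly 19 cores busy."""
    cpu_now = time.process_time()   # CPU time summed over all threads
    wall_now = time.monotonic()
    pct = 100.0 * (cpu_now - prev_cpu) / max(wall_now - prev_wall, 1e-9)
    return pct, cpu_now, wall_now

def sample(duration=0.1, interval=0.015):
    """Sample CPU usage every `interval` seconds, like my 0.015 s logger."""
    readings = []
    cpu, wall = time.process_time(), time.monotonic()
    end = time.monotonic() + duration
    while time.monotonic() < end:
        time.sleep(interval)
        pct, cpu, wall = cpu_percent(cpu, wall)
        readings.append(round(pct, 1))
    return readings
```

In the real setup an external monitor (e.g. psutil or top) plays this role, but the metric is the same all-cores percentage.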
The logs of the PS’s CPU usage are below.
TIME - JOB NAME - CPU USAGE
2022-11-17 14:29:27,728 - job0 - 197.8
### the start of self.optimizer.step() and self.optimizer.zero_grad()
2022-11-17 14:29:27,744 - job0 - 2170.2
### the end of self.optimizer.step() and self.optimizer.zero_grad()
### the start of time.sleep(1)
2022-11-17 14:29:27,759 - job0 - 2039.9
2022-11-17 14:29:27,775 - job0 - 1972.5
2022-11-17 14:29:27,791 - job0 - 1965.6
2022-11-17 14:29:27,806 - job0 - 2035.0
2022-11-17 14:29:27,822 - job0 - 2036.5
2022-11-17 14:29:27,838 - job0 - 2020.4
2022-11-17 14:29:27,853 - job0 - 2032.5
2022-11-17 14:29:27,869 - job0 - 1909.2
2022-11-17 14:29:27,884 - job0 - 1911.5
2022-11-17 14:29:27,900 - job0 - 1975.1
2022-11-17 14:29:27,915 - job0 - 1974.4
2022-11-17 14:29:27,930 - job0 - 1973.2
2022-11-17 14:29:27,946 - job0 - 1443.4
2022-11-17 14:29:27,961 - job0 - 66.1
2022-11-17 14:29:27,977 - job0 - 0.0
2022-11-17 14:29:27,993 - job0 - 0.0
...
### the end of time.sleep(1)
...
After self.optimizer.step() and self.optimizer.zero_grad(), CPU usage on the PS remains high for roughly 200 ms, even though I call time.sleep(1) immediately afterwards. I checked and found around 20 threads running, but I have no idea what they are doing.
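This is how I inspected the threads (a sketch of my check, not the exact code). Note that threading.enumerate() only lists Python-level threads; native threads such as a C++ intra-op pool carry no Python frame, so I also count OS-level threads via /proc (a Linux-only assumption):

```python
import os
import sys
import threading
import traceback

def dump_threads():
    """Print a stack trace per Python thread and return
    (python_thread_names, os_thread_count)."""
    frames = sys._current_frames()
    names = []
    for t in threading.enumerate():
        names.append(t.name)
        frame = frames.get(t.ident)
        if frame is not None:
            print(f"--- {t.name} (daemon={t.daemon}) ---")
            traceback.print_stack(frame)
    # Native threads (no Python frame) still show up as OS tasks.
    n_os = len(os.listdir(f"/proc/{os.getpid()}/task"))
    return names, n_os
```

The OS-level count is what came out near 20; most of those threads have no Python stack, which is why I can't tell what they are doing from Python alone.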
I want to know why there is still high CPU usage after the parameter update even though I let the PS sleep for a while. Is it because the CPU tensor implementation performs some asynchronous or pipelined operations?
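One thing I did rule out: time.sleep(1) only idles the calling thread, so if other threads (whatever they are) keep working, the process-wide CPU reading stays high. A minimal stdlib sketch (not my PS code) shows this:

```python
import threading
import time

def busy_loop(stop):
    # Spin until asked to stop, burning one core.
    while not stop.is_set():
        pass

def cpu_during_sleep(sleep_s=0.2):
    """Sleep in the calling thread while one background thread spins;
    return the process CPU time consumed during the sleep."""
    stop = threading.Event()
    t = threading.Thread(target=busy_loop, args=(stop,), daemon=True)
    cpu0 = time.process_time()
    t.start()
    time.sleep(sleep_s)   # the "idle" main thread, like my PS
    stop.set()
    t.join()
    return time.process_time() - cpu0
```

Here cpu_during_sleep(0.2) returns a CPU time close to the sleep duration, even though the main thread did nothing. So the sleep in my PS does not contradict the high readings; the question is what those ~20 threads are still computing after step()/zero_grad() return.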