CPU utilization of main process is very high!

Hi there, I am noticing that the CPU utilization of my main process is much higher than that of any of my dataloader worker processes, even though I am doing almost no computation in the main process and am merely pushing data from host to device. Moreover, the larger I set num_workers, the higher the main process's CPU utilization becomes. Is this expected behavior, and can anybody explain why it is happening?

You might want to check with a microbenchmark how much CPU load the host-to-device copies incur, as this could be expected, especially if non-blocking copies are used.
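One way such a microbenchmark could look is sketched below. It compares wall-clock time with process CPU time around repeated pinned host-to-device copies. The helper names (`cpu_utilization`, `make_h2d_step`) and the 256 MiB buffer size are hypothetical choices for illustration, not something from this thread, and the copy part assumes PyTorch with a CUDA device is available.

```python
import time


def cpu_utilization(fn, iters=20):
    """Call fn() iters times; return (wall_s, cpu_s, cpu_s / wall_s).

    time.process_time() sums user + system CPU time over all threads of
    this process, so the ratio can exceed 1.0 when several threads are busy.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(iters):
        fn()
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    ratio = cpu_s / wall_s if wall_s > 0 else 0.0
    return wall_s, cpu_s, ratio


def make_h2d_step(nbytes=256 * 1024**2, non_blocking=True):
    """Build a closure doing one pinned host-to-device copy (size is an assumption)."""
    import torch  # imported lazily so the timing helper works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA device required for the copy benchmark")
    src = torch.empty(nbytes, dtype=torch.uint8).pin_memory()
    dst = torch.empty(nbytes, dtype=torch.uint8, device="cuda")

    def step():
        dst.copy_(src, non_blocking=non_blocking)
        # Wait for the copy so the wall time covers the whole transfer.
        torch.cuda.synchronize()

    return step


def main():
    try:
        step = make_h2d_step()
    except (ImportError, RuntimeError) as exc:
        print(f"skipping copy benchmark: {exc}")
        return
    step()  # warm-up iteration, excluded from the measurement
    wall_s, cpu_s, ratio = cpu_utilization(step)
    print(f"wall {wall_s:.3f}s  cpu {cpu_s:.3f}s  utilization {ratio:.0%}")


if __name__ == "__main__":
    main()
```

Running this once with `non_blocking=True` and once with `non_blocking=False` would show whether the copies alone account for the load. Keep in mind that the `torch.cuda.synchronize()` call may itself register as CPU time if the CUDA runtime busy-waits, so a high ratio here does not necessarily mean the copy is doing real CPU work.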

I am using non_blocking calls in this case, and it’s still very high.

It also turns out that the CPU utilization of the main process exceeds 100% even when I use just a single background worker in my dataloader, when the only real CPU work in the main process is the host-to-device transfer of tensors. For context, every iteration transfers 1.1 GB from host to device at about 10 GB/s, which is still much lower than the peak theoretical bandwidth for the PCIe GH100.
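For scale, the figures above imply roughly 0.11 s of transfer per iteration. A quick check of that arithmetic, assuming decimal units (1 GB = 10^9 bytes) and a hypothetical helper name `transfer_seconds`:

```python
def transfer_seconds(num_bytes, bytes_per_second):
    """Time needed to move num_bytes at a sustained bandwidth."""
    return num_bytes / bytes_per_second


# 1.1 GB per iteration at about 10 GB/s, as reported above.
per_iteration = transfer_seconds(1.1e9, 10e9)
print(f"{per_iteration * 1000:.0f} ms per iteration")
```

Comparing this number with the measured time per iteration would show how much of each step the main process actually spends waiting on the copy.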

@eqy Any ideas as to why this might be happening?