DataLoader very slow in multi-GPU setting


I am trying to train a network with 4-8 GPUs (however many are available at the time) on a server. With 4 GPUs, I can fit 100 samples per batch; this consumes 9-10 GB of the 12 GB available per GPU (Titan Xp, 40 CPU cores).

I also tried the same experiment on my local machine (GTX 1080, 12 CPU cores) with a smaller batch size.

The problem is that the server takes very long to get ready at the beginning of each epoch (and even longer at the start of training). I tried setting the number of workers anywhere between 0 and 25, to no avail.
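For context, here is a stripped-down sketch of the kind of setup I am using (the dataset is a stand-in for my real one, and the exact values are illustrative; persistent_workers and prefetch_factor require PyTorch 1.7+):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real dataset: 1000 samples of 8 features each.
dataset = TensorDataset(torch.randn(1000, 8))

loader = DataLoader(
    dataset,
    batch_size=100,
    num_workers=4,            # tune toward the number of *free* CPU cores
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs instead of respawning
    prefetch_factor=2,        # batches each worker preloads ahead of time
)

for epoch in range(2):        # workers are spawned once, not once per epoch
    for (batch,) in loader:
        pass                  # forward/backward pass would go here
```

With persistent_workers off, every epoch pays the worker startup cost again, which could explain slowness at the beginning of each epoch.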

I suspect other users are occupying the CPU, hence the delay; there is always some activity on the server. Is there a way to combat this? Also, I use the interpreter on the server remotely through SSH via PyCharm. Is that likely to create any issues?

Thank you,

If you want to check the state of your CPU, you can use the top command to see which processes are running. It also shows each process's priority (nice) level.

Maybe you should also try watch nvidia-smi to see whether your GPUs are actually in use.
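For example (the head count is arbitrary; the GPU line is guarded so it degrades gracefully on a machine without an NVIDIA driver):

```shell
# CPU: one batch-mode snapshot; the NI column is the nice/priority level
top -b -n 1 | head -n 15

# GPU: one snapshot; wrap in `watch -n 1 nvidia-smi` for a live, refreshing view
command -v nvidia-smi >/dev/null && nvidia-smi || echo "nvidia-smi not found"
```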

Hi, I did both. It takes a while for the GPUs to reach capacity. And whenever I check CPU usage with htop, there is always someone utilizing the CPU (though I don't quite understand this, as most of the time multiple users are at 100% CPU utilization).