Hi,
I am trying to train a network with 4-8 GPUs (however many are available at the time) on a server. With 4 GPUs, my network fits 100 samples per batch, which consumes 9-10 GB of the 12 GB available per GPU (Titan Xp). The server has 40 CPU cores.
I also run the same experiment on my local machine (GTX 1080 with 12 CPU cores), with a smaller batch size.
The problem is that the server takes a very long time to get ready at the beginning of each epoch (and even longer at the beginning of training). I tried setting the number of workers anywhere from 0 to 25, to no avail.
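To put numbers on this, a minimal sketch like the one below could time the first batch of an epoch separately from the later ones. It assumes the loader is any Python iterable; `time_batches` and `slow_start_loader` are hypothetical names used only for illustration, not part of my training code.

```python
import time
from typing import Iterable, Tuple


def time_batches(loader: Iterable, max_batches: int = 20) -> Tuple[float, float]:
    """Return (seconds until the first batch, mean seconds per later batch)."""
    start = time.perf_counter()
    first_batch_time = 0.0
    later_times = []
    last = start
    for i, _batch in enumerate(loader):
        now = time.perf_counter()
        if i == 0:
            # Includes iterator creation, i.e. any per-epoch startup cost.
            first_batch_time = now - start
        else:
            later_times.append(now - last)
        last = now
        if i + 1 >= max_batches:
            break
    mean_later = sum(later_times) / len(later_times) if later_times else 0.0
    return first_batch_time, mean_later


def slow_start_loader(n_batches: int, startup: float, per_batch: float):
    """Hypothetical stand-in for a real loader: slow startup, then steady batches."""
    time.sleep(startup)  # imitates worker startup or dataset initialisation
    for i in range(n_batches):
        time.sleep(per_batch)
        yield i
```

Calling `time_batches(loader)` once per epoch on the real loader, on both the server and the local machine, should show whether the delay sits in the startup of each epoch or in steady-state loading.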
I suspect that other users occupying the CPU are causing the delay, since there’s always some activity on the server. Is there a way to combat this? Also, I run the interpreter on the server remotely over SSH via PyCharm. Is that likely to create any issues?
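One rough way to check the CPU contention theory might be to compare the system load with the number of cores before training starts. The sketch below uses only the Python standard library and assumes a Unix-like server; `cpu_headroom` is a hypothetical helper name. Note that on Linux the load average also counts tasks waiting on disk I/O, so the result is only an estimate.

```python
import os


def cpu_headroom() -> dict:
    """Estimate how many CPU cores are currently free on this machine."""
    total = os.cpu_count() or 1
    # sched_getaffinity is Linux-only; it reports the cores this process may use.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    # 1-minute load average: roughly the number of runnable tasks system-wide.
    load_1min = os.getloadavg()[0]
    return {
        "total_cores": total,
        "usable_cores": usable,
        "load_1min": load_1min,
        "idle_estimate": max(0.0, total - load_1min),
    }
```

If `idle_estimate` is far below the number of workers I request, that would be consistent with other users' jobs starving the data loading, though it would not prove it.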
Thank you,