Using DistributedDataParallel, it will run 8 processes, and each process uses a single GPU. I’m wondering about the CPUs: are they evenly distributed across the 8 processes? Can we specify how many CPUs each process uses?
I’ve seen approaches that set the CPU affinity for a GPU device using nvml, as described here.
However, I don’t know whether (or how) this approach would work for a general PyTorch process, or whether you would benefit from it.
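As a rough sketch of what per-process CPU control could look like (this is not from the linked nvml approach; the core-slicing scheme, the `cores_per_rank` value, and the reliance on the `LOCAL_RANK` environment variable set by `torchrun` are assumptions), you could pin each DDP rank to its own block of cores and cap its intra-op threads:

```python
# Hedged sketch: pin each DDP process to a contiguous slice of logical cores
# and limit its intra-op thread pool. Linux-only (os.sched_setaffinity).
import os
import torch
import torch.distributed as dist

def pin_cpus_for_rank(cores_per_rank: int = 8) -> None:
    # LOCAL_RANK is set by torchrun; defaulting to 0 is an assumption.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    first = local_rank * cores_per_rank
    cores = set(range(first, first + cores_per_rank))
    os.sched_setaffinity(0, cores)          # pin this process to its core slice
    torch.set_num_threads(cores_per_rank)   # cap CPU threads used by this rank

if __name__ == "__main__":
    dist.init_process_group("nccl")
    pin_cpus_for_rank(cores_per_rank=8)
```

Whether this helps depends on the machine’s NUMA layout; the nvml-based approach mentioned above instead derives the affinity mask from the GPU’s ideal CPU set rather than slicing cores by rank.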
DDP is supposed to be used with alternating forward and backward passes. I’m a little surprised that it didn’t throw any error. Please let us know which version of PyTorch you are using; we might have recently and accidentally disabled the check for some code paths.
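For reference, the expected usage pattern is one forward pass followed immediately by one backward pass per iteration, roughly like the minimal sketch below (the model, data, and hyperparameters are placeholders, not taken from the original post; it assumes a launch via `torchrun` so `LOCAL_RANK` is set):

```python
# Minimal sketch of the alternating forward/backward pattern DDP expects.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    x = torch.randn(32, 10, device=local_rank)
    loss = model(x).sum()   # forward pass
    loss.backward()         # backward pass (gradients are all-reduced here)
    optimizer.step()
    optimizer.zero_grad()
```

Running multiple forward passes before a backward pass (or vice versa) is the kind of pattern the internal check is meant to catch.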