Distributed data parallel slowed down by CPU

Hi, I’m new to the DistributedDataParallel (DDP) module.

I have a question about my model, in which I use SyncBatchNorm. When I start training on 2 GPUs in a single node, it seems to use a lot of CPU cores and appears to be limited by the CPU.

I tried to profile my code, and the two most CPU-time-consuming operations are `to` (which is called around 4000 times) and `SyncBatchNorm`.
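For reference, this is roughly how I profiled it. This is a minimal sketch with a toy model standing in for my actual network; the model here is just for illustration:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for the real network (illustration only)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(32, 64)

# Profile CPU activity for one forward pass
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Show the top operators sorted by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

On my real run, `to` and `SyncBatchNorm` dominate this table.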

So, I don’t know if this is normal, or if any optimizations are possible?

Thanks for your help!

Do you have a code snippet that reproduces the issue? From the description, I guess most of the profiler results come from moving your model and parameters from CPU to GPU rather than from the actual training. Can you profile it after moving the model to the GPU? The model should be moved to the GPU before wrapping it with DDP; see the DDP tutorial.
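As a sketch of the recommended order (the `build_ddp_model` helper and the toy model here are hypothetical, assuming a single-node setup with one process per GPU and an already-initialized process group):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model(local_rank: int) -> nn.Module:
    # 1. Build the model (on CPU is fine at this point)
    model = nn.Sequential(nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU())

    # 2. Convert BatchNorm layers to SyncBatchNorm for multi-GPU training
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # 3. Move the model to its GPU BEFORE wrapping with DDP, so DDP
    #    registers GPU parameters instead of copying tensors each step
    model = model.to(f"cuda:{local_rank}")

    # 4. Only now wrap with DDP
    return DDP(model, device_ids=[local_rank])
```

If the `.to()` call happens inside the training loop instead, you pay for a host-to-device copy on every iteration, which matches the large `to` time in your profile.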