Hi, I’m new to the distributed data parallel (DDP) module.
I have some questions. When my model, which uses SyncBatchNorm, starts training on 2 GPUs in a single node, it uses a lot of CPU cores and seems to be limited by the CPU.
I profiled my code, and the two operations that consume the most CPU time are “to” (which is called around 4000 times) and “syncBatchNorm”.
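For reference, here is a minimal sketch of how per-function call counts like the one above can be collected. This is not my actual training code: it assumes Python's built-in cProfile, and `to_device`, `train_step`, `profile_training` and `call_count` are hypothetical stand-ins that use plain lists instead of tensors.

```python
import cProfile
import io
import pstats


def to_device(batch):
    # Hypothetical stand-in for tensor.to(device); here it only copies a list.
    return list(batch)


def train_step(batch):
    # Hypothetical stand-in for one training iteration.
    moved = to_device(batch)
    return sum(moved)


def profile_training(num_steps=100):
    # Run the toy loop under cProfile and return the collected statistics.
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(num_steps):
        train_step([1.0, 2.0, 3.0])
    profiler.disable()
    return pstats.Stats(profiler, stream=io.StringIO())


def call_count(stats, func_name):
    # stats.stats maps (file, line, name) to
    # (primitive calls, total calls, total time, cumulative time, callers).
    return sum(
        values[1]
        for (_, _, name), values in stats.stats.items()
        if name == func_name
    )


if __name__ == "__main__":
    stats = profile_training(num_steps=100)
    print("to_device calls:", call_count(stats, "to_device"))
```

A call count that scales with the number of iterations, as it does in this toy loop, would suggest the calls come from the per-batch path rather than from one-time setup.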
Is this normal? And are any optimizations possible?
Thanks for your help!