Varying iteration time when using PyTorch distributed


I’m making some modifications to MoCo, which runs with PyTorch multiprocessing. The default code gives very consistent iteration times (between 0.1s and 0.13s). After my modification (essentially adding some optimizable conditional normalization layers as input to the ResNet), the runtime has become stochastic: a bit more than half the iterations take ~0.5-0.6s (an increase I was expecting), but some take 1.5-2s.
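For concreteness, here is a rough sketch of what I mean by a conditional normalization layer: a BatchNorm whose scale and shift are predicted from a conditioning vector (FiLM-style). This is a simplified stand-in, not my exact code; the class and variable names here are just for illustration.

```python
import torch
import torch.nn as nn

class ConditionalNorm(nn.Module):
    """Simplified sketch: BatchNorm2d whose affine scale/shift are
    predicted from a conditioning vector instead of being fixed
    learned parameters (FiLM-style modulation)."""

    def __init__(self, num_features: int, cond_dim: int):
        super().__init__()
        # Disable BatchNorm's own affine params; we predict them instead.
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)
        self.beta = nn.Linear(cond_dim, num_features)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        out = self.norm(x)
        # Broadcast predicted scale/shift over the spatial dimensions.
        g = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(cond).unsqueeze(-1).unsqueeze(-1)
        return g * out + b
```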

The dist-backend is nccl and I’m using 7 GPUs, although the same issue appears when using the default 8 GPUs.

I’m wondering whether this stochasticity implies some bug in my implementation and, if so, what the best way is to debug distributed models.
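For reference, this is how I’m measuring per-iteration time. Since CUDA kernel launches are asynchronous, I synchronize before reading the clock so that work isn’t attributed to the wrong iteration (the helper name and structure here are just a sketch of my measurement, not MoCo’s code):

```python
import time
import torch

def timed_step(model, loss_fn, optimizer, inputs, targets):
    """Time one full training step. Synchronizing before and after
    ensures pending async CUDA work is counted in the right step
    (the synchronize calls are no-ops on CPU)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start
```

Logging this per rank shows the same bimodal distribution on every GPU, which is why I suspect something systematic rather than a single slow worker.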


Hey @alet, can you share the implementation of the modified model, especially the “optimizable conditional normalization layers”? Am I correct in assuming the program uses DDP both before and after the modification?