I’m making some modifications to MoCo, which uses PyTorch multiprocessing. Running the default code gives very consistent iteration times (between 0.1s and 0.13s). After my modification (essentially adding some optimizable conditional normalization layers at the input of the ResNet), the iteration times have become stochastic: a bit more than half the iterations take ~0.5–0.6s (an increase I was expecting), but some take 1.5–2s.
The dist-backend is nccl and I’m using 7 GPUs, although the same issue appears when using the default 8 GPUs.
I’m wondering whether this stochasticity points to a bug in my implementation and, if so, what the best way of debugging distributed models is.
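For reference, one thing I wanted to rule out first is a measurement artifact: CUDA kernels launch asynchronously, so naive wall-clock timing can attribute time to the wrong iteration. A minimal timing sketch (names like `timed_iterations` and the toy step are my own, not MoCo code) that synchronizes before reading the clock would look like:

```python
import time
import torch

def timed_iterations(step_fn, n_iters, use_cuda=False):
    """Time each call to step_fn, synchronizing the GPU around the
    measurement so async CUDA execution doesn't skew the numbers."""
    times = []
    for _ in range(n_iters):
        if use_cuda and torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        step_fn()
        if use_cuda and torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return times

# Toy CPU usage: a dummy "training step" standing in for the real MoCo step.
model = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)
times = timed_iterations(lambda: model(x).sum().backward(), n_iters=3)
print(len(times))
```

With synchronized timing in place, the slow iterations still show up, so the extra 1–1.5s seems to be real work (or a real stall) rather than a timing artifact.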