Hi, I'm training on 4 GPUs with DDP (both PyTorch DDP and the Accelerate package).
During training I see a peculiar phenomenon: after N training steps, the utilization of one of the GPUs suddenly drops out of nowhere (see the bottom-left plot).
It always happens around the same training step, but there is nothing unique about that step or the ones before it.
I ran some more experiments, and it seems that the data-loading time and the forward time stay the same, but the backward time gets longer.
The weird thing is that the increase isn't gradual; it's sudden.
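In case it helps, this is roughly how I'm splitting each step into data/forward/backward phases. It's a plain-Python sketch: the phase names and the jump-detection helper are my own, and for accurate GPU numbers each timestamp would need a `torch.cuda.synchronize()` first (or `torch.cuda.Event`), since CUDA kernels run asynchronously.

```python
import time
from collections import defaultdict

class PhaseTimer:
    """Accumulate wall-clock time per phase (data/forward/backward) per step."""

    def __init__(self):
        self.durations = defaultdict(list)  # phase name -> seconds per step
        self._phase = None
        self._start = None

    def start(self, phase):
        # NOTE: call torch.cuda.synchronize() here for real GPU timings
        self._phase = phase
        self._start = time.perf_counter()

    def stop(self):
        self.durations[self._phase].append(time.perf_counter() - self._start)

    def sudden_jump(self, phase, factor=2.0, warmup=3):
        """Return the first step index where `phase` takes `factor`x longer
        than the median of all preceding steps, or None if no jump occurs."""
        xs = self.durations[phase]
        for i in range(warmup, len(xs)):
            prev = sorted(xs[:i])
            median = prev[len(prev) // 2]
            if xs[i] > factor * median:
                return i
        return None
```

Feeding it synthetic timings where the backward phase suddenly doubles flags exactly the step where the jump happens, which is how I confirmed the change is abrupt rather than incremental.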
Of course, the utilization drop affects the training time, taking it from fast to un-trainable.
Any clue how I can start investigating this issue?