Utilization of one GPU suddenly drops

Hi, I'm training on 4 GPUs in a DDP manner (both PyTorch DDP and the Accelerate package).
During training I see a peculiar phenomenon: after N training steps, the utilization of one of the GPUs drops out of nowhere; look at the bottom-left plot.

It always happens around the same step of the training, but there is nothing unique about that step or the ones before it.
I ran some more experiments, and it seems that the data-loading time and the forward time stay the same, but the backward time gets longer.
The weird thing is that I'm not seeing any gradual increase or decrease, but a sudden jump.
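To confirm measurements like this, it helps to time each phase per step and log them per rank. A minimal sketch of such a timing helper (the loop bodies are hypothetical stand-ins for the real data/forward/backward steps; note that on GPU you must synchronize before reading the clock, because CUDA launches return before the work finishes):

```python
import time

def timed(fn, *args, sync=None):
    """Run fn(*args) and return (result, elapsed_seconds).

    On GPU, pass sync=torch.cuda.synchronize so the timer waits for all
    queued kernels; without it you only measure the launch overhead.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start

# Hypothetical stand-ins for the real phases of one training step:
batch, t_data = timed(lambda: list(range(1000)))   # next(dataloader_iter)
out, t_fwd = timed(lambda b: sum(b), batch)        # model(batch); loss
_, t_bwd = timed(lambda: None)                     # loss.backward()
print(f"data={t_data:.6f}s fwd={t_fwd:.6f}s bwd={t_bwd:.6f}s")
```

Logging these three numbers per rank at every step should show exactly at which step, and on which rank, the backward time jumps.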

Of course, the utilization drop affects the training time, taking it from fast to un-trainable.

Any clue how I can start investigating this issue?

You probably checked, but could it be heat?
Old story: when we wrote the book, one of us wondered why his training was slower with 2 GPUs than with one. It turned out that with 2 GPUs crunching data, thermal throttling would kick in.
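To rule this out, `nvidia-smi` can report per-GPU temperature, SM clock, and the active throttle-reason bitmask. A small sketch that parses its CSV output (the sample output below is made up for illustration; a nonzero mask on one GPU during the slow phase would point at throttling):

```python
import csv
import io

# Query to run on the training box (requires an NVIDIA driver):
#   nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,clocks_throttle_reasons.active \
#              --format=csv,noheader,nounits

def parse_gpu_status(csv_text):
    """Parse nvidia-smi CSV rows into (index, temp_C, sm_clock_MHz, throttle_mask)."""
    rows = []
    for idx, temp, clock, reasons in csv.reader(io.StringIO(csv_text)):
        rows.append((int(idx), int(temp), int(clock), int(reasons.strip(), 16)))
    return rows

# Made-up sample: GPU 2 is hot, down-clocked, and reports active throttle bits.
sample = ("0, 61, 1410, 0x0000000000000000\n"
          "2, 88, 705, 0x0000000000000060\n")
for idx, temp, clock, mask in parse_gpu_status(sample):
    state = "THROTTLED" if mask else "ok"
    print(f"GPU {idx}: {temp}C, SM {clock} MHz -> {state}")
```

Watching this alongside the utilization plots (e.g. in a loop with `watch`) would show whether the drop coincides with a temperature spike and a clock drop on that one GPU.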

Best regards