Hi, I’m rather new to DDP (DistributedDataParallel), and I’ve found some rather bizarre behavior in my training with DDP.
Say I have 4 GPUs in total. If I run my code on GPU:0 and GPU:1 and leave the remaining two unoccupied, then during training the utilization of both GPUs stays at 50% or lower. The GPU occupation is:
gpu0: process1 (50% or lower)
gpu1: process1 (50% or lower)
But when I start another training job on the remaining two GPUs, all 4 cards hit ~100% utilization, and both processes run faster than in the previous situation. Now the GPU occupation is:
gpu0: process1 (99%)
gpu1: process1 (99%)
gpu2: process2 (99%)
gpu3: process2 (99%)
I’ve experienced this on multiple servers and it’s really confusing me. Can anyone help explain this?