Distributed Data Parallel runs faster when all the cards are occupied

Hi, I’m rather new to DDP, and I’ve run into some bizarre behavior while training with it.

Say I have 4 GPUs in total. If I run my code on GPU:0 and GPU:1 and leave the remaining two unoccupied, then during training the utilization of both GPUs stays at 50% or below. The GPU occupation looks like this:

gpu0: process1 (50% or lower)
gpu1: process1 (50% or lower)
gpu2: empty
gpu3: empty

But when I run another training job on the remaining two GPUs, all 4 cards hit 100% usage and both processes run faster than before. Now the GPU occupation is:

gpu0: process1 (99%)
gpu1: process1 (99%)
gpu2: process2 (99%)
gpu3: process2 (99%)

I’ve experienced this on multiple servers and it’s really confusing me. Can anyone help with this?

How about the per-iteration latency? If you feed the same batch size to each DDP instance (i.e., a different global batch size), is using 4 GPUs still faster? Please use the elapsed_time API to measure that.
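For concreteness, here is a minimal sketch of timing each iteration with CUDA events via elapsed_time. The toy model, data, and two-process launch are placeholders standing in for the real training job, not code from this thread:

```python
# latency_sketch.py -- launch with: torchrun --nproc_per_node=2 latency_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    # Toy model and data stand in for the real training job.
    model = DDP(nn.Linear(1024, 1024).to(device), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    x = torch.randn(64, 1024, device=device)
    y = torch.randn(64, 1024, device=device)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for step in range(20):
        start.record()
        loss = loss_fn(model(x), y)
        loss.backward()            # gradient all_reduce happens during backward
        optimizer.step()
        optimizer.zero_grad()
        end.record()

        torch.cuda.synchronize()   # make sure both events have completed
        if rank == 0:
            print(f"step {step}: {start.elapsed_time(end):.2f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Comparing the per-step numbers with the other GPUs idle vs. busy should show whether the gap is real compute time or just how utilization is reported.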

As DDP uses all_reduce to communicate gradients, GPU utilization cannot faithfully represent how busy a GPU really is. CUDA would report 100% GPU utilization even if one GPU is blocked waiting for a peer to join the collective and doing no useful work.
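A minimal sketch of that effect (not from the thread; it assumes the NCCL backend and two GPUs): if one rank is delayed before joining a collective, the other rank’s GPU typically still shows ~100% utilization in nvidia-smi, because the NCCL kernel spins on the device while waiting for its peer.

```python
# blocking_sketch.py -- launch with: torchrun --nproc_per_node=2 blocking_sketch.py
import time
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    t = torch.ones(1 << 20, device=f"cuda:{rank}")

    if rank == 1:
        time.sleep(30)  # simulate a straggler peer; watch nvidia-smi on rank 0 meanwhile

    dist.all_reduce(t)          # enqueues the NCCL kernel; it waits until all ranks join
    torch.cuda.synchronize()    # host blocks here until the collective finishes
    print(f"rank {rank} done, first element = {t[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```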

Thanks for your reply!

I didn’t use elapsed_time to measure how long one iteration takes, but I did use time.time() to calculate the iteration latency. As long as even one GPU remains unused, the process runs slower (sometimes 50% slower, sometimes 100% slower).

The weird thing is that my DDP program seems to be influenced by other programs (not necessarily DDP) running on the other GPUs, in a counterintuitive way: normally we expect programs to compete for computational resources, but here more programs bring better performance…

(This didn’t happen to my colleague’s non-DDP program, so I guess it must have something to do with DDP.)