Hi, I'm rather new to DDP, and I've found some rather bizarre behavior in my training with DDP.
Say I have 4 GPUs in total. If I run my code on GPU 0 and GPU 1 and leave the remaining two unoccupied, then during training the utilization of both GPUs stays at 50% at most. The GPU occupation is:
gpu0: process1(50% or lower)
gpu1: process1(50% or lower)
gpu2: empty
gpu3: empty
But when I run another training job on the remaining two GPUs, all 4 cards hit 100% usage and both processes run faster than in the previous situation. Now the GPU occupation is:
gpu0: process1(100%)
gpu1: process1(100%)
gpu2: process2(100%)
gpu3: process2(100%)
How about the per-iteration latency? If you feed the same batch size to each DDP instance (i.e., a different global batch size), is using 4 GPUs still faster? Please use the elapsed_time API of torch.cuda.Event to measure that.
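For reference, here is a minimal sketch of what that measurement could look like; the nn.Linear model and random data are just stand-ins for your real training loop:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)   # stand-in for the real DDP-wrapped model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(64, 1024, device=device)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for _ in range(10):
    start.record()
    loss = model(data).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    end.record()
    # elapsed_time is only valid once both events have completed on the GPU
    torch.cuda.synchronize()
    print(f"iteration latency: {start.elapsed_time(end):.2f} ms")
```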
Since DDP uses all_reduce to communicate gradients, GPU utilization cannot faithfully represent how busy a GPU really is: CUDA reports 100% utilization even when one GPU is just blocked waiting for a peer to join the collective and is doing no useful work.
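You can see this with a small two-process sketch (assuming two free GPUs and the NCCL backend): rank 1 sleeps before joining the second all_reduce, yet nvidia-smi shows GPU 0 near 100% the whole time, because the NCCL kernel spin-waits on the device rather than idling:

```python
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)          # warm-up: lazily creates the NCCL communicator
    torch.cuda.synchronize()
    if rank == 1:
        time.sleep(30)          # rank 1 is late to the party
    # Rank 0's NCCL kernel launches immediately and busy-waits for rank 1;
    # watch nvidia-smi during these 30 s: GPU 0 reports ~100% utilization.
    dist.all_reduce(t)
    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```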
I didn't use elapsed_time to measure how long one iteration takes, but I did use time.time() to calculate the iteration latency. As long as at least one GPU remains unused, the process runs slower (sometimes 50% slower, sometimes 100% slower).
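One caveat when interpreting those numbers: time.time() only measures wall-clock time on the host, so CUDA's asynchronous kernel launches can make it misleading unless the measured region is bracketed with torch.cuda.synchronize(). A sketch of the pattern (the matmul is a stand-in for one training iteration):

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()   # drain any pending GPU work first
t0 = time.time()
y = x @ x                  # stand-in for one training iteration
torch.cuda.synchronize()   # wait for the iteration's kernels to finish
print(f"latency: {time.time() - t0:.4f} s")
```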
The weird thing is that my DDP program seems to be influenced by other programs (not necessarily DDP ones) running on the other GPUs, in a counterintuitive way: normally we would expect programs to compete for computational resources, but here more programs bring better performance…
(This didn't happen to my colleague's non-DDP program, so I guess it must have something to do with DDP.)