DDP on 2 GPUs and single GPU have different loss

In my case, DDP performs worse than single-GPU training. I suspect the biggest difference between the two is the data being fed into the network, since it is not in the same order. So I made sure the data fed into the network was exactly the same at the first iteration, but the loss is still different. Can someone please tell me why there is a training difference between DDP and a single GPU?

Hey @sqiangcao

DDP’s loss is local to each process’s input; it keeps the model replicas in sync by averaging gradients across all processes. So in each iteration every process will see a different loss, and depending on the loss function you might need to tune things like the learning rate when moving from local training to DDP.
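A minimal sketch of what this means in practice, assuming 2 GPUs, a toy `nn.Linear` model, and random data (none of these are from your setup): each rank computes its own local loss on its own shard, DDP averages only the gradients during `backward()`, and if you want a number comparable to the single-GPU loss you have to average the local losses across ranks yourself (for logging only).

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

    torch.manual_seed(0)                      # same init on every rank
    model = nn.Linear(10, 1).to(device)       # placeholder toy model
    ddp_model = DDP(model, device_ids=[device.index])
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Each rank gets a *different* shard of the global batch, so the local
    # loss below differs across ranks (and from the single-GPU loss).
    torch.manual_seed(rank)
    x = torch.randn(16, 10, device=device)
    y = torch.randn(16, 1, device=device)

    loss = loss_fn(ddp_model(x), y)           # local loss of this rank only
    loss.backward()                           # DDP averages gradients here
    opt.step()

    # For logging/comparison only: average the local losses across ranks.
    with torch.no_grad():
        global_loss = loss.detach().clone()
        dist.all_reduce(global_loss, op=dist.ReduceOp.SUM)
        global_loss /= dist.get_world_size()
    print(f"rank {rank}: local loss {loss.item():.4f}, "
          f"rank-averaged loss {global_loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=2 script.py`, the local losses printed by the two ranks will differ while the rank-averaged value is the one to compare against single-GPU training on the same global batch.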

See more discussions here: Average loss in DP and DDP