DDP on 2 GPUs and single GPU have different loss

Hey @sqiangcao

DDP computes the loss locally on each process's own input; it keeps the model replicas in sync by averaging gradients across all processes, not by averaging the loss itself. So in each iteration, each process will see a different loss value, and depending on the loss function you may need to retune things like the learning rate when moving from local training to DDP.
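Here's a minimal sketch of that behavior (the toy model, random data, and hyperparameters are all placeholders, not your code): each rank computes its own local loss, DDP averages the gradients during `backward()`, and if you want a loss value comparable to single-GPU training you have to all-reduce it yourself.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK / WORLD_SIZE / etc.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = DDP(torch.nn.Linear(10, 1).to(device), device_ids=[device.index])
    loss_fn = torch.nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Each process sees a different shard of data, so its local loss differs.
    x = torch.randn(32, 10, device=device)
    y = torch.randn(32, 1, device=device)

    opt.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)   # local loss: differs across ranks
    loss.backward()               # DDP averages gradients across ranks here
    opt.step()

    # To log a loss comparable to single-GPU training, average it manually.
    global_loss = loss.detach().clone()
    dist.all_reduce(global_loss, op=dist.ReduceOp.SUM)
    global_loss /= dist.get_world_size()
    if rank == 0:
        print(f"local loss (rank 0): {loss.item():.4f}, "
              f"averaged loss: {global_loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run it with e.g. `torchrun --nproc_per_node=2 script.py`. Note that the manual all-reduce is only for logging; the parameter updates are already consistent across ranks because the gradients were averaged.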

See more discussion here: Average loss in DP and DDP