Hey @111344
If each DDP (DistributedDataParallel) process is using the same batch size as you passed to DataParallel, then I think you need to divide the reduced loss by `world_size`. Otherwise, you are summing together losses from `world_size` batches.
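
Here is a minimal sketch of what I mean, assuming you are all-reducing the loss yourself (e.g. for logging); the helper name `reduce_mean` is just illustrative:

```python
import torch
import torch.distributed as dist

def reduce_mean(loss: torch.Tensor) -> torch.Tensor:
    """All-reduce the loss across ranks and divide by world_size,
    so the reported value is the mean over processes, not the sum."""
    world_size = dist.get_world_size()
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)  # sum of per-rank losses
    reduced /= world_size                           # turn the sum into a mean
    return reduced
```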
Another thing is that the batch size and learning rate might need to change when switching to DDP (there is a rough sketch of the batch-size split after the links). Check out the discussions below:
- Should we split batch_size according to ngpu_per_node when DistributedDataparallel
- Is average the correct way for the gradient in DistributedDataParallel with multi nodes?
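
And a rough sketch of how the per-GPU batch size is often derived so that the effective (global) batch size stays the same as with DataParallel; `dataset` and `global_batch_size` here are placeholders:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

global_batch_size = 256                                # what you fed to DataParallel
world_size = dist.get_world_size()
per_gpu_batch_size = global_batch_size // world_size   # keep the effective batch the same

sampler = DistributedSampler(dataset)                  # each rank sees a distinct shard
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)
```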
And this briefly explains how DDP works: https://pytorch.org/docs/master/notes/ddp.html