How can I use DistributedDataParallel instead of DataParallel?

Hey @111344

If each DDP (DistributedDataParallel) process is using the same batch size as you passed to DataParallel, then I think you need to divide the reduced loss by world_size. Otherwise, you are summing together losses from world_size batches.
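A minimal sketch of that averaging step (not from the original post, and assuming the process group is already initialized and `loss` is a scalar tensor on each rank):

```python
import torch
import torch.distributed as dist

def global_average_loss(loss: torch.Tensor) -> torch.Tensor:
    """All-reduce the per-process loss and divide by world_size."""
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)  # sums losses from all ranks
    reduced /= dist.get_world_size()                # average instead of sum
    return reduced
```

Note this is only for logging/reporting; DDP already averages gradients across ranks during `backward()`, so you don't need to reduce the loss before calling it.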

Another thing is that the batch size and learning rate might need to change when you switch to DDP. Check out the discussions below (there is also a short sketch after the list):

  1. Should we split batch_size according to ngpu_per_node when DistributedDataparallel
  2. Is average the correct way for the gradient in DistributedDataParallel with multi nodes?
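Here is a hedged sketch of the common setup those threads discuss: keep the global batch size fixed by giving each rank a slice of it via DistributedSampler. The names `GLOBAL_BATCH_SIZE`, `BASE_LR`, `dataset`, and `model` are placeholders, not something from the original discussion.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

GLOBAL_BATCH_SIZE = 256   # what you used with DataParallel
BASE_LR = 0.1             # learning rate tuned for that global batch

world_size = dist.get_world_size()
per_gpu_batch = GLOBAL_BATCH_SIZE // world_size   # each rank sees a slice of the batch

sampler = DistributedSampler(dataset)             # shards the dataset across ranks
loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR)
```

If you instead keep the per-GPU batch size equal to the old global batch size, the effective batch grows by world_size, and a common heuristic (not a hard rule) is to scale the learning rate up accordingly.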

And this briefly explains how DDP works: https://pytorch.org/docs/master/notes/ddp.html
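For reference, a minimal, self-contained DDP training sketch, assuming a single node, the nccl backend, and launching with `torchrun --nproc_per_node=NUM_GPUS script.py` (which sets `LOCAL_RANK`); the tiny model and random data are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):
        inputs = torch.randn(32, 10, device=local_rank)
        targets = torch.randn(32, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()     # DDP averages gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```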