DistributedDataParallel training not efficient

When training with 4 GPUs, each epoch is 4x shorter (because each GPU works with only 1/4 of the dataset), but we need 4x as many epochs to get the same result.

With DataParallel this issue doesn’t occur.

Putting these two together, it looks like the loss function might play a role here. With DDP, gradient synchronization only occurs during the backward pass, after the loss has been computed, which means each process/GPU computes the loss independently on its local input split. In contrast, DataParallel does not have this problem: the forward outputs are first gathered onto a single device, and the loss is then computed over all of the input data in that iteration. Will this make a difference in your application?
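To make the difference concrete, here is a minimal sketch of where the loss ends up being computed in each case. The model, data, and single-rank gloo setup are made up purely for illustration:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist

# Start a single-rank process group with the gloo backend so this also runs on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

criterion = nn.MSELoss()

# DDP: each rank sees only its local shard, so this loss is computed over
# 1/world_size of the global batch; only the gradients are averaged across
# ranks, and that happens inside backward().
ddp_model = nn.parallel.DistributedDataParallel(nn.Linear(10, 2))
local_inputs, local_targets = torch.randn(8, 10), torch.randn(8, 2)
local_loss = criterion(ddp_model(local_inputs), local_targets)
local_loss.backward()  # gradient all-reduce happens here

# DataParallel (for contrast): with GPUs available it scatters the batch to
# replicas and gathers the outputs back onto the default device, so the loss
# is computed once over the full batch (on a CPU-only machine it simply falls
# back to running the plain module).
dp_model = nn.DataParallel(nn.Linear(10, 2))
full_inputs, full_targets = torch.randn(32, 10), torch.randn(32, 2)
full_loss = criterion(dp_model(full_inputs), full_targets)
full_loss.backward()

dist.destroy_process_group()
```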

Another thing is that, when switching from single-GPU to DDP-based multi-GPU training, you might need to tune configs such as the learning rate to get the best result. See the discussion in this post: Should we split batch_size according to ngpu_per_node when DistributedDataparallel
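As a rough illustration (this is just one common heuristic, not something the linked thread prescribes), the learning rate is often scaled with the number of ranks when the per-GPU batch size is kept fixed, since the effective global batch size grows with world size:

```python
import os
import torch
import torch.nn as nn

# torchrun / torch.distributed.launch set WORLD_SIZE for each process.
world_size = int(os.environ.get("WORLD_SIZE", "1"))

base_lr = 0.01                    # hypothetical single-GPU learning rate
scaled_lr = base_lr * world_size  # linear scaling rule; treat it as a starting point, not a rule

model = nn.Linear(10, 2)          # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)
```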