Properly implementing DDP in a training loop with cleanup and barrier, and its expected output

@rvarm1 - thanks for opening up the GitHub issue on this!

Before I forget: the solution, net.to(f'cuda:{args.local_rank}'), can be found here, in case the context of that fix across posts becomes relevant later.
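
In case it helps to have it inline, this is roughly how I'm applying that fix in my own script (a minimal sketch with a placeholder model; reading LOCAL_RANK from the environment is an assumption on my part, since a launcher like torchrun sets it, whereas my actual script still uses args.local_rank):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch of the fix: one process per GPU, model moved to this rank's device
# before wrapping with DDP. The tiny Linear model is just a placeholder.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])     # set by the launcher (torchrun)
torch.cuda.set_device(local_rank)

net = nn.Linear(10, 10).to(f"cuda:{local_rank}")   # move the model to this rank's GPU first
net = DDP(net, device_ids=[local_rank])            # then wrap it with DDP
```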

Regarding the first paragraph of your response (“Regarding speedup…not linear due to the communication overhead”), I’m still a little confused. I understand that DDP aims “to speed up training by distributing the original dataset across workers,” but I’m still not sure whether the following output is expected:

Running basic DDP example on rank 0.
Running basic DDP example on rank 1.
Running basic DDP example on rank 2.
Running basic DDP example on rank 3.
[1,  2000] loss: 2.181
[1,  2000] loss: 2.187
[1,  2000] loss: 2.180
[1,  2000] loss: 2.187
[2,  2000] loss: 1.740
[2,  2000] loss: 1.737
[2,  2000] loss: 1.739
Finished Training
[2,  2000] loss: 1.737
Finished Training
Finished Training
Finished Training
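
For context, the loop producing this output is essentially the CIFAR-10 tutorial loop run independently on every rank. Here’s a trimmed-down sketch of it (net and local_rank come from the setup above; trainloader, criterion, and optimizer stand in for my real DataLoader, loss, and optimizer):

```python
# Trimmed-down sketch of my per-rank training loop.
for epoch in range(2):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader):
        inputs = inputs.to(f"cuda:{local_rank}")
        labels = labels.to(f"cuda:{local_rank}")

        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()                    # DDP all-reduces (averages) gradients here
        optimizer.step()

        running_loss += loss.item()        # this running loss is local to the rank
        if i % 2000 == 1999:               # every rank prints its own running loss
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

dist.barrier()                             # wait for every rank to finish
dist.destroy_process_group()               # cleanup
```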

While I understand that each GPU handles its own input (and therefore its own output), does this indeed mean that we have 4 separate losses? If so, how does PyTorch handle each loss and form a “final loss” after each minibatch?
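
Put differently: if there is no single “final loss” and each rank only ever sees its own, would I have to reduce the losses myself just to log one number, along the lines of this sketch (names are mine, not from the docs)?

```python
# Sketch: average this iteration's loss across the ranks purely for logging.
# DDP already averages *gradients* during backward; this extra all_reduce
# would only give me one number to print.
loss_for_logging = loss.detach().clone()
dist.all_reduce(loss_for_logging, op=dist.ReduceOp.SUM)
loss_for_logging /= dist.get_world_size()
if dist.get_rank() == 0:
    print(f"mean loss across ranks: {loss_for_logging.item():.3f}")
```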

Also, I know that PyTorch averages the loss across each minibatch, but we’re not dealing with plain minibatches here (in that scenario each GPU would compute its own minibatch); instead, our inputs are distributed evenly across the GPUs, so data from one minibatch may end up on 2 different GPUs. Does DDP automatically take care of this case, so that the iterations and the resulting loss are communicated and calculated correctly? Or is my understanding incorrect?
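
For concreteness, I’m imagining the data split being handled by something like DistributedSampler, as in the sketch below (CIFAR-10 and batch size 4 are just placeholders to match the tutorial). Is the sampler what guarantees that each rank builds its minibatches only from its own shard, so no single minibatch is split across GPUs?

```python
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Sketch: DistributedSampler gives each rank a disjoint slice of the dataset,
# so every rank forms minibatches only from its own shard.
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
sampler = DistributedSampler(trainset)          # partitions indices by rank
trainloader = DataLoader(trainset, batch_size=4, sampler=sampler, num_workers=2)

for epoch in range(2):
    sampler.set_epoch(epoch)                    # reshuffle the shards each epoch
    ...                                         # per-rank training loop as above
```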