I am new to DDP. I am collecting the training losses from all the GPUs and reducing them on gpu_id=0 with op=torch.distributed.ReduceOp.AVG. What happens in the last iteration of an epoch, when only a few GPUs (not all of them) have batches? How will torch.distributed.ReduceOp.AVG behave?
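For reference, here is a minimal sketch of the reduction step I mean (the process group is assumed to be already initialized with the NCCL backend, e.g. via torchrun; the helper and tensor names are just illustrative):

```python
import torch
import torch.distributed as dist

def reduce_loss_to_rank0(loss: torch.Tensor) -> torch.Tensor:
    """Average the per-rank scalar loss onto rank 0 with ReduceOp.AVG."""
    loss_avg = loss.detach().clone()
    dist.reduce(loss_avg, dst=0, op=dist.ReduceOp.AVG)
    return loss_avg  # only meaningful on rank 0

# Inside the training loop (loss is the per-rank scalar from my criterion):
# loss_avg = reduce_loss_to_rank0(loss)
# if dist.get_rank() == 0:
#     print(f"avg loss: {loss_avg.item():.4f}")
```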
Ex: Assume that in the last iteration of the epoch, only 2 out of 8 GPUs receive data from the DataLoader. Will torch.distributed.ReduceOp.AVG average the loss tensors on those 2 GPUs only, as it should? Or will it assume the other 6 GPUs have zero loss and divide the sum by 8? Remember that those 6 GPUs should not receive a batch in the last iteration, since the DistributedSampler simply has no more training data to hand out.
The same question applies to the validation DataLoader.
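For completeness, both loaders are set up roughly like this (train_dataset and val_dataset are placeholders for my datasets; drop_last is left at its default of False, which is why I expect a partial final iteration):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# One process per GPU; each sampler serves that rank's shard of the data.
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=train_sampler)

val_sampler = DistributedSampler(val_dataset, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=32, sampler=val_sampler)
```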