torch.distributed.reduce(op=torch.distributed.ReduceOp.AVG) behavior when some GPUs have input batches while others do not

I am new to DDP. I am collecting the training losses from all the GPUs and reducing them at gpu_id=0 with op=torch.distributed.ReduceOp.AVG. What happens in the last iteration, when only some GPUs (not all of them) have batches? How will torch.distributed.ReduceOp.AVG behave?

Ex: Assume that in the last iteration of the epoch, only 2 out of 8 GPUs receive data from the DataLoader. Will torch.distributed.ReduceOp.AVG average the loss tensors on those 2 GPUs only, as it should? Or will it assume the other 6 GPUs have zero losses and divide the sum by 8? Remember that those 6 GPUs would not receive a batch in the last iteration from the DistributedSampler, as there is simply no more training data.

The same question applies to the validation DataLoader.
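To make the arithmetic concrete, here is a minimal plain-Python sketch (no torch, since an actual uneven `reduce` call would not complete) of what `ReduceOp.AVG` computes: it always sums the contribution from every rank in the group and divides by the full world size. It has no notion of "ranks that had data". The function name `simulated_reduce_avg` is a hypothetical helper for illustration, not a PyTorch API.

```python
# Hypothetical simulation of ReduceOp.AVG semantics: sum over ALL ranks
# in the group, divide by the full world size. The collective cannot
# distinguish "rank had no batch" from "rank contributed this value".
world_size = 8

def simulated_reduce_avg(per_rank_losses):
    """Mimic torch.distributed.reduce(..., op=ReduceOp.AVG):
    every rank must participate; the divisor is always world_size."""
    assert len(per_rank_losses) == world_size, "every rank must participate"
    return sum(per_rank_losses) / world_size

# If only 2 ranks had real losses and the other 6 (incorrectly)
# contributed 0.0, the average is diluted by the empty ranks:
losses = [0.8, 0.6] + [0.0] * 6
print(simulated_reduce_avg(losses))  # 0.175, not (0.8 + 0.6) / 2 = 0.7
```

In practice the situation is worse than a diluted average: if the 6 empty ranks never call `reduce` at all, the 2 ranks that did call it block waiting for them.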


I don’t think this is a valid use case, as the DistributedSampler explicitly ensures each rank receives the same amount of data, as seen here and here.

If you are somehow forcing this uneven behavior, I would expect to see a hang.
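The equalizing behavior can be sketched with the padding arithmetic that `DistributedSampler` uses when `drop_last=False`: the index list is extended by repeating indices from the start until it divides evenly by the number of replicas, so every rank gets the same number of samples. This is a simplified sketch (it omits shuffling and the `drop_last=True` path), and `padded_indices_for_rank` is a hypothetical helper, not the sampler's actual method.

```python
# Minimal sketch of DistributedSampler's padding (drop_last=False):
# pad the index list so every rank gets an equal, strided slice.
import math

def padded_indices_for_rank(dataset_len, num_replicas, rank):
    # hypothetical helper mirroring the sampler's padding arithmetic
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # repeat indices from the start until the list divides evenly
    indices += indices[: total_size - dataset_len]
    # each rank takes a strided slice: rank, rank + num_replicas, ...
    return indices[rank:total_size:num_replicas]

# 10 samples across 8 ranks -> every rank still gets 2 samples
# (some indices are repeated), so no rank runs out of batches early:
for r in range(8):
    print(r, padded_indices_for_rank(10, 8, r))
```

This is why, with the stock sampler, the "only 2 of 8 GPUs have a batch" scenario does not arise: the last iteration is padded with duplicate samples rather than left uneven.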


Thank you. I am beginning to understand 🙂