I am new to DDP. I am collecting the training losses from all the GPUs and reducing them on gpu_id=0 with op=torch.distributed.ReduceOp.AVG. What happens in the last iteration of an epoch, when only a few GPUs (not all of them) have batches? How will torch.distributed.ReduceOp.AVG behave?
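For reference, here is a minimal sketch of the reduction step I mean (the process group is assumed to be already initialized with the NCCL backend, e.g. via torchrun; the helper and tensor names are just illustrative):

```python
import torch
import torch.distributed as dist

def reduce_loss_to_rank0(loss: torch.Tensor) -> torch.Tensor:
    """Average the per-rank scalar loss onto rank 0 with ReduceOp.AVG."""
    loss_avg = loss.detach().clone()
    dist.reduce(loss_avg, dst=0, op=dist.ReduceOp.AVG)
    return loss_avg  # only meaningful on rank 0

# Inside the training loop (loss is the per-rank scalar from my criterion):
# loss_avg = reduce_loss_to_rank0(loss)
# if dist.get_rank() == 0:
#     print(f"avg loss: {loss_avg.item():.4f}")
```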
Ex: Assume that in the last iteration of the epoch, only 2 out of 8 GPUs receive data from the DataLoader. Will torch.distributed.ReduceOp.AVG average the loss tensors on those 2 GPUs only, as it should? Or will it assume the other 6 GPUs have zero loss and divide the sum by 8? Remember that those 6 GPUs should not receive a batch in the last iteration, since the DistributedSampler simply has no more training data to hand out.
The same question applies to the validation DataLoader.
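For completeness, both loaders are set up roughly like this (train_dataset and val_dataset are placeholders for my datasets; drop_last is left at its default of False, which is why I expect a partial final iteration):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# One process per GPU; each sampler serves that rank's shard of the data.
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=train_sampler)

val_sampler = DistributedSampler(val_dataset, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=32, sampler=val_sampler)
```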