It looks like DistributedSampler ensures all ranks get the same number of batches, at least when drop_last is True. So I believe we don't need DDP Join in that setting to avoid hangs in allreduce. Is my understanding correct?
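To make the question concrete, here is a small sketch of the per-rank length math that DistributedSampler performs (this mirrors its documented behavior but is written for illustration, not copied from the PyTorch source): with drop_last=True the tail of the dataset is dropped so every rank gets the same sample count, which is why every rank would run the same number of backward passes and allreduce calls.

```python
import math

def per_rank_num_samples(dataset_len: int, num_replicas: int, drop_last: bool) -> int:
    # Sketch of DistributedSampler's length computation (illustrative, not
    # the actual PyTorch source).
    if drop_last and dataset_len % num_replicas != 0:
        # Drop the uneven tail so the data splits evenly across ranks.
        return math.ceil((dataset_len - num_replicas) / num_replicas)
    # Otherwise pad (repeat some samples) so ranks still get equal counts.
    return math.ceil(dataset_len / num_replicas)

# With drop_last=True every rank sees the same number of samples, so the
# gradient allreduces stay in lockstep without needing Join.
for n in (10, 11, 17):
    counts = [per_rank_num_samples(n, num_replicas=4, drop_last=True)
              for _rank in range(4)]
    print(n, counts)  # all entries in each list are equal
```

Note that the hang scenario Join is designed for arises when ranks genuinely process different numbers of batches (e.g., without DistributedSampler, or with custom per-rank input pipelines).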