I’m trying to ensure consistency between two ways of creating a DataLoader instance:
- I pass a DistributedSampler to the sampler argument of the DataLoader:
dataloader = DataLoader(..., sampler=DistributedSampler(...))
- I instantiate the DataLoader without an explicit sampler:
dataloader = DataLoader(..., sampler=None)
The consistency I’m referring to concerns the case where my dataset size is not divisible by the batch size. The documentation specifies that I can handle this through the drop_last argument. That way, using either:
dataloader = DataLoader(..., sampler=DistributedSampler(..., shuffle=False, drop_last=True))
dataloader = DataLoader(..., shuffle=False, drop_last=True, sampler=None)
should ensure that I iterate over the batches identically (i.e., in the same order), with every batch having the same size.
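To make the comparison concrete, here is the minimal check I’m running (a sketch, not a definitive test: it assumes a single-process setup, passing num_replicas=1 and rank=0 explicitly so that no process group needs to be initialized, and uses a toy 10-sample dataset with batch_size=3 of my own choosing so that the last batch is uneven):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset: 10 samples with batch_size=3, so the tail batch is uneven.
dataset = TensorDataset(torch.arange(10))

# Case 1: the DistributedSampler handles shuffling and dropping.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0,
                             shuffle=False, drop_last=True)
case1 = DataLoader(dataset, batch_size=3, sampler=sampler)

# Case 2: the DataLoader itself handles shuffling and dropping.
case2 = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=True,
                   sampler=None)

batches1 = [b for (b,) in case1]
batches2 = [b for (b,) in case2]
print(len(batches1), len(batches2))   # same number of batches?
for b1, b2 in zip(batches1, batches2):
    print(torch.equal(b1, b2))        # same content, in the same order?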
I have several questions related to my case:
- Regarding the way I handle the drop_last argument, is my conclusion correct? (The sketch above is the check I’m running.)
- Both DataLoader and DistributedSampler have shuffle and drop_last arguments. When used in conjunction, which ones take priority?
- The documentation of both DataLoader and DistributedSampler states:
drop_last (bool, optional) – if True, then the sampler will drop the tail of the data to make it evenly divisible across the number of replicas. If False, the sampler will add extra indices to make the data evenly divisible across the replicas. Default: False.
I wonder how the extra indices are added (I assume through uniform sampling with replacement over the whole set of indices?). The inspection sketch after this list shows how I’ve been trying to find out.
- Is there any way to ensure consistency between my two cases if I don’t want to drop the last, unevenly sized batch (i.e., use drop_last=False)? I guess I would need to control how the extra indices are added; is that possible? (See the subclass sketch after this list for what I have in mind.)
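For the last two points, this is how I’ve been trying to inspect the padding myself: listing each replica’s indices with drop_last=False shows directly which extra indices get added (again a sketch; the 10-sample / 3-replica split is an arbitrary choice of mine that forces 2 padding indices):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 10 samples over 3 replicas: 2 extra indices are needed for an even split.
dataset = TensorDataset(torch.arange(10))
for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank,
                                 shuffle=False, drop_last=False)
    print(rank, list(sampler))  # which indices does each replica get, and which repeat?

And if controlling the padding is the way to go, I imagine something like the subclass below. This is purely hypothetical on my part: it assumes the shuffle, seed, epoch, drop_last, total_size, rank, and num_replicas attributes that DistributedSampler.__init__ sets in recent PyTorch versions, and, for simplicity, that the required padding never exceeds the dataset length.

import torch
from torch.utils.data.distributed import DistributedSampler

class PaddedDistributedSampler(DistributedSampler):
    # Hypothetical: like DistributedSampler, except that the rule for
    # the extra padding indices is spelled out and easy to change.
    def __iter__(self):
        if self.shuffle:
            # Deterministic shuffle, reseeded each epoch.
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))
        if not self.drop_last:
            # My padding rule of choice: repeat the first indices
            # (assumes padding <= len(dataset)).
            padding = self.total_size - len(indices)
            indices += indices[:padding]
        else:
            # Drop the tail so the total is evenly divisible.
            indices = indices[:self.total_size]
        # Each replica takes a strided slice of the global index list.
        return iter(indices[self.rank:self.total_size:self.num_replicas])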
If some of these points have already been addressed, I apologize; please redirect me to the relevant topics. Thanks in advance.