I’m trying to ensure consistency between two ways of creating a DataLoader instance:
- I pass a DistributedSampler to the sampler argument of the DataLoader:
dataloader = DataLoader(..., sampler=DistributedSampler(...))
- I instantiate the DataLoader without an explicit sampler:
dataloader = DataLoader(..., sampler=None)
The consistency I’m referring to concerns the case where my dataset size is not divisible by the batch size. The documentation specifies that I can handle this through the drop_last argument. That way, using either:
dataloader = DataLoader(..., sampler=DistributedSampler(..., shuffle=False, drop_last=True))
dataloader = DataLoader(..., shuffle=False, drop_last=True, sampler=None)
should ensure that I iterate over the batches identically (i.e., in the same order), with every batch having the same size.
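To make the comparison concrete, here is the minimal check I’m running (a sketch, not a definitive test: it assumes a single-process setup, passing num_replicas=1 and rank=0 explicitly so that no process group needs to be initialized, and uses a toy 10-sample dataset with batch_size=3 of my own choosing so that the last batch is uneven):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset: 10 samples with batch_size=3, so the tail batch is uneven.
dataset = TensorDataset(torch.arange(10))

# Case 1: the DistributedSampler handles shuffling and dropping.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0,
                             shuffle=False, drop_last=True)
case1 = DataLoader(dataset, batch_size=3, sampler=sampler)

# Case 2: the DataLoader itself handles shuffling and dropping.
case2 = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=True,
                   sampler=None)

batches1 = [b for (b,) in case1]
batches2 = [b for (b,) in case2]
print(len(batches1), len(batches2))   # same number of batches?
for b1, b2 in zip(batches1, batches2):
    print(torch.equal(b1, b2))        # same content, in the same order?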
I have several questions related to my case:
- Regarding the way I handle the drop_last argument, is my conclusion correct? (The sketch above is the check I’m running.)
- Both DataLoader and DistributedSampler have shuffle and drop_last arguments. When used in conjunction, which ones take priority?
- The documentation of both DataLoader and DistributedSampler states:
drop_last (bool, optional) – if True, then the sampler will drop the tail of the data to make it evenly divisible across the number of replicas. If False, the sampler will add extra indices to make the data evenly divisible across the replicas. Default: False.
I wonder how the extra indices are added (I assume through uniform sampling with replacement over the whole set of indices?). The inspection sketch after this list shows how I’ve been trying to find out.
- Is there any way to ensure consistency between my two cases if I don’t want to drop the last, unevenly sized batch (i.e., use drop_last=False)? I guess I would need to control how the extra indices are added; is that possible? (See the subclass sketch after this list for what I have in mind.)
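For the last two points, this is how I’ve been trying to inspect the padding myself: listing each replica’s indices with drop_last=False shows directly which extra indices get added (again a sketch; the 10-sample / 3-replica split is an arbitrary choice of mine that forces 2 padding indices):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 10 samples over 3 replicas: 2 extra indices are needed for an even split.
dataset = TensorDataset(torch.arange(10))
for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank,
                                 shuffle=False, drop_last=False)
    print(rank, list(sampler))  # which indices does each replica get, and which repeat?

And if controlling the padding is the way to go, I imagine something like the subclass below. This is purely hypothetical on my part: it assumes the shuffle, seed, epoch, drop_last, total_size, rank, and num_replicas attributes that DistributedSampler.__init__ sets in recent PyTorch versions, and, for simplicity, that the required padding never exceeds the dataset length.

import torch
from torch.utils.data.distributed import DistributedSampler

class PaddedDistributedSampler(DistributedSampler):
    # Hypothetical: like DistributedSampler, except that the rule for
    # the extra padding indices is spelled out and easy to change.
    def __iter__(self):
        if self.shuffle:
            # Deterministic shuffle, reseeded each epoch.
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))
        if not self.drop_last:
            # My padding rule of choice: repeat the first indices
            # (assumes padding <= len(dataset)).
            padding = self.total_size - len(indices)
            indices += indices[:padding]
        else:
            # Drop the tail so the total is evenly divisible.
            indices = indices[:self.total_size]
        # Each replica takes a strided slice of the global index list.
        return iter(indices[self.rank:self.total_size:self.num_replicas])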
If some of these points have already been addressed, I apologize; please redirect me to the relevant topics. Thanks in advance.