Understanding DistributedSampler and DataLoader drop_last

ElijahZh · July 14, 2024, 2:59am

Hi, I am confused about the drop_last parameter of DistributedSampler and DataLoader in DDP. Both classes accept drop_last. What is the best practice for setting it on each of them for the training and validation datasets?

For training dataset:

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, shuffle=True, drop_last=False)
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=batch_size_per_gpu,
                                           shuffle=(train_sampler is None),
                                           num_workers=workers_per_gpu,
                                           sampler=train_sampler,
                                           drop_last=True)

For validation dataset:

val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset, shuffle=False, drop_last=False)
val_loader = torch.utils.data.DataLoader(val_dataset,
                                         batch_size=batch_size_per_gpu,
                                         shuffle=False,
                                         num_workers=workers_per_gpu,
                                         sampler=val_sampler,
                                         drop_last=False)

Are these the correct ways to set up the DistributedSampler and DataLoader? If so, for the val_loader, the last batch might not divide evenly across the GPUs. It would be really helpful if someone could explain the process in detail.

Ravi_Teja2 · July 14, 2024, 1:41pm

Hey, I’m also a bit new to distributed training. I will do my best to explain:

How DistributedSampler works:
Let’s say we have 41 samples in our dataset and 2 GPUs, and we use DDP.
DistributedSampler splits the dataset indices across the GPUs, so that the model replica on each GPU sees different data.
It tries to split the indices evenly. If the dataset size is not divisible by the number of GPUs, it either drops the tail or pads with repeated indices, depending on drop_last. In our case 41 cannot be split equally across 2 ranks:
If I set drop_last=True (no padding), one sample is dropped:
rank 0 = 20, rank 1 = 20
If I set drop_last=False (pad), one index is repeated, so there will be a duplicate:
rank 0 = 21, rank 1 = 21
the logic: pytorch/torch/utils/data/distributed.py at 81322aee7452cd081c8d10f0f27609c8ba1f4bb4 · pytorch/pytorch · GitHub
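To make the arithmetic for the 41-sample, 2-GPU case concrete, here is a minimal pure-Python sketch of the index split. It is a simplified re-implementation for illustration, not the library code: shard_indices is a hypothetical helper name, it assumes shuffle=False and a non-empty dataset, and it only mirrors my reading of the linked file.

```python
import math


def shard_indices(dataset_len, num_replicas, rank, drop_last):
    """Simplified, unshuffled sketch of a per-rank index split (hypothetical helper)."""
    if drop_last and dataset_len % num_replicas != 0:
        # Drop the tail so every rank gets the same number of samples.
        num_samples = math.ceil((dataset_len - num_replicas) / num_replicas)
    else:
        # Round up; any shortfall is filled with repeated indices below.
        num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    indices = list(range(dataset_len))
    if drop_last:
        indices = indices[:total_size]
    else:
        padding = total_size - len(indices)
        # Pad by repeating indices from the start of the list.
        indices += (indices * math.ceil(padding / len(indices)))[:padding]

    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


for drop_last in (True, False):
    for rank in range(2):
        idx = shard_indices(41, 2, rank, drop_last)
        print(f"drop_last={drop_last} rank={rank}: {len(idx)} samples")
```

With drop_last=True each rank ends up with 20 indices and index 40 is never used. With drop_last=False each rank gets 21 indices and index 0 appears twice overall.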

What the DataLoader does:
It gets indices from the sampler, fetches those samples from the dataset, and uses collate_fn to assemble them into batches.
drop_last in DataLoader:
We pass batch_size to the DataLoader to create batches.
Let’s continue the case above:
rank 0: 20 samples, rank 1: 20 samples.
If I set batch_size=3,
then each rank gets 6 full batches, and the last batch will have only 2 samples.
If I set drop_last=True on the DataLoader:
the last incomplete batch on each rank is dropped.
If I set drop_last=False on the DataLoader:
the last incomplete batch is returned as is.
I don’t think it pads here: pytorch/torch/utils/data/sampler.py at 5fe9515d35677649c9dfac53ab73b5a9d78afde8 · pytorch/pytorch · GitHub
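The batch counts for one rank can be checked with a small sketch. It assumes a 20-element TensorDataset as a stand-in for the shard a single rank receives; rank_shard and batch_sizes are names made up for this example, and no DistributedSampler or process group is involved.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 20 samples stand in for the indices one rank receives in the example above.
rank_shard = TensorDataset(torch.arange(20))


def batch_sizes(drop_last):
    """Return the size of every batch the loader yields for this shard."""
    loader = DataLoader(rank_shard, batch_size=3, drop_last=drop_last)
    return [batch[0].shape[0] for batch in loader]


sizes_drop = batch_sizes(drop_last=True)
sizes_keep = batch_sizes(drop_last=False)
print("drop_last=True :", sizes_drop)
print("drop_last=False:", sizes_keep)
```

With drop_last=True the loader yields 6 batches of 3 and the 2 leftover samples are skipped. With drop_last=False a 7th batch of size 2 is yielded, with no padding.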

Julian_Lehrer · July 14, 2024, 11:42pm

drop_last just means that the torch.utils.data.DataLoader will drop the last batch when you iterate over it, if that batch is incomplete (smaller than batch_size). This can be useful if, for example, your last batch would have a batch size of 1 and your model uses BatchNorm as normalization layers, which can raise an error in training mode when there is only 1 sample.
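A small sketch of the BatchNorm failure mode, assuming an nn.BatchNorm1d layer with 4 features (an arbitrary choice for illustration) and a batch containing a single sample:

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(4)

# Training mode: batch statistics cannot be computed from a single value per channel.
bn.train()
try:
    bn(torch.randn(1, 4))
    raised = False
except ValueError as err:
    raised = True
    print("training mode:", err)

# Eval mode uses the running statistics, so a batch of 1 is fine.
bn.eval()
out = bn(torch.randn(1, 4))
print("eval mode output shape:", tuple(out.shape))
```

This is one reason drop_last=True is common on the training loader, while drop_last=False on the validation loader keeps every sample in the evaluation.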