Can DDP divide a dataset unevenly across workers?

Hi,

Is there any native method in DDP to divide a given dataset unevenly? For example, 60% of the CIFAR-10 data is distributed to the first worker in each epoch while the other 40% is processed by the other worker.

Thanks for any comments in advance.

Hi, there is no native support for this at the moment. DistributedSampler does not take a custom sampler, so it will always do the default, which is to divide the data evenly.

One option is to implement your own data sampling across workers and use DDP's uneven-inputs Join context manager (Distributed Training with Uneven Inputs Using the Join Context Manager — PyTorch Tutorials 1.11.0+cu102 documentation) to make sure DDP does not hang due to the uneven dataset sizes.
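For the hang-avoidance part, here is a minimal sketch of how the Join context manager is typically used (assuming the process group, the DDP-wrapped model, and the per-rank loaders are already set up; train_one_epoch is my own name, not part of the tutorial):

from torch.distributed.algorithms.join import Join

def train_one_epoch(ddp_model, loader, optimizer, loss_fn):
    # Join lets ranks that exhaust their (smaller) shard first shadow the
    # collective calls of the still-running ranks instead of hanging them.
    with Join([ddp_model]):
        for inputs, targets in loader:  # loaders may yield different numbers of batches per rank
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()
            optimizer.step()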

Curious about your use case though: what is the aim of having a 60/40 skew of data across workers?

Hi, thanks for your response.

My motivation is that I want the faster GPU to process more data and the slower GPU to process less. The reason is the difference in capability across GPUs: for example, a batch size of 60 costs GPU1 the same time as a batch size of 40 costs GPU2. For this reason, I want to divide the data 6:4 to alleviate the speed gap.

Designing a custom batch_sampler can achieve that.

# this is a super simple example; implement your own version
from torch.utils.data import DataLoader, DistributedSampler

class MyDistributedSampler(DistributedSampler):
    # customize the __iter__ method: yield one batch of indices per rank
    def __iter__(self):
        if self.rank == 0:  # data 0-59 on rank 0
            return iter([list(range(60))])
        if self.rank == 1:  # data 60-99 on rank 1
            return iter([list(range(60, 100))])

# pass the sampler as batch_sampler when creating the DataLoader
train_sampler = MyDistributedSampler(...)
trainloader = DataLoader(tr_set, batch_sampler=train_sampler, ...)
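A rough generalization of the example above (my own naming, not an official API): split the dataset across ranks proportionally to per-rank batch sizes, e.g. [60, 40], and yield batches of that size on each rank.

class ProportionalBatchSampler(DistributedSampler):
    # batch_sizes: one entry per rank, e.g. [60, 40]
    def __init__(self, dataset, batch_sizes, **kwargs):
        super().__init__(dataset, **kwargs)
        self.batch_sizes = batch_sizes

    def __iter__(self):
        total = sum(self.batch_sizes)
        n = len(self.dataset)
        # contiguous share of the data, proportional to this rank's batch size
        start = n * sum(self.batch_sizes[:self.rank]) // total
        end = n * sum(self.batch_sizes[:self.rank + 1]) // total
        bs = self.batch_sizes[self.rank]
        indices = list(range(start, end))
        return iter([indices[i:i + bs] for i in range(0, len(indices), bs)])

Because of rounding, the ranks may end up with slightly different numbers of batches, which is where the Join context manager mentioned above helps.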

And note that you should apply additional normalization to the loss, since the batch sizes on the nodes are not the same. See "Alternatives" at Support different batch size across GPUs with DDP · Issue #67253 · pytorch/pytorch (github.com).
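As a rough sketch of one such normalization (my own naming, not the exact code from the issue): scale each rank's mean loss by its share of the global batch so that DDP's gradient averaging reproduces a plain mean over all samples. The per-sample loss can come from a criterion built with reduction='none'.

import torch
import torch.distributed as dist

def reweighted_loss(per_sample_loss: torch.Tensor) -> torch.Tensor:
    # per_sample_loss: unreduced loss, one element per local sample
    local = torch.tensor(float(per_sample_loss.numel()),
                         device=per_sample_loss.device)
    global_ = local.clone()
    dist.all_reduce(global_)  # sum of the batch sizes over all ranks
    world_size = dist.get_world_size()
    # DDP averages gradients over ranks (divides by world_size), so scaling
    # the local mean by local/global * world_size recovers the global mean.
    return per_sample_loss.mean() * (local / global_) * world_size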

Thank you very much. This is exactly what I was looking for.