What Dataset/DataLoader for DDP to train on sharded local dataset?


I’m working with a system (Amazon SageMaker Training) capable of spreading remote files homogeneously across machines. Meaning that at the beginning of a distributed DDP training job, the files of my dataset (e.g. images, text files) are spread evenly across the disks of each DDP node, as represented in the diagram below.

What Dataset & DataLoader settings shall I use to train DDP over this distributed dataset?

Most DDP examples I’ve seen do a virtual sharding with the DistributedSampler in the DataLoader, but here my data is already physically distributed…

cc @VitalyFedyunin for dataloader

The DistributedSampler would create chunks from the dataset indices so that each worker only gets its corresponding subset. If you are using a local dataset on each worker, you should be able to just load that local dataset in each process.

It feels like you can use the regular sampler, but you need to make sure that ALL local datasets are equal-sized; otherwise some ranks will run more iterations than others and your distributed training will hang waiting for gradient synchronization.