Passing a dataloader and model to multiprocessing.spawn

Hi,

I am exploring the use of DistributedDataParallel to train on two GPUs. In all the examples I have found, the DataLoader and model are instantiated separately on each rank. Can I create the model and dataloader outside of the multiprocessing.spawn function and pass them as input arguments to multiprocessing.spawn? I mean something like this:

import torch.multiprocessing as mp
from torch.utils.data import DataLoader

# dataset, MyNet, fit and devices are defined elsewhere
loader = DataLoader(dataset, batch_size=128, shuffle=True)
model = MyNet()

if __name__ == '__main__':
    mp.spawn(fit, args=(model, devices, loader), nprocs=len(devices), join=True)

In this case, will a new model and dataloader be created at each rank, without shared memory?
I would like to use the same iterable for the dataloader at each rank so each GPU works on the same epoch. Is this possible?

I know that it would be better to use a DistributedSampler, but I am working with graphs and I cannot do so due to the irregular structure of my data. If you know of a similar option, I would be glad to hear about it.

I think that the model’s parameter tensors will have their data moved to shared memory, as per Multiprocessing best practices — PyTorch 1.9.0 documentation, so you’d essentially be doing Hogwild training. This could cause issues with DistributedDataParallel, as the model is usually instantiated individually on each rank. Is there a reason you can’t simply create the model within each subprocess?
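
For reference, here is a minimal sketch of the usual pattern, where the model is built and wrapped in DDP inside each spawned process. The rendezvous settings, world_size and the fit signature are placeholder choices on my part, and MyNet / dataset are assumed to be the classes from your snippet:

import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

def fit(rank, world_size, dataset):
    # Placeholder rendezvous settings; adjust them to your setup.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    # Build the model inside the subprocess, move it to this rank's GPU,
    # and wrap it in DDP so gradients are averaged across ranks.
    model = MyNet().to(rank)
    model = DDP(model, device_ids=[rank])

    loader = DataLoader(dataset, batch_size=128, shuffle=True)
    # ... training loop over loader using model ...

    dist.destroy_process_group()

if __name__ == '__main__':
    dataset = ...   # placeholder: build your dataset here
    world_size = 2  # one process per GPU
    mp.spawn(fit, args=(world_size, dataset), nprocs=world_size, join=True)

That way only picklable arguments cross the process boundary, and each rank builds its own model replica rather than sharing parameter storage.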

Regarding the question about the dataloader: I think that if the dataloader is serializable this should work, but it would result in each rank training on the same data, so DDP would not really help, since every model replica would be processing identical batches. Is it possible for you to pass in a file handle/file name to each worker, and possibly an index/offset, so you can shard your data across the workers?
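
For example, something along these lines could work even without a DistributedSampler (a rough sketch; make_rank_loader is a made-up helper name, and the rank-strided slicing with Subset is just one possible sharding scheme):

from torch.utils.data import DataLoader, Subset

def make_rank_loader(dataset, rank, world_size, batch_size=128):
    # Each rank keeps every world_size-th sample, starting at its own rank,
    # so the ranks iterate over disjoint shards of the same dataset.
    indices = list(range(rank, len(dataset), world_size))
    return DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=True)

Each rank would then see roughly len(dataset) / world_size samples per epoch, so DDP's gradient averaging actually buys you something.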