How can I train a model with multiple data loaders using DistributedDataParallel?

Hi, and thank you for taking the time to read my question.

I’ve seen a lot of posts on the PyTorch forum, but I couldn’t find a solution to what I was wondering about.

Can three dataloaders be used with DistributedDataParallel?

The process I have in mind is as follows.

GPU1: model, dataloader1
GPU2: model, dataloader2
GPU3: model, dataloader3

I want to backpropagate by averaging the loss values from each dataloader.

Are there any possible methods or examples?

Each dataloader has a different image size and batch size, so ConcatDataset cannot be used.

If there is a way, please comment.

Thank you.

The process I have in mind is as follows.

GPU1: model, dataloader1
GPU2: model, dataloader2
GPU3: model, dataloader3

This is possible and is actually the recommended use case for DDP. For the dataloader, you can use DistributedSampler and set num_replicas to match the DDP world size.
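As a minimal sketch (dataset and batch_size are placeholders, and build_loader is just a hypothetical helper), each rank could build its loader like this, with num_replicas matching the world size:

```python
# Minimal sketch: one DataLoader per DDP process, sharding the dataset with
# DistributedSampler. `dataset` and `batch_size` are placeholders.
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, batch_size):
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),  # must match the DDP world size
        rank=dist.get_rank(),                # this process's rank
        shuffle=True,
    )
    # The sampler already handles shuffling and sharding, so don't pass
    # shuffle=True to the DataLoader itself.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```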

I want to backpropagate by averaging the loss values from each dataloader.

This is a bit tricky. With DDP, outputs and losses are local to each process. DDP synchronizes models by synchronizing gradients. More details are available in this PyTorch Distributed paper.
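To illustrate (compute_local_loss is a hypothetical stand-in for your model’s forward pass and criterion on a DDP-wrapped model): the loss stays local to each rank, and what gets averaged during backward() are the gradients. If you want to see the average loss across ranks, e.g. for logging, you would have to reduce it yourself:

```python
# Sketch: each rank computes its own loss; backward() is where DDP averages
# gradients across processes. `compute_local_loss` is a hypothetical stand-in.
import torch
import torch.distributed as dist

loss = compute_local_loss()   # this rank's forward pass + criterion (stand-in)
loss.backward()               # DDP averages gradients across ranks here

with torch.no_grad():
    avg_loss = loss.detach().clone()
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
    avg_loss /= dist.get_world_size()   # average loss across ranks, logging only
```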

Each dataloader has a different image size and batch size, so ConcatDataset cannot be used.

I see. Independent dataloaders can still work with DDP, but you will need to be aware of the implications for model accuracy. Is this the reason why you have to average losses instead of gradients?

Thanks for the answer.

Sorry.

I didn’t fully understand your question, “Is this the reason why you have to average losses instead of gradients?”

I don’t think my question was clear.

To put it more simply, I want to train one model using multiple dataloaders.

All I want to do is update the model using the average of the losses calculated from multiple dataloaders.

I just want to use loss values from various image sizes and batch sizes.

Please tell me specifically what you would like to know about my question.

I just want to use loss values from various image sizes and batch sizes.

If this is all you need, you don’t need to average the losses. You can let each DDP process create its own independent dataloader and produce a local loss. Then, during the backward pass, DDP will synchronize the gradients for you.
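Here is a rough sketch of building the per-rank loaders (MyDataset is a hypothetical dataset, and the per-rank image/batch sizes are made-up values); the training loop itself stays the usual forward/backward/step:

```python
# Sketch: each DDP process builds its own independent DataLoader with its own
# image size and batch size. MyDataset is a hypothetical placeholder.
import torch.distributed as dist
from torch.utils.data import DataLoader

def build_rank_loader():
    rank = dist.get_rank()
    image_sizes = [128, 256, 512]   # assumed: one entry per rank/GPU
    batch_sizes = [64, 32, 16]
    dataset = MyDataset(image_size=image_sizes[rank])   # hypothetical dataset
    # No DistributedSampler here: each rank already has its own loader,
    # so there is nothing to shard across processes.
    return DataLoader(dataset, batch_size=batch_sizes[rank], shuffle=True)
```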

Okay.

Do you have any examples or documentation for reference?

Thank you.

Yep, here is a starter example: Distributed Data Parallel — PyTorch 1.8.0 documentation

You can replace the torch.randn(20, 10).to(rank) random input tensor with inputs and labels from a dataloader, for example (see the sketch below).
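Roughly, keeping the names from that example (ddp_model, loss_fn, optimizer; they may differ slightly in the current docs) and assuming loader is whatever dataloader this rank builds, the training step could look like this:

```python
# Sketch: the tutorial's random input is replaced with batches from a DataLoader.
# `loader`, `ddp_model`, `loss_fn`, and `optimizer` are assumed to be set up as
# in the linked example; only the data source changes.
for inputs, labels in loader:
    inputs, labels = inputs.to(rank), labels.to(rank)
    optimizer.zero_grad()
    outputs = ddp_model(inputs)          # was: ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, labels).backward()  # DDP averages gradients here
    optimizer.step()
```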

Here is a complete list of DDP tutorials: PyTorch Distributed Overview — PyTorch Tutorials 1.8.0 documentation

It should work, but it will have different accuracy implications for different applications, depending on your model, data, batch size, loss function, etc. I would suggest taking a look at the DDP paper, or at least the design notes, and verifying whether DDP’s gradient averaging algorithm is OK for your application.