GPU1          GPU2          GPU3
model         model         model
dataloader1   dataloader2   dataloader3
This is possible and is actually the recommended use case for DDP. For the dataloaders, you can use DistributedSampler and set num_replicas to match the DDP world size.
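A minimal sketch of that setup; the dataset, tensor shapes, and the hard-coded rank below are assumptions for illustration (in a real DDP job, num_replicas and rank come from the process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for your real one: 100 images of 3x32x32.
dataset = TensorDataset(
    torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# num_replicas/rank normally default to torch.distributed.get_world_size()
# and get_rank(); they are hard-coded here so the sketch runs standalone.
sampler = DistributedSampler(dataset, num_replicas=3, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# Each of the 3 replicas iterates over its own ~1/3 shard of the dataset.
```

Note that with shuffle=True you should call `sampler.set_epoch(epoch)` at the start of each epoch so every epoch uses a different shuffling.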
I want to backpropagate by averaging the loss values from each dataloader.
This is a bit tricky. With DDP, outputs and losses are local to each process. DDP synchronizes models by synchronizing gradients. More details are available in this PyTorch Distributed paper.
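To see why gradient averaging can stand in for loss averaging: by linearity of the gradient, backpropagating through the mean of the local losses gives the same result as averaging the per-loss gradients, which is what DDP's all-reduce does (assuming equal local batch sizes). A single-process sketch with toy losses (the losses themselves are assumptions):

```python
import torch

# Two "local" losses on shared parameters, as two DDP ranks would produce.
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss_a = (w ** 2).sum()   # pretend this is rank 0's local loss
loss_b = (3 * w).sum()    # pretend this is rank 1's local loss

# Gradient of the averaged loss ...
((loss_a + loss_b) / 2).backward(retain_graph=True)
grad_of_avg_loss = w.grad.clone()

# ... equals the average of the individual gradients (DDP's all-reduce).
ga = torch.autograd.grad(loss_a, w, retain_graph=True)[0]
gb = torch.autograd.grad(loss_b, w)[0]
avg_of_grads = (ga + gb) / 2
```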
Each dataloader has a different image size and batch size, so ConcatDataset cannot be used.
I see. Independent dataloaders could still work with DDP, but you will need to be aware of the implications for model accuracy. Is this why you want to average losses instead of gradients?
I just want to compute different loss values using different image sizes and batch sizes.
If this is all you need, you don't need to average the loss. You can let each DDP process create its own independent dataloader and produce a local loss; then, during the backward pass, DDP will synchronize (average) the gradients for you.
It should work, but it may have different accuracy implications for different applications, depending on your model, data, batch size, loss function, etc. I would suggest taking a look at the DDP paper, or at least the design notes, and verifying whether DDP's gradient-averaging algorithm is OK for your application.
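Putting the pieces together, a runnable single-process sketch of that recipe (the gloo backend, model, sizes, and hyperparameters are assumptions; in a real job each rank is launched by torchrun and builds its own, possibly different, dataloader):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int = 0, world_size: int = 1) -> float:
    # Single-process "world" so this runs without torchrun; in practice
    # torchrun sets MASTER_ADDR/MASTER_PORT, rank, and world size per process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(16, 1))          # CPU model, gloo backend
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Each rank builds its own independent dataloader; the feature size
    # and batch size here are assumptions and may differ across ranks.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(32, 16), torch.randn(32, 1))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

    loss = torch.tensor(0.0)
    for x, y in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)  # local loss
        loss.backward()   # DDP all-reduces (averages) gradients here
        opt.step()

    dist.destroy_process_group()
    return loss.item()

final_loss = run()
```

Each process runs exactly this loop on its own shard; no manual loss averaging is needed, because the gradient all-reduce inside `loss.backward()` already keeps the replicas in sync.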