How does gradient averaging work in DistributedDataParallel training? I am particularly interested in what happens when the batches have masked or ignored data, e.g. with semantic segmentation.
For example: say I have 4 GPUs and I am training a semantic segmentation network on a dataset with an ignore class. As I understand it, in the DataParallel setting the outputs are gathered on GPU0, the loss is computed there, and the gradient is backpropagated through each GPU's replica. In the DistributedDataParallel case, the losses L0, L1, L2, L3 are each computed on the corresponding GPU's share of the batch, each loss is backpropagated through that GPU's replica, and the resulting gradients are averaged across GPUs (via all-reduce).
Using DataParallel, the presence of an ignore class makes no difference: even if one GPU's mini-batch has a lopsided number of ignored pixels, the loss is still computed as a weighted average over all valid pixels in the full batch. But what happens with DistributedDataParallel when the distribution of ignored pixels is lopsided across GPUs? There does not seem to be any mechanism for weighting the average of the gradients. Yet in this case, L0, L1, L2, and L3 ought to have their contributions weighted by each GPU's share of valid pixels when the gradients are averaged during backpropagation.
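To make the discrepancy concrete, here is a small numeric sketch (plain Python, no actual DDP; the per-pixel loss values are made up) showing that a uniform average of per-GPU mean losses differs from the full-batch weighted average when the valid-pixel counts are unequal:

```python
# Two "GPUs" with made-up per-pixel losses over their valid (non-ignored)
# pixels; GPU 0 happens to have only one valid pixel.
per_gpu_losses = [
    [0.9],                                      # GPU 0: 1 valid pixel
    [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 0.3, 0.5],  # GPU 1: 8 valid pixels
]

# Each GPU computes the mean loss over its own valid pixels.
per_gpu_means = [sum(l) / len(l) for l in per_gpu_losses]

# DDP averages per-GPU gradients uniformly, which corresponds to a
# uniform average of the per-GPU mean losses.
uniform_avg = sum(per_gpu_means) / len(per_gpu_means)

# The single-process (or DataParallel) loss is the mean over ALL valid
# pixels, i.e. a count-weighted average of the per-GPU means.
all_pixels = [x for l in per_gpu_losses for x in l]
weighted_avg = sum(all_pixels) / len(all_pixels)

print(uniform_avg, weighted_avg)  # the two disagree
```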
Is there some way to handle this ignore class imbalance during distributed training?
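One workaround I can imagine (I am not sure whether it is the recommended approach) is to all-reduce the valid-pixel counts across ranks and rescale each rank's loss by world_size * n_valid_rank / n_valid_total before calling backward, so that DDP's uniform gradient average reproduces the pixel-weighted average. A toy sketch of the arithmetic, using plain numbers instead of actual torch.distributed calls:

```python
# Made-up per-rank statistics: rank 0 has 1 valid pixel, rank 1 has 8.
counts = [1, 8]        # n_valid on each rank
means = [0.9, 0.625]   # each rank's mean loss over its valid pixels
world_size = len(counts)

# In real DDP code, the total count would come from dist.all_reduce on a
# tensor holding the local count; here we just sum the list.
total = sum(counts)

# Rescale each rank's loss so that uniform averaging of gradients is
# equivalent to weighting each rank by its valid-pixel count.
scaled = [m * world_size * c / total for m, c in zip(means, counts)]

# DDP's uniform average of the scaled losses...
uniform_avg_of_scaled = sum(scaled) / world_size

# ...matches the true full-batch (pixel-weighted) average loss.
weighted_avg = sum(m * c for m, c in zip(means, counts)) / total

print(uniform_avg_of_scaled, weighted_avg)
```

If this rescaling is valid, it would only require one extra all-reduce of a scalar per step, but I would like to confirm whether this is the right way to think about it.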