Is there an example of how to calculate the loss on multiple GPUs and merge the results after the calculation?
Currently, we can compute the output of a network using DistributedDataParallel. However, the result from DistributedDataParallel is collected on device 0, so the loss calculation is done on one GPU only instead of on multiple GPUs.
What do you mean exactly here? Are you looking to compute only a single loss value for a model that gets executed on multiple processes? Or just on multiple GPUs from a single process?
Even though the result is collected on GPU 0, the gradients will propagate back through the activations on all GPUs that contributed to computing the final loss value. The gradients that are computed for every replica are averaged automatically by torch.nn.parallel.DistributedDataParallel, so all replicas contribute to the gradient that is later used by the optimizer to update your model weights.
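To make the averaging point concrete, here is a minimal single-process sketch. It does not use DistributedDataParallel itself: it simulates two replicas on CPU with copies of one model, lets each replica compute its own loss on its shard of the batch, and then averages the gradients by hand. The model, the batch sizes, and names such as replicas and merged_loss are assumptions made for illustration. It also assumes equally sized shards and a mean-reduced loss, which is what makes the averaged result match the full-batch one.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and batch, chosen only for illustration.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()  # default reduction="mean"
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Reference: a single loss over the full batch on one device.
reference = copy.deepcopy(model)
loss_fn(reference(inputs), targets).backward()

# Simulated replicas: each one gets an equal shard of the batch and
# computes its own loss and gradients locally.
num_replicas = 2
replicas = [copy.deepcopy(model) for _ in range(num_replicas)]
local_losses = []
shards = zip(replicas, inputs.chunk(num_replicas), targets.chunk(num_replicas))
for replica, x, y in shards:
    loss = loss_fn(replica(x), y)
    loss.backward()
    local_losses.append(loss.detach())

# Average the per-replica gradients parameter by parameter. This is a
# hand-written stand-in for the averaging described above.
per_replica_grads = [[p.grad for p in r.parameters()] for r in replicas]
averaged_grads = [
    torch.stack(grads).mean(dim=0) for grads in zip(*per_replica_grads)
]

# Merge the local losses into one value, for example for logging.
merged_loss = torch.stack(local_losses).mean()

# The averaged gradients should match the full-batch gradients.
grads_match = all(
    torch.allclose(avg, p.grad, atol=1e-6)
    for avg, p in zip(averaged_grads, reference.parameters())
)
print("merged loss:", merged_loss.item())
print("averaged gradients match full-batch gradients:", grads_match)
```

In a real multi-process setup the same idea applies: each process computes the loss on its own output, and if a single number is needed for reporting, the detached per-process losses can be combined with a collective such as torch.distributed.all_reduce.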
I am facing the same problem. Have you found a proper solution?