Passing tensors between GPUs just before loss calculation

Hi there,

I am currently using DistributedDataParallel (DDP) for multi-GPU training. So far I have not needed to send data across GPUs manually, because DDP automatically all-reduces the gradients from all GPUs during the backward pass before the model replicas are updated.

However, I would now like to extend my loss function with a calculation that requires the data from all GPUs. So I would like to manually / explicitly exchange tensors between GPUs before calling backward. Let’s say the loss is calculated from two terms:

  1. One term, like a reconstruction loss, that can be calculated on each GPU separately and treated as usual (its gradients are simply averaged by DDP).
  2. A second term, like a statistic over all samples in the batches across the different GPUs, for which I would like the GPUs to sync / exchange data before computing the loss and running the backward pass.

So, I need to know A) how to achieve this data passing for the second term, and B) how to compute the loss and perform a backward pass with these two types of losses combined.

Any help is much appreciated!



You could take a look at the Collective Communication docs which explain how tensors can be scattered, gathered, and reduced in a distributed setup.
Once you’ve made sure the data is available on the desired devices, I assume you can calculate the loss and let DDP compute the gradients etc.
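As a minimal sketch of what those collectives look like (a single-process "gloo" group on CPU is set up here just so the snippet runs anywhere; the tensor names are made up for illustration, and under a real DDP launcher the process group already exists):

```python
import tempfile

import torch
import torch.distributed as dist

# Single-process "gloo" group for demonstration only; with torchrun / DDP
# the process group is already initialized and this block is unnecessary.
if not dist.is_initialized():
    init_file = tempfile.NamedTemporaryFile(delete=False).name
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )

# Per-rank batch (random here, standing in for this GPU's activations).
local_batch = torch.randn(8, 4)

# Compute a local statistic, then all-reduce it so every rank ends up
# holding the global mean across all GPUs.
stat = local_batch.mean(dim=0)
dist.all_reduce(stat, op=dist.ReduceOp.SUM)
stat /= dist.get_world_size()

# Caveat: the plain collectives in torch.distributed operate in-place and
# do NOT record autograd history, so gradients will not flow back through
# the communication itself.
```

Note the caveat in the last comment: if the second loss term needs gradients to flow through the exchanged data, the plain collectives are not enough on their own.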

Agreed with @ptrblck. @ClaartjeBarkhof You can manually send the data across GPUs with the communication ops provided by the distributed package; in this case you might consider using `all_reduce`, `scatter`, or `reduce_scatter`. If you also want backward to work automatically through those data communications, consider using the ops in `torch.distributed.nn` instead.
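To make that concrete, here is a hedged sketch of the full two-term pattern using the differentiable `torch.distributed.nn.all_gather` (again with a single-process "gloo" group purely so it runs on one CPU machine; the loss terms are illustrative stand-ins, not your actual losses):

```python
import tempfile

import torch
import torch.distributed as dist
import torch.distributed.nn  # differentiable collective ops

# Demo-only single-process group; a real DDP launcher provides this.
if not dist.is_initialized():
    init_file = tempfile.NamedTemporaryFile(delete=False).name
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )

# Per-rank features that require grad (e.g. the model output on this GPU).
local_feats = torch.randn(8, 4, requires_grad=True)

# 1) Local term: an ordinary per-rank loss (reconstruction-style stand-in).
recon_loss = (local_feats ** 2).mean()

# 2) Global term: gather the features from all ranks WITH autograd support,
#    then compute a statistic over the full cross-GPU batch.
gathered = torch.distributed.nn.all_gather(local_feats)  # list of tensors
all_feats = torch.cat(gathered, dim=0)
stat_loss = all_feats.mean(dim=0).pow(2).sum()

# Sum the two terms and call backward once; gradients flow through both
# the local path and the all_gather communication.
loss = recon_loss + stat_loss
loss.backward()
```

Summing the two terms and calling `backward()` once answers part B of the question: DDP then averages the resulting gradients across ranks as usual.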