The correct way to deal with this is to average the loss values with torch.mean(loss): each GPU produces its own loss value, and averaging those losses is, as far as I can tell, equivalent to averaging the gradients.
Here is a post where this is also discussed: https://discuss.pytorch.org/t/is-average-the-correct-way-for-the-gradient-in-distributeddataparallel-with-multi-nodes/34260
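A minimal sketch of what that can look like with nn.DataParallel, assuming the loss is computed inside the wrapped module so that each replica returns its own loss value (the module and shapes here are just placeholders):

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Hypothetical module that computes its own loss, so DataParallel
    gathers one loss value per GPU."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)
        self.criterion = nn.MSELoss()

    def forward(self, x, target):
        pred = self.net(x)
        # Return the loss with shape (1,) so gathering across replicas
        # yields a 1-D tensor with one entry per GPU.
        return self.criterion(pred, target).unsqueeze(0)

model = nn.DataParallel(ModelWithLoss()).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 10).cuda()
target = torch.randn(32, 1).cuda()

loss = model(x, target)   # shape [num_gpus]: one loss value per GPU
loss = torch.mean(loss)   # average across GPUs before backward()
loss.backward()
optimizer.step()
```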