How to average gradients when using DataParallel?

By default, DataParallel sums the gradients from all replicas. How can I average the gradients instead when using DataParallel?

The usual way to handle this is to average the loss values with torch.mean(loss): when the loss is computed inside the wrapped module, each GPU returns its own loss for its chunk of the batch, and averaging those losses before calling backward() has the same effect as averaging the gradients, because differentiation is linear and the gradient of the mean of the per-GPU losses equals the mean of the per-GPU gradients.
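For illustration, here is a minimal sketch of that pattern. It assumes the loss is computed inside the module wrapped by DataParallel, so each GPU returns a scalar loss that gets gathered into a 1-D tensor on the default device; the module name ModelWithLoss and the shapes are made up for the example.

```python
import torch
import torch.nn as nn

# Hypothetical module that computes the loss inside forward(), so each GPU
# returns its own scalar loss for its chunk of the batch.
class ModelWithLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)
        self.criterion = nn.MSELoss()

    def forward(self, x, target):
        return self.criterion(self.net(x), target)

model = nn.DataParallel(ModelWithLoss().cuda())

x = torch.randn(64, 10).cuda()
target = torch.randn(64, 1).cuda()

# DataParallel gathers one loss value per GPU into a 1-D tensor on device 0.
per_gpu_losses = model(x, target)

# Averaging the per-GPU losses before backward() averages the resulting
# gradients as well, since differentiation is linear.
loss = per_gpu_losses.mean()
loss.backward()
```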
Here is a post where this is also discussed: https://discuss.pytorch.org/t/is-average-the-correct-way-for-the-gradient-in-distributeddataparallel-with-multi-nodes/34260