Accumulate gradients in DDP

Hey,

Is there any easy way to accumulate gradients in a DistributedDataParallel model?
From what I can see, the only way to do this at the moment would be to copy the gradients into a separate buffer before the next forward/backward pass (roughly the sketch at the end of this post), is that right?

Any plans on adding functionality for this to PyTorch? DataParallel has too much overhead for my use case; otherwise I would just use that.
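
For context, here is roughly the buffer-copy workaround I have in mind. Just a sketch; `train_step`, `loss_fn`, and `micro_batches` are placeholders, not real APIs:

```python
import torch

def train_step(ddp_model, optimizer, loss_fn, micro_batches):
    # One side buffer per parameter, same shape and device.
    buffers = [torch.zeros_like(p) for p in ddp_model.parameters()]

    for inputs, targets in micro_batches:
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()  # DDP all-reduces the gradients here
        for buf, p in zip(buffers, ddp_model.parameters()):
            if p.grad is not None:
                buf.add_(p.grad)  # stash the gradients in the side buffer

    # Write the accumulated gradients back and take a single step.
    for buf, p in zip(buffers, ddp_model.parameters()):
        if p.grad is not None:
            p.grad.copy_(buf)
    optimizer.step()
```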

This was merged very recently in https://github.com/pytorch/pytorch/pull/21736.
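
If I'm reading that PR right, it adds a `no_sync()` context manager to `DistributedDataParallel` that skips the gradient all-reduce, so gradients just accumulate locally in `.grad` and only get synchronized on the first backward outside the context. A minimal sketch of how you would use it; `loss_fn` and `micro_batches` are placeholders:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulate_and_step(ddp_model: DDP, optimizer, loss_fn, micro_batches):
    optimizer.zero_grad()

    # Skip the all-reduce for all but the last micro-batch;
    # gradients accumulate locally in param.grad.
    with ddp_model.no_sync():
        for inputs, targets in micro_batches[:-1]:
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()

    # The final backward outside no_sync() triggers the all-reduce,
    # which synchronizes the accumulated gradients across ranks.
    inputs, targets = micro_batches[-1]
    loss = loss_fn(ddp_model(inputs), targets)
    loss.backward()

    optimizer.step()
```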