Hey,
Is there any easy way to accumulate gradients in a DistributedDataParallel model?
From what I can see, the only way to do this would be to copy the gradients into a separate buffer before the next forward/backward pass (rough sketch below)?
Are there any plans to add functionality for this to PyTorch? DataParallel has too much overhead for me, otherwise I would use that.
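
To make it concrete, this is roughly the workaround I have in mind. It's only a sketch: `model` is assumed to already be wrapped in `DistributedDataParallel`, and `criterion`, `optimizer`, `data_loader` and `accumulation_steps` are placeholders for my usual training setup.

```python
import torch

# Assumed to exist already:
#   model      -- a module wrapped in torch.nn.parallel.DistributedDataParallel
#   criterion  -- loss function
#   optimizer  -- optimizer over model.parameters()
#   data_loader, accumulation_steps

# Separate buffer to accumulate gradients across iterations,
# since DDP all-reduces and overwrites .grad on every backward.
grad_buffer = [torch.zeros_like(p) for p in model.parameters()]

for step, (inputs, targets) in enumerate(data_loader):
    model.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # gradients are all-reduced here on every backward

    # Copy the freshly reduced gradients into the buffer
    for buf, p in zip(grad_buffer, model.parameters()):
        if p.grad is not None:
            buf.add_(p.grad)

    if (step + 1) % accumulation_steps == 0:
        # Write the accumulated (averaged) gradients back and step
        for buf, p in zip(grad_buffer, model.parameters()):
            p.grad = buf / accumulation_steps
        optimizer.step()
        for buf in grad_buffer:
            buf.zero_()
```

This works, but it costs an extra copy of all gradients plus a redundant all-reduce on every iteration, which is what I'd like to avoid.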