Will the gradients be collected correctly if I modify them manually with `tensor.register_hook` in DistributedDataParallel training?

I want to modify gradients manually with `tensor.register_hook()`.
I've verified that when training on a single GPU, the gradients are collected correctly.
But I'm not sure whether they will also be collected correctly under DDP training.

```python
def modify_grad(self, g):
    # Scale the gradient by 3 before it propagates further back.
    return g * 3

feature.register_hook(self.modify_grad)
```
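Here is a minimal single-GPU version of the check (a sketch; the toy layer and tensor names are illustrative, not my real model):

```python
import torch
import torch.nn as nn

fc = nn.Linear(4, 1)
x = torch.randn(2, 4)

feature = fc(x)
feature.register_hook(lambda g: g * 3)  # same effect as modify_grad
feature.sum().backward()

print(fc.weight.grad)  # 3x the gradient you'd get without the hook
```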

Say I multiply the gradient on each GPU by 3. Will each GPU then receive 3 times the original reduced gradient?

This topic seems to be related to your question and also has a code example.

I did some tests. `Tensor.register_hook()` does work with DDP. I first wrapped my model with DDP, then modified the grad manually with a custom function in the forward code of the model:
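(The original screenshot isn't preserved. Below is a sketch of what that setup roughly looks like; the module structure and names are my assumptions, not the original code.)

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 1)

    def modify_grad(self, g):
        # Runs during backward, before DDP all-reduces parameter grads.
        return g * 3

    def forward(self, x):
        feature = self.backbone(x)
        feature.register_hook(self.modify_grad)
        return feature

# Wrap the model with DDP as usual; the hook is registered on each forward.
# ddp_model = DDP(MyModel().to(rank), device_ids=[rank])
```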

Here are the gradients after calling loss.backward().
Without modifying the grad:
[screenshot of the printed gradients, not preserved]
And with the grad modified:
[screenshot of the printed gradients, not preserved]
You can see the gradients are exactly 3 times larger after being modified, which means `.register_hook()` works well with DDP.
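Since the screenshots are gone, here is a self-contained sketch of the same experiment (the gloo backend, toy model, and port number are my assumptions, not the original setup):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x, scale_grad=False):
        feature = self.fc(x)
        if scale_grad:
            # Same idea as modify_grad: scale the grad by 3 in backward.
            feature.register_hook(lambda g: g * 3)
        return feature

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    torch.manual_seed(0)          # identical params and data on every rank
    x = torch.randn(2, 4)

    for scale in (False, True):
        model = DDP(ToyModel())   # CPU DDP over the gloo backend
        model(x, scale_grad=scale).sum().backward()
        if rank == 0:
            tag = "with hook" if scale else "no hook "
            print(tag, model.module.fc.weight.grad)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

The two printed gradients should differ by exactly a factor of 3: the hook runs on each rank during backward, and DDP then averages the already-scaled local gradients, so each GPU ends up with 3 times the gradient it would have had without the hook.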