I want to modify gradients manually with tensor.register_hook().
I’ve tested that when training on one GPU, gradients are collected correctly. But I’m not sure whether, when training with DDP, the gradients will also be collected correctly.
def modify_grad(self, g):
    return g * 3

feature.register_hook(self.modify_grad)
Say I change the gradient on each GPU by multiplying it by 3; will 3 times the original reduced gradient then be broadcast to each GPU?
This topic seems to be related to your question and also has a code example.
I did some tests. Tensor.register_hook() does work in DDP. I first wrap my model with DDP, then modify the grad manually with a custom function in the forward code where I construct my model:
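(The original snippet didn’t survive here; below is a minimal sketch of the pattern described, with a hypothetical ToyModel standing in for the actual model.)

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def modify_grad(self, g):
        # triple the gradient flowing back through `feature`
        return g * 3

    def forward(self, x):
        feature = self.fc(x)
        if feature.requires_grad:
            feature.register_hook(self.modify_grad)
        return feature

# then wrap with DDP as usual, e.g. (inside an initialized process group):
# model = nn.parallel.DistributedDataParallel(ToyModel().cuda(), device_ids=[rank])
```

The hook fires during backward on each rank, before DDP all-reduces the gradients.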
Here are the gradients after calling loss.backward():
If I don’t modify the grad:
And if I modify the grad:
You can see the gradients are exactly 3 times larger after being modified, which means .register_hook() works well in DDP.
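The 3x effect itself can be checked even on a single process, since the hook runs during backward regardless of DDP. A minimal single-tensor check (names are illustrative):

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

# baseline: no hook
(w * x).sum().backward()
base_grad = w.grad.clone()

# same computation, but with a hook that triples the gradient
w.grad = None
feature = w * x
feature.register_hook(lambda g: g * 3)
feature.sum().backward()
```

After this, w.grad equals 3 * base_grad, matching the comparison above.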