I want to modify gradients manually with a function.
I've tested that when training on one GPU, the gradients are collected correctly.
But I'm not sure whether, when training with DDP, the gradients will also be collected correctly.
def modify_grad(self, g):
    return g * 3
Say I change the gradient on each GPU by multiplying it by 3. Will 3 times the original reduced gradient then be broadcast to each GPU?
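For reference, this is the single-device behavior I verified (a minimal sketch; the tensor values here are made up for illustration, not from my actual model):

```python
import torch

# A toy parameter; the hook scales the incoming gradient by 3.
w = torch.ones(2, requires_grad=True)
w.register_hook(lambda g: g * 3)

loss = (w * torch.tensor([1.0, 2.0])).sum()
loss.backward()

# Without the hook the gradient would be [1., 2.]; with it, [3., 6.].
print(w.grad)  # tensor([3., 6.])
```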
This topic seems to be related to your question and also has a code example.
I did some tests.
Tensor.register_hook() does work in DDP. I first wrap my model with DDP, then modify the grad manually with a custom function in the forward code where I construct my model:
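The setup looked roughly like this (a minimal CPU sketch using the gloo backend with world_size=1 just to exercise the DDP wrapper; the model, layer sizes, and input values are illustrative, not from my real code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1, bias=False)

    def modify_grad(self, g):
        return g * 3

    def forward(self, x):
        out = self.fc(x)
        # Register the hook on a tensor inside forward so it fires
        # on every backward pass.
        if out.requires_grad:
            out.register_hook(self.modify_grad)
        return out

# Single-process process group, enough to run the DDP-wrapped model.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(ToyModel())
loss = model(torch.ones(2, 4)).sum()
loss.backward()

# The hook triples the gradient flowing back through `out`, so the
# weight gradient ends up 3x what it would be without the hook.
print(model.module.fc.weight.grad)

dist.destroy_process_group()
```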
Here are the gradients after the backward pass.
If I don't modify the grad:

And if I do modify the grad:
You can see the gradients are exactly 3 times larger after being modified, which means .register_hook() works well in DDP.