Modifying gradients after the backward pass in DDP

Hello,
I’m trying to implement an optimizer algorithm that involves modifications in the gradients after the backward pass. When I replace the old gradients with the modified ones after the backward pass, Is it correct to assume that the values will get updated on the all the processes running across all the GPUs under DDP or will the updates take place only for one process?

Hey @anshul.singh, it depends on what you mean by “backward pass”. If you mean the point where all layers of your model have gone through backprop, then your optimizer should run on all ranks, since gradients are gradually synced among ranks during backprop. Having said that, I would suggest checking out our DDP communication hooks, which might be a better fit for your use case:
https://pytorch.org/docs/stable/ddp_comm_hooks.html
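With a communication hook you can fold the gradient modification into the sync step itself. Here is a minimal sketch, assuming a recent PyTorch version where `GradBucket.buffer()` is available; the clamping step is just a hypothetical placeholder for whatever your algorithm does:

```python
import torch
import torch.distributed as dist

def modify_grads_hook(state, bucket):
    # Average the bucketed gradients across all ranks, as DDP's default allreduce hook does.
    work = dist.all_reduce(bucket.buffer(), async_op=True)
    fut = work.get_future()

    def apply_modification(fut):
        grad = fut.value()[0] / dist.get_world_size()
        # Hypothetical modification: clamp the averaged gradients in place.
        return grad.clamp_(-1.0, 1.0)

    return fut.then(apply_modification)

# Register on the DDP-wrapped model; every rank runs the same hook on the same synced buckets.
# ddp_model.register_comm_hook(state=None, hook=modify_grads_hook)
```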

Thank you @cbalioglu for your response. I’m still in need of some more clarification; I’ll rephrase my question to convey it better.

I’m running DDP training on 3 GPUs with a batch size of 8. What I understand is that when I do a forward pass and calculate the loss, each of the three processes on the GPUs has its own loss. And when I do the backward pass (through all layers), the gradients on each process are calculated from its corresponding loss, but they all get synced by the end of the backward pass.

Now all GPUs have the same gradients. I make a copy of the gradients, apply some modifications, and replace the original values with the modified ones. Will this change happen only on one GPU or on all the GPUs?
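For reference, this is a minimal sketch of what I’m describing. It assumes `ddp_model`, `optimizer`, `criterion`, `inputs`, and `targets` come from my usual DDP setup, and `modify` is just a hypothetical stand-in for my actual gradient transformation:

```python
import torch

def modify(grad: torch.Tensor) -> torch.Tensor:
    # Hypothetical modification; stands in for my actual algorithm.
    return grad * 0.5

loss = criterion(ddp_model(inputs), targets)
loss.backward()  # DDP all-reduces the gradients across ranks during this call

with torch.no_grad():
    for p in ddp_model.parameters():
        if p.grad is not None:
            p.grad.copy_(modify(p.grad))  # replace the synced grads with modified ones

optimizer.step()
optimizer.zero_grad()
```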