Hi,
this is probably a very stupid question, but I would like to make sure I am not making a mistake.
Some part of my loss needs to be computed based on ALL samples in the batch, not just the ones allocated to each GPU. I can aggregate the values I need with all_gather or all_reduce and then compute my final loss. Will the gradients of that loss then properly ‘travel back’ to each individual GPU through the all_gather/all_reduce operation?
Thanks!
Best,
Fabian
@derJaeger when you refer to “travel back”, do you mean the gradients flowing back to each individual GPU? If so, the answer is no: the gradients will not automatically flow back to each GPU’s samples if you use the c10d collectives, because the c10d collectives are not autograd enabled yet.
We are working on making the c10d collectives autograd enabled. There is a version of the implementation that you can try and refer to here, but it is not publicly documented, has not been officially released, and is not well maintained, so use it at your own risk (we might delete it in a future release and make the c10d collectives directly autograd enabled). If you want to use it, I recommend referring to that implementation and writing your own version.
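Roughly, a hand-rolled version would look something like the sketch below (just an illustration, not the linked implementation; `AllGatherWithGrad` is a placeholder name, and the backward here uses all_reduce plus slicing rather than reduce-scatter, purely for readability):

```python
import torch
import torch.distributed as dist


class AllGatherWithGrad(torch.autograd.Function):
    """All-gather whose backward routes gradients back to the rank that produced each shard."""

    @staticmethod
    def forward(ctx, tensor):
        world_size = dist.get_world_size()
        gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
        dist.all_gather(gathered, tensor)
        # Stack into a single (world_size, *tensor.shape) tensor so autograd tracks one output.
        return torch.stack(gathered)

    @staticmethod
    def backward(ctx, grad_output):
        # Each rank only sees the gradient of its *local* loss w.r.t. the full
        # gathered tensor, so sum the contributions from all ranks and return
        # the slice corresponding to this rank's original input.
        grad_output = grad_output.contiguous()
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM)
        return grad_output[dist.get_rank()]
```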
Dear @wanchaol, thanks so much for your response! Great to hear you are on it. I have conducted some more research and believe I found what I was looking for:
I must confess I am not super knowledgeable about how gradients have to be handled in that situation, but this implementation appears to do what I need it to. Would you agree?
Best,
Fabian
Hi, I am wondering whether this is implemented now. I see that AllGather does have a backward implementation in the main branch now.
Does this mean that the gradients flow back to each GPU?
Yes, the backward of all-gather will be reduce-scatter.
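For example, a global-batch loss can be written roughly like this, assuming the autograd-enabled `torch.distributed.nn.functional.all_gather` is available in your build (it is only lightly documented, so double-check your PyTorch version; `contrastive_loss` below is just a placeholder):

```python
import torch
import torch.distributed.nn.functional as dist_nn_F


def global_batch_loss(local_embeddings, contrastive_loss):
    # Gather every rank's mini-batch into one global batch. The gathered
    # tensors keep a grad_fn, so backward reduce-scatters the summed
    # gradients back to the rank that produced the corresponding shard.
    gathered = dist_nn_F.all_gather(local_embeddings)   # tuple of (B, D) tensors
    all_embeddings = torch.cat(gathered, dim=0)         # (world_size * B, D)
    return contrastive_loss(all_embeddings)             # placeholder loss fn
```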