Hi,
this is probably a very stupid question, but I would like to make sure I am not making a mistake.
Some part of my loss needs to be computed based on ALL samples in the batch, not just the ones allocated to each GPU. I can aggregate the values I need with all_gather or all_reduce and then compute my final loss. Will the gradients of that loss then properly ‘travel back’ to each individual GPU through the all_gather/all_reduce operation?
Thanks!
Best,
Fabian
@derJaeger when you refer to “travel back”, do you mean the gradients flowing back to each individual GPU? If so, the answer is no: the gradients will not automatically flow back to each GPU’s samples if you use the c10d collectives, because the c10d collectives are not autograd enabled yet.
We are working on making the c10d collectives autograd enabled. There is a version of the implementation that you can try and refer to here, but it is not publicly documented, has not been officially released, and is not well maintained, so use it at your own risk (we might delete it in a future release and make the c10d collectives directly autograd enabled). If you want to use it, I recommend referring to that implementation and writing your own version.
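Roughly, a hand-rolled version would look something like the sketch below (just an illustration, not the linked implementation; `AllGatherWithGrad` is a placeholder name, and the backward here uses all_reduce plus slicing rather than reduce-scatter, purely for readability):

```python
import torch
import torch.distributed as dist


class AllGatherWithGrad(torch.autograd.Function):
    """All-gather whose backward routes gradients back to the rank that produced each shard."""

    @staticmethod
    def forward(ctx, tensor):
        world_size = dist.get_world_size()
        gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
        dist.all_gather(gathered, tensor)
        # Stack into a single (world_size, *tensor.shape) tensor so autograd tracks one output.
        return torch.stack(gathered)

    @staticmethod
    def backward(ctx, grad_output):
        # Each rank only sees the gradient of its *local* loss w.r.t. the full
        # gathered tensor, so sum the contributions from all ranks and return
        # the slice corresponding to this rank's original input.
        grad_output = grad_output.contiguous()
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM)
        return grad_output[dist.get_rank()]
```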
Dear @wanchaol, thanks so much for your response! Great to hear you are on it. I have conducted some more research and believe I found what I was looking for:
I must confess I am not super knowledgeable about how gradients have to be handled in that situation, but this implementation appears to do what I need it to. Would you agree?
Best,
Fabian
Hi, I am wondering whether this is implemented now. I see that AllGather does have a backward implementation in the main branch now.
Does this mean that the gradients flow back to each GPU?
Yes, the backward of all-gather will be reduce-scatter.
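For example, a global-batch loss can be written roughly like this, assuming the autograd-enabled `torch.distributed.nn.functional.all_gather` is available in your build (it is only lightly documented, so double-check your PyTorch version; `contrastive_loss` below is just a placeholder):

```python
import torch
import torch.distributed.nn.functional as dist_nn_F


def global_batch_loss(local_embeddings, contrastive_loss):
    # Gather every rank's mini-batch into one global batch. The gathered
    # tensors keep a grad_fn, so backward reduce-scatters the summed
    # gradients back to the rank that produced the corresponding shard.
    gathered = dist_nn_F.all_gather(local_embeddings)   # tuple of (B, D) tensors
    all_embeddings = torch.cat(gathered, dim=0)         # (world_size * B, D)
    return contrastive_loss(all_embeddings)             # placeholder loss fn
```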