Using NCCL and Gloo

Hi everyone,

Is it possible to use the NCCL backend for training a DDP model and a new group with the Gloo backend to do a gather operation on CPU tensors?

I’ll try to illustrate my use case since there might be a cleaner/easier solution for it:

  1. I have a DDP model and I am training it on N GPUs with the NCCL backend.
  2. I have attached gradient hooks on the weight params of some layers, and I am storing these gradients in all processes.
  3. After some time, I would like to gather the stored gradients from all processes in the main process to do some computation with them.

Since gather is not supported by the NCCL backend, I’ve tried to create a new group with the Gloo backend, but for some reason the process hangs when it reaches torch.distributed.gather(..., group=my_gloo_group).

Note: using all_gather with NCCL is not an option because the gradients are stored as CPU tensors. Using all_gather with Gloo is not an option either, since storing world_size copies of this large “storage of gradients” on every process is not feasible.
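For context, my setup looks roughly like this (a simplified sketch, not my actual code: the toy Linear layer and the stored_grads buffer are just placeholders, and I assume an env:// launch, e.g. via torchrun, so LOCAL_RANK is set):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Default group on NCCL, used by DDP for GPU training (env:// init assumed).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy layer standing in for the real model.
model = DDP(torch.nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])

# Separate group on Gloo, used only for collectives over CPU tensors.
gloo_group = dist.new_group(backend="gloo")

# Gradient hook that stashes CPU copies of the weight gradients.
stored_grads = []
model.module.weight.register_hook(lambda g: stored_grads.append(g.detach().cpu()))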

Hi,

Could you share some code that reproduces the hang?

Hi @agolynski,

I think I’ve managed to solve the issue. I didn’t know that I had to call torch.distributed.gather(...) in the non-master processes as well. So the fix was basically changing this code snippet:

if torch.distributed.get_rank() == 0:
    gathered_data = [...]
    torch.distributed.gather(tensor=my_tensor, gather_list=gathered_data, group=gloo_group_handle)

to:

if torch.distributed.get_rank() == 0:
    # Master rank provides the buffers and receives a tensor from every rank.
    gathered_data = [...]
    torch.distributed.gather(tensor=my_tensor, gather_list=gathered_data, group=gloo_group_handle)
else:
    # Non-master ranks must also enter the collective and send their tensor.
    torch.distributed.gather(tensor=my_tensor, group=gloo_group_handle)

I thought that the non-master processes would somehow be magically pinged by the master process to send their tensors.

Thanks for the update!

The non-master processes need to know which tensor to send to the master, hence you need to call gather on them too.
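For anyone hitting the same hang, here is a minimal standalone sketch of the pattern (Gloo as the default backend for simplicity, two local processes, arbitrary port and tensor shape): every rank enters the gather, and only the destination rank supplies a gather_list.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Gloo backend, CPU tensors only; a stand-in for the extra Gloo group above.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )
    my_tensor = torch.full((3,), float(rank))

    if dist.get_rank() == 0:
        # Destination rank provides one buffer per rank.
        gathered = [torch.zeros(3) for _ in range(world_size)]
        dist.gather(tensor=my_tensor, gather_list=gathered, dst=0)
        print(gathered)
    else:
        # Every other rank must also enter the collective, without a gather_list.
        dist.gather(tensor=my_tensor, dst=0)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)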