Is it possible to use the NCCL backend to train a DDP model and a new group with the Gloo backend to do a gather operation on CPU tensors?
I’ll try to illustrate my use case since there might be a cleaner/easier solution for it:
I have a DDP model, training it on N GPUs with nccl backend.
I have attached gradient hooks on the weight params of some layers, and I am storing these gradients in all processes.
After some time, I would like to gather the stored gradients from all processes into the main process to do some computation with them.
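For context, the hook setup described above can be sketched roughly like this (a hypothetical helper, not my exact code — names like attach_grad_store are made up for illustration; gradients are moved to CPU immediately to save GPU memory):

```python
import torch

def attach_grad_store(named_params):
    """Register a hook on each param that appends its gradient (on CPU)
    to a per-parameter list. Returns the dict holding those lists."""
    store = {}
    for name, p in named_params:
        store[name] = []

        def make_hook(key):
            def hook(grad):
                # detach and move to CPU so the store does not hold GPU memory
                store[key].append(grad.detach().cpu())
            return hook

        p.register_hook(make_hook(name))
    return store
```

After every backward pass, each tracked parameter gains one more CPU copy of its gradient in the store.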
Since gather is not supported by the NCCL backend, I tried to create a new group with the Gloo backend, but for some reason the process hangs when it reaches torch.distributed.gather(..., group=my_gloo_group).
Note: using all_gather with NCCL is not an option because the gradients are stored as CPU tensors. Using all_gather with Gloo is not an option either, since every rank would then have to hold world_size copies of this large gradient store, which does not fit in memory.
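For reference, creating the extra CPU group looks roughly like this (a minimal sketch, assuming the default NCCL process group has already been initialized; torch.distributed.new_group accepts a backend override):

```python
import torch.distributed as dist

def make_cpu_group():
    # Assumes the default group was already set up, e.g. with
    # dist.init_process_group("nccl", ...) for GPU training.
    # The backend override makes this group use Gloo, so collectives
    # like gather() can run on CPU tensors alongside the NCCL group.
    return dist.new_group(backend="gloo")
```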
I think I’ve managed to solve the issue. I didn’t know that torch.distributed.gather(...) also has to be called in the non-master processes. So the fix was basically changing this code snippet:
if torch.distributed.get_rank() == 0:
    gathered_data = [...]
    torch.distributed.gather(tensor=my_tensor, gather_list=gathered_data, group=gloo_group_handle)
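so that gather is called on every rank. A sketch of the corrected version (assuming my_tensor has the same shape on every rank; gather is a collective, so all ranks in the group must call it, and only the destination rank supplies gather_list):

```python
import torch
import torch.distributed as dist

def gather_gradients(my_tensor, gloo_group_handle):
    """Gather my_tensor from all ranks of the Gloo group onto rank 0.
    Returns the list of tensors on rank 0, None elsewhere."""
    if dist.get_rank(group=gloo_group_handle) == 0:
        # Destination rank provides one receive buffer per rank.
        gathered_data = [
            torch.empty_like(my_tensor)
            for _ in range(dist.get_world_size(group=gloo_group_handle))
        ]
        dist.gather(tensor=my_tensor, gather_list=gathered_data,
                    dst=0, group=gloo_group_handle)
        return gathered_data
    else:
        # Non-destination ranks must still call gather(), but pass
        # no gather_list; otherwise the collective deadlocks.
        dist.gather(tensor=my_tensor, dst=0, group=gloo_group_handle)
        return None
```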